LessWrong 2.0 Reader

AI #80: Never Have I Ever
Zvi · 2024-09-10T17:50:08.074Z · comments (20)

[link] Rational Animations' intro to mechanistic interpretability
Writer · 2024-06-14T16:10:57.015Z · comments (1)

Principled Satisficing To Avoid Goodhart
JenniferRM · 2024-08-16T19:05:27.204Z · comments (2)

AI #54: Clauding Along
Zvi · 2024-03-07T16:00:05.066Z · comments (11)

Startup Roundup #2
Zvi · 2024-08-06T13:30:06.554Z · comments (0)

Atlantis: Berkeley event venue available for rent
Jonas V (Jonas Vollmer) · 2023-11-22T01:47:12.026Z · comments (0)

[link] I'd also take $7 trillion
bhauth · 2024-02-19T03:31:45.552Z · comments (12)

[link] AlphaGeometry: An Olympiad-level AI system for geometry
alyssavance · 2024-01-17T17:17:30.913Z · comments (9)

AI #36: In the Background
Zvi · 2023-11-02T18:00:01.803Z · comments (5)

[question] Where might I direct promising-to-me researchers to apply for alignment jobs/grants?
abramdemski · 2023-09-18T16:20:03.452Z · answers+comments (10)

Some open-source dictionaries and dictionary learning infrastructure
Sam Marks (samuel-marks) · 2023-12-05T06:05:21.903Z · comments (7)

Monthly Roundup #18: May 2024
Zvi · 2024-05-13T12:30:04.863Z · comments (10)

AI #32: Lie Detector
Zvi · 2023-10-05T13:50:05.030Z · comments (19)

AI #53: One More Leap
Zvi · 2024-02-29T16:10:04.049Z · comments (0)

The Gemini Incident Continues
Zvi · 2024-02-27T16:00:05.648Z · comments (6)

On Tapping Out
Screwtape · 2023-11-17T03:23:55.880Z · comments (13)

A starting point for making sense of task structure (in machine learning)
Kaarel (kh) · 2024-02-24T01:51:49.227Z · comments (2)

[link] Towards Evaluating AI Systems for Moral Status Using Self-Reports
Ethan Perez (ethan-perez) · 2023-11-16T20:18:51.730Z · comments (3)

[link] NYT on the Manifest forecasting conference
Austin Chen (austin-chen) · 2023-10-09T21:40:16.732Z · comments (14)

Back to Basics: Truth is Unitary
lsusr · 2024-03-29T21:10:33.399Z · comments (13)

[link] Chinese scientists acknowledge xrisk & call for international regulatory body [Linkpost]
Akash (akash-wasil) · 2023-11-01T13:28:43.723Z · comments (4)

[link] Non-alignment project ideas for making transformative AI go well
Lukas Finnveden (Lanrian) · 2024-01-04T07:23:13.658Z · comments (1)

Laying the Foundations for Vision and Multimodal Mechanistic Interpretability & Open Problems
Sonia Joseph (redhat) · 2024-03-13T17:09:17.027Z · comments (13)

On Trust
johnswentworth · 2023-12-06T19:19:07.680Z · comments (26)

D&D.Sci Long War: Defender of Data-mocracy Evaluation & Ruleset
aphyer · 2024-05-14T03:35:10.586Z · comments (3)

On the Contrary, Steelmanning Is Normal; ITT-Passing Is Niche
Zack_M_Davis · 2024-01-09T23:12:20.349Z · comments (31)

[link] Against Student Debt Cancellation From All Sides of the Political Compass
Maxwell Tabarrok (maxwell-tabarrok) · 2024-05-13T14:55:57.525Z · comments (16)

Commonsense Good, Creative Good
jefftk (jkaufman) · 2023-09-27T19:50:07.486Z · comments (11)

[link] EPUBs of MIRI Blog Archives and selected LW Sequences
mesaoptimizer · 2023-10-26T14:17:11.538Z · comments (6)

[link] Level up your spreadsheeting
angelinahli · 2024-05-25T14:57:19.730Z · comments (11)

Userscript to always show LW comments in context vs at the top
Vlad Sitalo (harcisis) · 2023-11-21T17:53:30.418Z · comments (8)

Truthseeking, EA, Simulacra levels, and other stuff
Elizabeth (pktechgirl) · 2023-10-27T23:56:49.198Z · comments (12)

[link] LLM Evaluators Recognize and Favor Their Own Generations
Arjun Panickssery (arjun-panickssery) · 2024-04-17T21:09:12.007Z · comments (1)

[link] Amazon to invest up to $4 billion in Anthropic
Davis_Kingsley · 2023-09-25T14:55:35.983Z · comments (8)

Higher-Order Forecasts
ozziegooen · 2024-05-22T21:49:42.802Z · comments (1)

New intro textbook on AIXI
Alex_Altair · 2024-05-11T18:18:50.945Z · comments (8)

[link] Fluent dreaming for language models (AI interpretability method)
tbenthompson (ben-thompson) · 2024-02-06T06:02:59.296Z · comments (4)

Apply to LASR Labs: a London-based technical AI safety research programme
Erin Robertson · 2024-04-09T17:34:06.847Z · comments (1)

Auditing failures vs concentrated failures
ryan_greenblatt · 2023-12-11T02:47:35.703Z · comments (0)

Economics Roundup #3
Zvi · 2024-09-10T13:50:06.955Z · comments (9)

Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural
Rubi J. Hudson (Rubi) · 2024-07-16T22:44:17.128Z · comments (27)

Start an Upper-Room UV Installation Company?
jefftk (jkaufman) · 2024-10-19T02:00:10.691Z · comments (9)

AI #38: Let’s Make a Deal
Zvi · 2023-11-16T19:50:05.442Z · comments (2)

[link] Open Sourcing Metaculus
ChristianWilliams · 2024-07-02T22:30:01.339Z · comments (0)

What does davidad want from «boundaries»?
Chipmonk · 2024-02-06T17:45:42.348Z · comments (1)

Case Study: Interpreting, Manipulating, and Controlling CLIP With Sparse Autoencoders
Gytis Daujotas (gytis-daujotas) · 2024-08-01T21:08:38.800Z · comments (6)

The Third Fundamental Question
Screwtape · 2024-11-15T04:01:33.770Z · comments (2)

ProLU: A Nonlinearity for Sparse Autoencoders
Glen Taggart · 2024-04-23T14:09:21.592Z · comments (4)

Announcing Atlas Computing
miyazono · 2024-04-11T15:56:31.241Z · comments (4)

In defense of technological unemployment as the main AI concern
tailcalled · 2024-08-27T17:58:01.992Z · comments (36)

Recent comments

matrice-jacobine on Proposing the Conditional AI Safety Treaty (linkpost TIME)

Fortunately, the existential risks posed by AI are recognized by many close to President-elect Donald Trump. His daughter Ivanka seems to see the urgency of the problem. Elon Musk, a critical Trump backer, has been outspoken about the civilizational risks for many years, and recently supported California’s legislative push to safety-test AI. Even the right-wing Tucker Carlson provided common-sense commentary when he said: “So I don’t know why we’re sitting back and allowing this to happen, if we really believe it will extinguish the human race or enslave the human race. Like, how can that be good?” For his part, Trump has expressed concern about the risks posed by AI, too.

This is a strange contrast with the rest of the article, considering that both Donald and Ivanka Trump's positions are largely informed by the "situational awareness" view, which argues that the US should develop AGI before China to ensure US victory over China. That is explicitly the position Tegmark and Leahy argue against (and consider existentially harmful) when they call for stopping work on AGI in favor of international co-operation to restrict it and developing tool AI instead.

I still see this kind of confusion between the two positions a fair bit and it is extremely strange. It's as if, back in the original Cold War, people couldn't tell the difference between anti-communist hawks and the Bulletin of the Atomic Scientists (let alone anti-war hippies) because technically they both considered the nuclear arms race to be very important for the future of humanity.

abramdemski on o1 is a bad idea

Process supervision would be helpful, particularly if the model is punished for sensitive dependence on paraphrasings of the CoT, or rewarded for producing CoT which is useful even after paraphrasing.

One of the concerns that I have is that (I believe) the LLM before the CoT-training will be sensitive to some non-human-readable patterns, similar to adversarial examples for image recognition. The CoT training will encourage any patterns which help reach the desired conclusions, whether those patterns are human-readable or not.

A lot of possible process-supervision procedures will not do anything to help prevent this. For example, rewarding CoT for being human-understandable won't help, because the reward signal there can miss the subtle patterns being used. It only avoids the outcome where the CoT looks non-human-readable; it doesn't avoid the outcome where the CoT actually contains a bunch of human-unreadable patterns.

Using paraphrasing to avoid this is heavily dependent on the strength of the paraphraser. The paraphraser needs to have really good coverage of alternative ways to convey the same semantic information, so that any non-human-readable patterns get randomized. The paraphraser itself can't be vulnerable to steering by adversarial inputs.
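As a rough illustration of the paraphrase check described above, here is a minimal sketch. Everything in it is an assumption for illustration: `paraphrase_consistency`, the toy "models", and the toy paraphrasers are hypothetical stand-ins, not a real training setup or any lab's actual procedure. It only shows the shape of the measurement: how often the answer survives when the CoT is reworded.

```python
from typing import Callable, Sequence


def paraphrase_consistency(
    answer_from_cot: Callable[[str], str],
    cot: str,
    paraphrasers: Sequence[Callable[[str], str]],
) -> float:
    """Fraction of paraphrased CoTs that yield the same answer as the original CoT."""
    if not paraphrasers:
        raise ValueError("need at least one paraphraser")
    baseline = answer_from_cot(cot)
    matches = sum(answer_from_cot(p(cot)) == baseline for p in paraphrasers)
    return matches / len(paraphrasers)


# Hypothetical toy "model" that keys on a surface pattern (a double space)
# instead of the meaning of the text.
def brittle_model(cot: str) -> str:
    return "yes" if "  " in cot else "no"


# Hypothetical toy "model" that keys on the stated conclusion.
def robust_model(cot: str) -> str:
    return "yes" if "therefore yes" in cot.lower() else "no"


# Hypothetical toy paraphrasers; a real one would need far better coverage.
def normalize_whitespace(text: str) -> str:
    return " ".join(text.split())


def upper_case(text: str) -> str:
    return text.upper()


cot = "The sum is even,  therefore yes."
paraphrasers = [normalize_whitespace, upper_case]

brittle_score = paraphrase_consistency(brittle_model, cot, paraphrasers)
robust_score = paraphrase_consistency(robust_model, cot, paraphrasers)
print(brittle_score, robust_score)
```

Note that `upper_case` leaves the double space intact, so it fails to expose the brittle model's hidden dependence. That is the coverage problem in miniature: a paraphraser only randomizes the patterns it actually varies.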

benito on Lao Mein's Shortform

In the email above, clearly stated, is a line of reasoning that has led very competent people to work extremely hard to build potentially-omnicidal machines.

paulbecon on Beyond the Reach of God

in 2015 a Nature paper literally titled “Undecidability of the spectral gap” showed that an important physical quantity—the difference in energy between ground state and the initial excited state of a material—is formally undecidable. It is also triggered by recursion (basically, they encode questions about spectral gaps into spectral gaps). [Quoted from Erik Hoel's 2024 post on Consciousness as a Godel Sentence]

daniel-kokotajlo on 5 ways to improve CoT faithfulness

I encourage you to make this point into a blog post.

sharmake-farah on dirk's Shortform

Why do you think that LLMs will hit a wall in the future?

linch on The Median Researcher Problem

Again using the replication crisis as an example, you may have noticed the very wide (like, 1 sd or more) average IQ gap between students in most fields which turned out to have terrible replication rates and most fields which turned out to have fine replication rates.

This is rather weak evidence for your claim ("memeticity in a scientific field is mostly determined, not by the most competent researchers in the field, but instead by roughly-median researchers"), unless you also posit an additional mechanism, such as fields with terrible replication rates having a higher standard deviation than fields without them (why?).

abramdemski on o1 is a bad idea

I more-or-less agree with Eliezer's comment (to the extent that I have the data necessary to evaluate his words, which is greater than most, but still, I didn't know him in 1996). I have a small beef with his bolded "MIRI is always in every instance" claim, because a universal like that is quite a strong claim, and I would be very unsurprised to find a single counterexample somewhere (particularly if we include every MIRI employee and everything they've ever said while employed at MIRI).

What I am trying to say is something looser and more gestalt. I do think what I am saying contains some disagreement with some spirit-of-MIRI, and possibly some specific others at MIRI, such that I could say I've updated on the modern progress of AI in a different way than they have.

For example, in my update, the modern progress of LLMs points towards the Paul side of some Eliezer-Paul debates. (I would have to think harder about how to spell out exactly which Eliezer-Paul debates.)

One thing I can say is that I myself often argued using "naive misinterpretation"-like cases such as the paperclip example. However, I was also very aware of the Eliezer-meme "the AI will understand what the humans mean, it just won't care". I would have predicted difficulty in building a system which correctly interprets and correctly cares about human requests to the extent that GPT4 does.

This does not mean that AI safety is easy, or that it is solved; only that it is easier than I anticipated at this particular level of capability.

Getting more specific to what I wrote in the post:

My claim is that modern LLMs are "doing roughly what they seem like they are doing" and "internalize human intuitive concepts". This does include some kind of claim that these systems are more-or-less ethical (they appear to be trying to be helpful and friendly, therefore they "roughly are").

The reason I don't think this contradicts Eliezer's bolded claim ("Getting a shape into the AI's preferences is different from getting it into the AI's predictive model") is that I read Eliezer as talking about strongly superhuman AI with this claim. It is not too difficult to get something into the values of some basic reinforcement learning agent, to the extent that something like that has values worth speaking of. It gets increasingly difficult as the agent gets cleverer. At the level of intelligence of, say, GPT4, there is not a clear difference between getting the LLM to really care about something vs merely getting those values into its predictive model. It may be deceptive or honest; or, it could even be meaningless to classify it as deceptive or honest. This is less true of o1, since we can see it actively scheming to deceive.

satron on Simple probes can catch sleeper agents

A recent paper has concluded that this solution (simple probes) doesn't generalize to unintended deception. Do you think this is concerning?

abandon on dirk's Shortform

I fully expect LLMs to hit a wall (if not now then in the future), but for any specific claims about timing, it's worth remembering that people frequently claim it's happening soon/has already happened, and will be wrong every time but one. Some past examples:
Facebook's then-head of AI, December 2019: https://www.wired.com/story/facebooks-ai-says-field-hit-wall/
Gary Marcus, March 2022: https://nautil.us/deep-learning-is-hitting-a-wall-238440/
Donald Hobson, August 2022: https://www.lesswrong.com/posts/gqqhYijxcKAtuAFjL/a-data-limited-future
Epoch AI, November 2022 (estimating high-quality language data exhausted by 2024; in 2024 they updated their projection to 2028): https://epoch.ai/blog/will-we-run-out-of-ml-data-evidence-from-projecting-dataset
Will Eden, February 2023 (thread): https://x.com/WilliamAEden/status/1630690003830599680
Sam Altman, April 2023: https://www.wired.com/story/openai-ceo-sam-altman-the-age-of-giant-ai-models-is-already-over/
Justis Mills, June 2024: https://www.lesswrong.com/posts/axjb7tN9X2Mx4HzPz/the-data-wall-is-important (he cites Leopold Aschenbrenner for more detail, but Aschenbrenner himself is optimistic so I didn't link directly)