LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

[link] AlignedCut: Visual Concepts Discovery on Brain-Guided Universal Feature Space
Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-09-14T23:23:26.296Z · comments (1)

[link] Jonothan Gorard:The territory is isomorphic to an equivalence class of its maps
Daniel C (harper-owen) · 2024-09-07T10:04:47.840Z · comments (18)

OpenAI defected, but we can take honest actions
Remmelt (remmelt-ellen) · 2024-10-21T08:41:25.728Z · comments (15)

My career exploration: Tools for building confidence
lynettebye · 2024-09-13T11:37:55.843Z · comments (0)

[link] AI Safety Newsletter #39: Implications of a Trump Administration for AI Policy Plus, Safety Engineering
Corin Katzke (corin-katzke) · 2024-07-29T17:50:52.454Z · comments (1)

[link] Why Swiss watches and Taylor Swift are AGI-proof
Kevin Kohler (KevinKohler) · 2024-09-05T13:23:27.033Z · comments (11)

[question] Is this voting system strategy proof?
Donald Hobson (donald-hobson) · 2024-09-06T20:44:46.691Z · answers+comments (9)

[question] Self-censoring on AI x-risk discussions?
Decaeneus · 2024-07-01T18:24:15.759Z · answers+comments (2)

[link] Instruction Following without Instruction Tuning
Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-09-24T13:49:09.078Z · comments (0)

Travel Buffer
jefftk (jkaufman) · 2024-07-06T02:20:02.723Z · comments (3)

Training a Sparse Autoencoder in < 30 minutes on 16GB of VRAM using an S3 cache
Louka Ewington-Pitsos (louka-ewington-pitsos) · 2024-08-24T07:39:00.057Z · comments (0)

Invitation to lead a project at AI Safety Camp (Virtual Edition, 2025)
Linda Linsefors · 2024-08-23T14:18:24.327Z · comments (2)

[link] Minimalist And Maximalist Type Systems
adamShimi · 2024-07-05T16:25:59.448Z · comments (6)

[question] Is there any rigorous work on using anthropic uncertainty to prevent situational awareness / deception?
David Scott Krueger (formerly: capybaralet) (capybaralet) · 2024-09-04T12:40:07.678Z · answers+comments (7)

[link] Will we ever run out of new jobs?
Kevin Kohler (KevinKohler) · 2024-08-19T15:04:03.849Z · comments (7)

OpenAI Boycott Revisit
Jake Dennie · 2024-07-22T01:44:55.094Z · comments (2)

Initial Experiments Using SAEs to Help Detect AI Generated Text
Aaron_Scher · 2024-07-22T05:16:20.516Z · comments (0)

[link] The Dumbification of our smart screens
Itay Dreyfus (itay-dreyfus) · 2024-07-04T06:32:36.672Z · comments (0)

[link] College technical AI safety hackathon retrospective - Georgia Tech
yix (Yixiong Hao) · 2024-11-15T00:22:53.159Z · comments (0)

Automating LLM Auditing with Developmental Interpretability
htlou · 2024-09-04T15:50:04.337Z · comments (0)

"Which Future Mind is Me?" Is a Question of Values
dadadarren · 2024-08-09T18:17:09.884Z · comments (12)

[link] Non-Transactional Compliments
Jonathan Moregård (JonathanMoregard) · 2024-08-09T13:42:16.471Z · comments (0)

[link] Announcing The Techno-Humanist Manifesto: A new philosophy of progress for the 21st century
jasoncrawford · 2024-07-08T16:33:02.194Z · comments (4)

All the Following are Distinct
Gianluca Calcagni (gianluca-calcagni) · 2024-08-02T16:35:51.815Z · comments (3)

The Residual Expansion: A Framework for thinking about Transformer Circuits
Daniel Tan (dtch1997) · 2024-08-02T11:04:56.347Z · comments (13)

An information-theoretic study of lying in LLMs
Annah (annah) · 2024-08-02T10:06:39.312Z · comments (0)

[link] Why good things often don’t lead to better outcomes
DMMF · 2024-09-19T16:37:07.778Z · comments (1)

Slave Morality: A place for every man and every man in his place
Martin Sustrik (sustrik) · 2024-09-19T04:20:04.491Z · comments (7)

Appealing to the Public
jefftk (jkaufman) · 2024-10-23T19:00:07.669Z · comments (0)

[question] Does the "ancient wisdom" argument have any validity? If a particular teaching or tradition is old, to what extent does this make it more trustworthy?
SpectrumDT · 2024-11-04T15:20:14.822Z · answers+comments (49)

[question] Is there a CFAR handbook audio option?
FinalFormal2 · 2024-10-26T17:08:36.480Z · answers+comments (0)

[link] My lukewarm take on GLP-1 agonists
George3d6 · 2024-08-26T12:34:27.929Z · comments (0)

Interview with Robert Kralisch on Simulators
WillPetillo · 2024-08-26T05:49:15.543Z · comments (0)

Hiring a writer to co-author with me (Spencer Greenberg for ClearerThinking.org)
spencerg · 2024-10-27T17:34:50.479Z · comments (0)

Physical Therapy Sucks (but have you tried hiding it in some peanut butter?)
Declan Molony (declan-molony) · 2024-09-10T05:54:47.000Z · comments (12)

Review: Dr Stone
ProgramCrafter (programcrafter) · 2024-09-29T10:35:53.175Z · comments (5)

[link] CultFrisbee
Gauraventh (aryangauravyadav) · 2024-08-11T21:36:36.550Z · comments (3)

My experience applying to MATS 6.0
mic (michael-chen) · 2024-07-18T19:02:21.849Z · comments (3)

Reducing global AI competition through the Commerce Control List and Immigration reform: a dual-pronged approach
Ben Smith (ben-smith) · 2024-09-03T05:28:24.549Z · comments (2)

[link] Announcing the AI Forecasting Benchmark Series | July 8, $120k in Prizes
ChristianWilliams · 2024-07-02T22:33:23.512Z · comments (0)

[link] Levers for Biological Progress - A Response to "Machines of Loving Grace"
Niko_McCarty (niko-2) · 2024-11-01T16:35:08.221Z · comments (0)

Activation Pattern SVD: A proposal for SAE Interpretability
Daniel Tan (dtch1997) · 2024-06-28T22:12:48.789Z · comments (2)

[link] Where is the Learn Everything System?
Shoshannah Tekofsky (DarkSym) · 2024-09-27T21:30:16.379Z · comments (8)

[link] Pronouns are Annoying
ymeskhout · 2024-09-18T13:30:04.620Z · comments (21)

2024 NYC Secular Solstice & Megameetup
Joe Rogero · 2024-11-12T17:46:18.674Z · comments (0)

Current Attitudes Toward AI Provide Little Data Relevant to Attitudes Toward AGI
Seth Herd · 2024-11-12T18:23:53.533Z · comments (2)

Join a LessWrong Team for the Unaging System Challenge
Crissman · 2024-10-23T06:01:08.018Z · comments (5)

Electric Grid Cyberattack: An AI-Informed Threat Model
moonlightmaze · 2024-11-11T21:34:17.190Z · comments (0)

Toward a taxonomy of cognitive benchmarks for agentic AGIs
Ben Smith (ben-smith) · 2024-06-27T23:50:11.714Z · comments (0)

Announcing the Ultimate Jailbreaking Championship
InnerHufflepuff (grayswan) · 2024-09-04T00:35:31.234Z · comments (1)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

abandon on dirk's Shortform

I more-or-less endorse the model described in larger language models may disappoint you [or, an eternally unfinished draft] [LW · GW], and moreover I think language is an inherently lossy instrument such that the minimally-lossy model won't have perfectly learned the causal processes or whatever behind its production.

akash-wasil on Making a conservative case for alignment

I agree with many points here and have been excited about AE Studio's outreach. Quick thoughts on China/international AI governance:

I think some international AI governance proposals have some sort of "kum ba yah, we'll all just get along" flavor/tone to them, or some sort of "we should do this because it's best for the world as a whole" vibe. This isn't even Dem-coded so much as it is naive-coded, especially in DC circles.
US foreign policy is dominated primarily by concerns about US interests. Other considerations can matter, but they are not the dominant driving force. My impression is that this is true within both parties (with a few exceptions).
I think folks interested in international AI governance should study international security agreements and try to get a better understanding of relevant historical case studies. Lots of stuff to absorb from the Cold War, the Iran Nuclear Deal, US-China relations over the last several decades, etc. (I've been doing this & have found it quite helpful.)
Strong Republican leaders can still engage in bilateral/multilateral agreements that serve US interests. Recall that Reagan negotiated arms control agreements with the Soviet Union, and the (first) Trump Administration facilitated the Abraham Accords. Being "tough on China" doesn't mean "there are literally no circumstances in which I would be willing to sign a deal with China." (But there likely does have to be a clear case that the deal serves US interests, has appropriate verification methods, etc.)

matrice-jacobine on Proposing the Conditional AI Safety Treaty (linkpost TIME)

Fortunately, the existential risks posed by AI are recognized by many close to President-elect Donald Trump. His daughter Ivanka seems to see the urgency of the problem. Elon Musk, a critical Trump backer, has been outspoken about the civilizational risks for many years, and recently supported California’s legislative push to safety-test AI. Even the right-wing Tucker Carlson provided common-sense commentary when he said: “So I don’t know why we’re sitting back and allowing this to happen, if we really believe it will extinguish the human race or enslave the human race. Like, how can that be good?” For his part, Trump has expressed concern about the risks posed by AI, too.

This is a strange contrast from the rest of the article, considering both Donald and Ivanka Trump's positions are largely informed by the "situational awareness" position arguing that the US should develop AGI before China to ensure US victory over China – which is explicitly the position Tegmark [LW · GW] and Leahy argue against (and consider existentially harmful) when they call to stop work on AGI and work on international co-operation to restrict it and develop tool AI instead.

I still see this kind of confusion between the two positions a fair bit and it is extremely strange. It's like if back in the original Cold War people couldn't tell the difference between anti-communist hawks and the Bulletin of the Atomic Scientists (let alone anti-war hippies) because technically they both considered nuclear arms race to be very important for the future of humanity.

abramdemski on o1 is a bad idea

Process supervision would be helpful, particularly if the model is punished for sensitive dependence on paraphrasings of the CoT, or rewarded for producing CoT which is useful even after paraphrasing.

One of the concerns that I have is that (I believe) the LLM before the CoT-training will be sensitive to some non-human-readable patterns, similar to adversarial examples for image recognition. The CoT training will encourage any patterns which help reach the desired conclusions, whether those patterns are human-readable or not.

A lot of possible process-supervision procedures will not do anything to help prevent this. For example, rewarding CoT for being human-understandable won't help, because the reward signal there can miss the subtle patterns being used. It only avoids the outcome where the CoT looks non-human-readable; it doesn't avoid the outcome where the CoT actually contains a bunch of human-unreadable patterns.

Using paraphrasing to avoid this is heavily dependent on the strength of the paraphraser. The paraphraser needs to have really good coverage of alternative ways to convey the same semantic information, so that any non-human-readable patterns get randomized. The paraphraser itself can't be vulnerable to steering by adversarial inputs.

benito on Lao Mein's Shortform

In the email above, clearly stated, is a line of reasoning that has lead very competent people to work extremely hard to build potentially-omnicidal machines.

paulbecon on Beyond the Reach of God

in 2015 a Nature paper literally titled “Undecidability of the spectral gap” showed that an important physical quantity—the difference in energy between ground state and the initial excited state of a material—is formally undecidable. It is also triggered by recursion (basically, they encode questions about spectral gaps into spectral gaps). [Quoted from Erik Hoel's 2024 post on Consciousness as a Godel Sentence]

daniel-kokotajlo on 5 ways to improve CoT faithfulness

I encourage you to make this point into a blog post.

sharmake-farah on dirk's Shortform

Why do you think that LLMs will hit a wall in the future?

linch on The Median Researcher Problem

Again using the replication crisis as an example, you may have noticed the very wide (like, 1 sd or more) average IQ gap between students in most fields which turned out to have terrible replication rates and most fields which turned out to have fine replication rates.

This is rather weak evidence for your claim ("memeticity in a scientific field is mostly determined, not by the most competent researchers in the field, but instead by roughly-median researchers"), unless you additionally posit an additional mechanism like fields with terrible replication rates have a higher standard deviation than fields without them (why?).

abramdemski on o1 is a bad idea

I more-or-less agree with Eliezer's comment (to the extent that I have the data necessary to evaluate his words, which is greater than most, but still, I didn't know him in 1996). I have a small beef with his bolded "MIRI is always in every instance" claim, because a universal like that is quite a strong claim, and I would be very unsurprised to find a single counterexample somewhere (particularly if we include every MIRI employee and everything they've ever said while employed at MIRI).

What I am trying to say is something looser and more gestalt. I do think what I am saying contains some disagreement with some spirit-of-MIRI, and possibly some specific others at MIRI, such that I could say I've updated on the modern progress of AI in a different way than they have.

For example, in my update, the modern progress of LLMs points towards the Paul side of some Eliezer-Paul debates. (I would have to think harder about how to spell out exactly which Eliezer-Paul debates.)

One thing I can say is that I myself often argued using "naive misinterpretation"-like cases such as the paperclip example. However, I was also very aware of the Eliezer-meme "the AI will understand what the humans mean, it just won't care". I would have predicted difficulty in building a system which correctly interprets and correctly cares about human requests to the extent that GPT4 does.

This does not mean that AI safety is easy, or that it is solved; only that it is easier than I anticipated at this particular level of capability.

Getting more specific to what I wrote in the post:

My claim is that modern LLMs are "doing roughly what they seem like they are doing" and "internalize human intuitive concepts". This does include some kind of claim that these systems are more-or-less ethical (they appear to be trying to be helpful and friendly, therefore they "roughly are").

The reason I don't think this contradicts with Eliezer's bolded claim ("Getting a shape into the AI's preferences is different from getting it into the AI's predictive model") is that I read Eliezer as talking about strongly superhuman AI with this claim. It is not too difficult to get something into the values of some basic reinforcement learning agent, to the extent that something like that has values worth speaking of. It gets increasingly difficult as the agent gets cleverer. At the level of intelligence of, say, GPT4, there is not a clear difference between getting the LLM to really care about something vs merely getting those values into its predictive model. It may be deceptive or honest; or, it could even be meaningless to classify it as deceptive or honest. This is less true of o1, since we can see it actively scheming to deceive [LW · GW].