LessWrong 2.0 Reader

Do not delete your misaligned AGI.
mako yass (MakoYass) · 2024-03-24T21:37:07.724Z · comments (13)

[link] on bacteria, on teeth
bhauth · 2024-09-30T15:56:56.830Z · comments (9)

Don't sleep on Coordination Takeoffs
trevor (TrevorWiesinger) · 2024-01-27T19:55:26.831Z · comments (24)

[Intuitive self-models] 6. Awakening / Enlightenment / PNSE
Steven Byrnes (steve2152) · 2024-10-22T13:23:08.836Z · comments (8)

[link] Twitter thread on AI safety evals
Richard_Ngo (ricraz) · 2024-07-31T00:18:14.076Z · comments (3)

A framework for thinking about AI power-seeking
Joe Carlsmith (joekc) · 2024-07-24T22:41:01.685Z · comments (15)

[link] Slightly More Than You Wanted To Know: Pregnancy Length Effects
JustisMills · 2024-10-21T01:26:02.030Z · comments (4)

Catastrophic Goodhart in RL with KL penalty
Thomas Kwa (thomas-kwa) · 2024-05-15T00:58:20.763Z · comments (10)

Why our politicians aren't Median
Yair Halberstadt (yair-halberstadt) · 2024-11-03T14:03:33.779Z · comments (15)

[link] Dario Amodei — Machines of Loving Grace
Matrice Jacobine · 2024-10-11T21:43:31.448Z · comments (26)

We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To
robertzk (Technoguyrob) · 2024-03-06T05:03:09.639Z · comments (0)

On coincidences and Bayesian reasoning, as applied to the origins of COVID-19
viking_math · 2024-02-19T01:14:06.772Z · comments (28)

[link] "We know how to build AGI" - Sam Altman
Nikola Jurkovic (nikolaisalreadytaken) · 2025-01-06T02:05:05.134Z · comments (5)

AI #55: Keep Clauding Along
Zvi · 2024-03-14T15:40:09.335Z · comments (16)

RTFB: California’s AB 3211
Zvi · 2024-07-30T13:10:03.853Z · comments (2)

[link] Electrostatic Airships?
DaemonicSigil · 2024-10-27T04:32:34.852Z · comments (13)

[link] Zen and The Art of Semiconductor Manufacturing
Recurrented (rachel-farley) · 2024-12-09T17:19:35.236Z · comments (2)

Managing catastrophic misuse without robust AIs
ryan_greenblatt · 2024-01-16T17:27:31.112Z · comments (17)

Inspired by: Failures in Kindness
X4vier · 2024-07-27T01:21:42.848Z · comments (2)

[link] Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT
Robert_AIZI · 2024-03-05T13:55:33.483Z · comments (24)

MATS Alumni Impact Analysis
utilistrutil · 2024-09-30T02:35:57.273Z · comments (7)

Managing risks while trying to do good
Wei Dai (Wei_Dai) · 2024-02-01T18:08:46.506Z · comments (26)

Cognitive Work and AI Safety: A Thermodynamic Perspective
Daniel Murfet (dmurfet) · 2024-12-08T21:42:17.023Z · comments (9)

ReSolsticed vol I: "We're Not Going Quietly"
Raemon · 2024-12-26T17:52:33.727Z · comments (4)

Natural Latents Are Not Robust To Tiny Mixtures
johnswentworth · 2024-06-07T18:53:36.643Z · comments (8)

Why imperfect adversarial robustness doesn't doom AI control
Buck · 2024-11-18T16:05:06.763Z · comments (25)

Training AI agents to solve hard problems could lead to Scheming
Marius Hobbhahn (marius-hobbhahn) · 2024-11-19T00:10:55.522Z · comments (12)

A civilization ran by amateurs
Olli Järviniemi (jarviniemi) · 2024-05-30T17:57:32.601Z · comments (7)

Showing SAE Latents Are Not Atomic Using Meta-SAEs
Bart Bussmann (Stuckwork) · 2024-08-24T00:56:46.048Z · comments (9)

[link] DeepMind: Evaluating Frontier Models for Dangerous Capabilities
Zach Stein-Perlman · 2024-03-21T03:00:31.599Z · comments (8)

AI #78: Some Welcome Calm
Zvi · 2024-08-22T14:20:10.812Z · comments (15)

[link] electric turbofans
bhauth · 2024-11-02T22:50:59.807Z · comments (2)

Balancing Games
jefftk (jkaufman) · 2024-02-24T14:40:04.237Z · comments (18)

Book Review: On the Edge: The Future
Zvi · 2024-09-27T14:00:05.279Z · comments (1)

Checking in on Scott's composition image bet with imagen 3
Dave Orr (dave-orr) · 2024-12-22T19:04:17.495Z · comments (0)

What is SB 1047 *for*?
Raemon · 2024-09-05T17:39:39.871Z · comments (8)

[question] We might be dropping the ball on Autonomous Replication and Adaptation.
Charbel-Raphaël (charbel-raphael-segerie) · 2024-05-31T13:49:11.327Z · answers+comments (30)

Consider the humble rock (or: why the dumb thing kills you)
pleiotroth · 2024-07-04T13:54:15.593Z · comments (11)

Offering AI safety support calls for ML professionals
Vael Gates · 2024-02-15T23:48:12.797Z · comments (1)

There Should Be More Alignment-Driven Startups
Vaniver · 2024-05-31T02:05:06.799Z · comments (14)

A case for donating to AI risk reduction (including if you work in AI)
tlevin (trevor) · 2024-12-02T19:05:06.658Z · comments (2)

0th Person and 1st Person Logic
Adele Lopez (adele-lopez-1) · 2024-03-10T00:56:14.446Z · comments (28)

Pollsters Should Publish Question Translations
jefftk (jkaufman) · 2024-09-08T22:10:04.932Z · comments (3)

Against empathy-by-default
Steven Byrnes (steve2152) · 2024-10-16T16:38:49.926Z · comments (24)

[link] In Defense of Epistemic Empathy
Kevin Dorst · 2023-12-27T16:27:06.320Z · comments (19)

[link] Is Claude a mystic?
jessicata (jessica.liu.taylor) · 2024-06-07T04:27:09.118Z · comments (23)

Base LLMs refuse too
Connor Kissane (ckkissane) · 2024-09-29T16:04:21.343Z · comments (20)

An Actually Intuitive Explanation of the Oberth Effect
Isaac King (KingSupernova) · 2024-01-10T20:23:17.216Z · comments (33)

Approaching Human-Level Forecasting with Language Models
Fred Zhang (fred-zhang) · 2024-02-29T22:36:34.012Z · comments (6)

[link] How do open AI models affect incentive to race?
jessicata (jessica.liu.taylor) · 2024-05-07T00:33:20.658Z · comments (13)

Recent comments

sheikh-abdur-raheem-ali on Sheikh Abdur Raheem Ali's Shortform

I've tried speaking with a few teams doing AI safety work, including:
• assistant professor leading an alignment research group at a top university who is starting a new AI safety org
• anthropic independent contractor who has coauthored papers with the alignment science team
• senior manager at nvidia working on LLM safety (NeMo-Aligner/NeMo-Guardrails)
• leader of a lab doing interoperability between EU/Canada AI standards
• ai policy fellow at US Senate working on biotech strategies
• executive director of an ai safety coworking space who has been running weekly meetups for ~2.5 years
• startup founder in stealth who asked not to share details with anyone outside CAISI
• chemistry olympiad gold medalist working on a dangerous capabilities evals project for o3
• mats alumni working on jailbreak mitigation at an ai safety & security org
• ai safety research lead running a mechinterp reading group and interning at EleutherAI

Some random brief thoughts:
• CAISI's focus seems to be on stuff other than x-risks (i.e., misinformation, healthcare, privacy).
• I'm afraid of being too unfiltered and causing offence.
• Some of the statements made in the interviews are bizarrely devoid of content, such as:

"AI safety work is not only a necessity to protect our social advances, but also essential for AI itself to remain a meaningful technology."

• Others seem to be false as stated, such as:

"our research on privacy-preserving AI led us to research machine unlearning — how to remove data from AI systems — which is now an essential consideration for deploying large-scale AI systems like chatbots."

• (I think a lot of unlearning research is bullshit, but besides that, is anyone deploying large models doing unlearning?)
• The UK AISI research agendas seemed a lot more coherent with better developed proposals and theories of impact.
• They're only recruiting for 3 positions for a research council that meets once a month?
• $27M of CAISI's initial funding is ~15% of the UK AISI's GBP 100m initial funding
• Another source says $50m CAD, but that's distributed over 5 years compared to a $2.4b budget for AI in general, so about 2% of the AI budget goes to safety?
• I was looking for scientific advancements which would be relevant at the national scale. I read through every page of anthropic/redwood's alignment faking paper, which is considered the best empirical alignment research paper of 2024, but it was a firehose of info and I don't have clear recommendations that can be put into a slide deck.
• Instead of learning more about what other people were doing on a shallow level it might've been more beneficial to focus on my own research questions or practice training project relevant skills.

ryan_greenblatt on What Indicators Should We Watch to Disambiguate AGI Timelines?

Yes. Though notably, if your employees were 10x faster you might want to adjust your workflows to have them spend less time being bottlenecked on compute if that is possible. (And this sort of adaptation is included in what I mean.)

nate-showell on Rebuttals for ~all criticisms of AIXI

The uncomputability of AIXI is a bigger problem than this post makes it out to be. This uncomputability inserts a contradiction into any proof that relies on AIXI -- the same contradiction as in Goedel's Theorem. You can get around this contradiction by using approximations of AIXI instead, but the resulting proofs will be specific to those approximations, and you would need to prove additional theorems to transfer results between the approximations.

faul_sname on What Indicators Should We Watch to Disambiguate AGI Timelines?

I think I misunderstood what you were saying there - I interpreted it as something like

Currently, ML-capable software developers are quite expensive relative to the cost of compute. Additionally, many small experiments provide more novel and useful insights than a few large experiments. The top practically-useful LLM costs about 1% as much per hour to run as a ML-capable software developer, and that 100x decrease in cost and the corresponding switch to many small-scale experiments would likely result in at least a 10x increase in the speed at which novel, useful insights were generated.

But on closer reading I see you said (emphasis mine)

I was trying to argue (among other things) that scaling up basically current methods could result in an increase in productivity among OpenAI capabilities researchers at least equivalent to the productivity you'd get as if the human employees operated 10x faster. (In other words, 10x'ing this labor input.)

So if the employees spend 50% of their time waiting on training runs which are bottlenecked on company-wide availability of compute resources, and 50% of their time writing code, 10xing their labor input (i.e. the speed at which they write code) would result in about an 80% increase in their labor output. Which, to your point, does seem plausible.
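The 80% figure is an instance of Amdahl's law: only the coding half of the time gets faster, so the waiting half caps the overall gain. A minimal sketch of that arithmetic, assuming output is proportional to one over the total time per unit of work (the function name and parameters are hypothetical, chosen just for illustration):

```python
def labor_output_speedup(coding_fraction: float, coding_speedup: float) -> float:
    """Overall speedup when only the coding share of time is accelerated.

    Assumption (not from the thread): time waiting on compute is unchanged,
    and labor output scales as 1 / (total time per unit of work).
    """
    waiting_fraction = 1.0 - coding_fraction
    # New time per unit of work: waiting is untouched, coding shrinks.
    new_time = waiting_fraction + coding_fraction / coding_speedup
    return 1.0 / new_time


speedup = labor_output_speedup(coding_fraction=0.5, coding_speedup=10.0)
print(f"{speedup:.2f}x, i.e. about {100 * (speedup - 1):.0f}% more output")
# prints "1.82x, i.e. about 82% more output"
```

Under the same assumption, even an unboundedly large coding speedup tops out at 2x here, since the waiting half of the time never shrinks.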

quetzal_rainbow on How can humanity survive a multipolar AGI scenario?

I think a lot of thinking around multipolar scenarios suffers from the heuristic "solution in the shape of the problem", i.e. "multipolar scenario is when we have kinda aligned AI, but still die due to coordination failures, therefore, solution for multipolar scenarios should be about coordination".

I think the correct solution is to leverage available superintelligence in a nice unilateral way:

• D/acc - use superintelligence to put up as much defence as you can, starting from formal software verification and ending in spreading biodefence nanotech;
• Running away - if you set up a Moon/Mars/Jovian colony of nanotech-upgraded humans/uploads and pour available resources into defence, even if Earth explodes, humanity as a species survives.

hzn on Drake Thomas's Shortform

Do you have any thoughts on mechanism & whether prevention is actually worse independent of inconvenience?

zac-hatfield-dodds on (The) Lightcone is nothing without its people: LW + Lighthaven's big fundraiser

And I've received an email from Mieux Donner confirming Lucie's leg has been executed for 1,000€. Thanks to everyone involved!

If anyone else is interested in a similar donation swap, from either side, I'd be excited to introduce people or maybe even do this trick again :D

startattheend on The average rationalist IQ is about 122

I have more reasons for believing that Mensa members are below 130, but also for believing that they're above.

Below: Most online IQ tests are similar enough to the Mensa IQ test that the practice effect applies. And most people who obsess about their IQ scores probably take a lot of online IQ tests, memorizing most patterns (there's a limit to the practice effect, but it can still give you at least 10 points).

Above: Mensa tests for pattern recognition abilities, which in my experience correlate worse with academic performance than verbal abilities do. Pattern recognition abilities also select for people with autism (they tend to score about 20 points higher on RPM-like pattern recognition tests (matrices) than on other subtests). These people will be smarter than they sound, because their low verbal abilities make them appear stupid, even though their pattern recognition might be 2 standard deviations higher. So you get intelligent people with poor social skills, who sound much dumber than they are, and who tend to have more diagnoses than just autism. It's no wonder that these people go to forums like Mensa, or that they're less successful in life than their IQ would suggest. These people are also incredibly easy targets for the kind of people who go to r/iamverysmart so it's easy to build the public consensus that they're actually stupid, even when it isn't true.

However, in order for high intelligence to shine (and have worthy insights) even without formal education, IQs above 150 are likely needed. For in order to generate your own ideas and still be able to compete with the consensus (which is largely based off the theories of geniuses like Tesla, Einstein, von Neumann, Turing, Pavlov, etc.) you need to discover similar things yourself independently.

I think many rationalists are above 130. I don't like rationalist mentalities very much though. They seem to think that everything needs to have a source or a proof (a projected lack of confidence in their own discernment). They also tend to overestimate the value of knowledge (even sometimes using it as a synonym of intelligence). If somebody's IQ is, say, 110, I don't think they will ever have any great takes (even with years of studies) which a 140 IQ person couldn't run circles around given a week or two of thought. Ever seen somebody invest their whole life into something that you could dismantle or do better in 5 minutes? You could look at this and go "Rapid feedback is better because you approximate reality and update your beliefs faster, makes sense, but why overcompl- right, it's to make mone- to legitimize the only position in which they are thought to have value - because agile coaches are selling ideas/theory and rely on the illusion of substance of course"

thane-ruthenis on What Indicators Should We Watch to Disambiguate AGI Timelines?

Minor would count.

mishka on Rebuttals for ~all criticisms of AIXI

However, I don't view safe tiling as the primary obstacle to alignment. Constructing even a modestly superhuman agent which is aligned to human values would put us in a drastically stronger position and currently seems out of reach. If necessary, we might like that agent to recursively self-improve safely, but that is an additional and distinct obstacle. It is not clear that we need to deal with recursive self-improvement below human level.

I am not sure that treating recursive self-improvement via tiling frameworks is necessarily a good idea, but setting this aspect aside, one obvious weakness with this argument is that it mentions a superhuman case and a below human level case, but it does not mention the approximately human level case.

And it is precisely the approximately human level case where we have a lot to say about recursive self-improvement, and where it feels that avoiding this set of considerations would be rather difficult.

Humans often try to self-improve, and human-level software will have an advantage over humans at that.

Humans are self-improving in the cognitive sense by shaping their learning experiences, and also by controlling their nutrition and various psychoactive factors modulating cognition. The desire to become smarter and to improve various thinking skills is very common.

Human-level software would have a great advantage over humans at this, because it can hack at its own internals with great precision at the finest resolution and because it can do so in a reversible fashion (on a copy, or after making a backup), and so can do it in a relatively safe manner (whereas a human has difficulty hacking their own internals with the required precision and is also taking huge personal risks if the hacking is sufficiently radical).

Collective/multi-agent aspects are likely to be very important.

People are already talking about possibilities of "hiring human-level artificial software engineers" (and, by extension, human-level artificial AI researchers). The wisdom of having an agent form-factor here is highly questionable, but setting this aspect aside and focusing only on technical feasibility, we see the following.

One can hire multiple artificial software engineers with long-term persistence (of features, memory, state, and focus) into an existing team of human engineers. Some of those teams will work on making next generations of better artificial software engineers (and artificial AI researchers). So now we are talking about mixed teams with human and artificial members.

By definition, we can say that those artificial software engineers and artificial AI researchers have reached human level, if a team of those entities would be able to fruitfully work on the next generation of artificial software engineers and artificial AI researchers even in the absence of any human team members.

This multi-agent setup is even more important than individual self-improvement, because this is what the mainstream trend might actually be leaning towards, judging by some recent discussions. Here we are talking about a multi-agent setup, and about recursive self-improvement of the community of agents, rather than focusing on self-improvement of individual agents.

Current self-improvement attempts.

We actually do see a lot of experiments with various forms of recursive self-improvement even at the current below human level. We are just lucky that all those attempts have been saturating at reasonable levels so far.

We currently don't have a good enough understanding to predict when they will stop saturating, or what the dynamics would be when they do. But self-improvement by a community of approximately human-level artificial software engineers and artificial AI researchers competitive with top human software engineers and top human AI researchers seems unlikely to saturate (or, at least, we should seriously consider the possibility that it won't saturate).

At the same time, the key difficulties of AI existential safety are tightly linked to recursive self-modifications.

The most intractable aspect of the whole thing is how to preserve any properties indefinitely through radical self-modifications. I think this is the central difficulty of AI existential safety. Things will change unpredictably. How can one shape this unpredictable evolution so that some desirable invariants do hold?

These invariants would be invariant properties of the whole ecosystem, not of individual agents; they would be the properties of a rapidly changing world, not of a particular single system (unless one is talking about a singleton which is very much in control of everything). This seems to be quite central to our overall difficulty with AI existential safety.