LessWrong 2.0 Reader


[link] A Universal Emergent Decomposition of Retrieval Tasks in Language Models
Alexandre Variengien (alexandre-variengien) · 2023-12-19T11:52:27.354Z · comments (3)

MATS Winter 2023-24 Retrospective
utilistrutil · 2024-05-11T00:09:17.059Z · comments (28)

[link] [Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations
Teun van der Weij (teun-van-der-weij) · 2024-06-13T10:04:49.556Z · comments (10)

[link] "AI Safety for Fleshy Humans" an AI Safety explainer by Nicky Case
habryka (habryka4) · 2024-05-03T18:10:12.478Z · comments (10)

Sparse Autoencoders Work on Attention Layer Outputs
Connor Kissane (ckkissane) · 2024-01-16T00:26:14.767Z · comments (9)

AI #51: Altman’s Ambition
Zvi · 2024-02-20T19:50:07.439Z · comments (5)

Why you should be using a retinoid
GeneSmith · 2024-08-19T03:07:41.722Z · comments (57)

Retirement Accounts and Short Timelines
jefftk (jkaufman) · 2024-02-19T18:50:05.231Z · comments (35)

[link] What are you getting paid in?
Austin Chen (austin-chen) · 2024-07-17T19:23:04.219Z · comments (14)

OpenAI o1, Llama 4, and AlphaZero of LLMs
Vladimir_Nesov · 2024-09-14T21:27:41.241Z · comments (24)

Self-prediction acts as an emergent regularizer
Cameron Berg (cameron-berg) · 2024-10-23T22:27:03.664Z · comments (4)

Actually, Power Plants May Be an AI Training Bottleneck.
Lao Mein (derpherpize) · 2024-06-20T04:41:33.567Z · comments (13)

[link] What Depression Is Like
Sable · 2024-08-27T17:43:22.549Z · comments (23)

Some Vacation Photos
johnswentworth · 2024-01-04T17:15:01.187Z · comments (0)

AI #83: The Mask Comes Off
Zvi · 2024-09-26T12:00:08.689Z · comments (19)

Agent Boundaries Aren't Markov Blankets. [Unless they're non-causal; see comments.]
abramdemski · 2023-11-20T18:23:40.443Z · comments (11)

Coup probes: Catching catastrophes with probes trained off-policy
Fabien Roger (Fabien) · 2023-11-17T17:58:28.687Z · comments (7)

[link] Essay competition on the Automation of Wisdom and Philosophy — $25k in prizes
owencb · 2024-04-16T10:10:13.338Z · comments (12)

Release: Optimal Weave (P1): A Prototype Cohabitive Game
mako yass (MakoYass) · 2024-08-17T14:08:18.947Z · comments (21)

Refusal mechanisms: initial experiments with Llama-2-7b-chat
Andy Arditi (andy-arditi) · 2023-12-08T17:08:01.250Z · comments (7)

Decomposing the QK circuit with Bilinear Sparse Dictionary Learning
keith_wynroe · 2024-07-02T13:17:16.352Z · comments (7)

Bostrom Goes Unheard
Zvi · 2023-11-13T14:11:07.586Z · comments (9)

My Criticism of Singular Learning Theory
Joar Skalse (Logical_Lunatic) · 2023-11-19T15:19:16.874Z · comments (56)

AISafety.com – Resources for AI Safety
Søren Elverlin (soren-elverlin-1) · 2024-05-17T15:57:11.712Z · comments (3)

An Introduction To The Mandelbrot Set That Doesn't Mention Complex Numbers
Yitz (yitz) · 2024-01-17T09:48:07.930Z · comments (11)

[link] New voluntary commitments (AI Seoul Summit)
Zach Stein-Perlman · 2024-05-21T11:00:41.794Z · comments (17)

[link] Palworld development blog post
bhauth · 2024-01-28T05:56:19.984Z · comments (12)

Constructability: Plainly-coded AGIs may be feasible in the near future
Épiphanie Gédéon (joy_void_joy) · 2024-04-27T16:04:45.894Z · comments (13)

Values Are Real Like Harry Potter
johnswentworth · 2024-10-09T23:42:24.724Z · comments (17)

Survey of 2,778 AI authors: six parts in pictures
KatjaGrace · 2024-01-06T04:43:34.590Z · comments (1)

Studying The Alien Mind
Quentin FEUILLADE--MONTIXI (quentin-feuillade-montixi) · 2023-12-05T17:27:28.049Z · comments (10)

3C's: A Recipe For Mathing Concepts
johnswentworth · 2024-07-03T01:06:11.944Z · comments (5)

The Gemini Incident
Zvi · 2024-02-22T21:00:04.594Z · comments (19)

Self-Referential Probabilistic Logic Admits the Payor's Lemma
Yudhister Kumar (randomwalks) · 2023-11-28T10:27:29.029Z · comments (14)

[link] Not every accommodation is a Curb Cut Effect: The Handicapped Parking Effect, the Clapper Effect, and more
Michael Cohn (michael-cohn) · 2024-09-15T05:27:36.691Z · comments (39)

Announcing Athena - Women in AI Alignment Research
Claire Short (claire-short) · 2023-11-07T21:46:41.741Z · comments (2)

Thomas Kwa's research journal
Thomas Kwa (thomas-kwa) · 2023-11-23T05:11:08.907Z · comments (1)

How to prevent collusion when using untrusted models to monitor each other
Buck · 2024-09-25T18:58:20.693Z · comments (5)

[link] MIRI's May 2024 Newsletter
Harlan · 2024-05-15T00:13:30.153Z · comments (1)

The case for a negative alignment tax
Cameron Berg (cameron-berg) · 2024-09-18T18:33:18.491Z · comments (20)

Quick look: applications of chaos theory
Elizabeth (pktechgirl) · 2024-08-18T15:00:07.853Z · comments (45)

[Intuitive self-models] 2. Conscious Awareness
Steven Byrnes (steve2152) · 2024-09-25T13:29:02.820Z · comments (48)

[link] My thesis (Algorithmic Bayesian Epistemology) explained in more depth
Eric Neyman (UnexpectedValues) · 2024-05-09T19:43:16.543Z · comments (4)

New report: "Scheming AIs: Will AIs fake alignment during training in order to get power?"
Joe Carlsmith (joekc) · 2023-11-15T17:16:42.088Z · comments (26)

LessWrong Community Weekend 2024, open for applications
UnplannedCauliflower · 2024-05-01T10:18:21.992Z · comments (2)

[link] Is "superhuman" AI forecasting BS? Some experiments on the "539" bot from the Centre for AI Safety
titotal (lombertini) · 2024-09-18T13:07:40.754Z · comments (3)

EU policymakers reach an agreement on the AI Act
tlevin (trevor) · 2023-12-15T06:02:44.668Z · comments (7)

Secular interpretations of core perennialist claims
zhukeepa · 2024-08-25T23:41:02.683Z · comments (32)

Spaciousness In Partner Dance: A Naturalism Demo
LoganStrohl (BrienneYudkowsky) · 2023-11-19T07:00:19.555Z · comments (5)

[link] The Cognitive-Theoretic Model of the Universe: A Partial Summary and Review
jessicata (jessica.liu.taylor) · 2024-03-27T19:59:27.893Z · comments (36)


Recent comments

florian-habermacher on Review: “The Case Against Reality”

Great you bring up Hoffman; I think he deserves serious pushback.

He proves exactly two things:

1. Reality is indeed often not how it seems to us, since far too many people take his nonsense at face value. I would not normally use such words, but in his case there are reasons.
2. Insofar as he has come to truly believe everything he claims (I'm not convinced!), he'd be a perfect example of self-serving beliefs: his overblown claims managed to take over his brain just as it realized it could sell them to the world with total success, despite their absurdity.

Before I explain this harsh judgement, a caveat: I don't mean to defend what we perceive. Let's be open to a world very different from how it seems. Also, maybe Hoffman has many interesting points. But this doesn't mean his claims are not completely overblown, which I'm convinced they are, after having listened to a range of interviews with him and having gone to some lengths to read his original papers.

Here are three simple points I find compelling for putting his claims into perspective:

You find a yawning gap between his claims and what he has really 'proven' in his papers. Speech: "We have mathematically proven there's absolutely zero chance blabla". Reality: he used a trivial evolutionary toy model and found that a reduced-form representation of a very specific process may be more economical/probable than a more complex representation of the 'real' process. It nicely underlines that evolution may take shortcuts. Yes, we're crazy about sex instead of about "creating children", or we want to eat sugary stuff as an ancient proxy for an actually healthy diet, a proxy that no longer works in today's world, and there are many more cases where we've not evolved to perceive or account for all the complexity. The problem? This is of course nothing new and, more importantly, it doesn't prove anything more than that.
I like the following analogies:
1. Room-Mapping Robots vs. Non-Mapping Robot cleaners (Roomba stuff). A not too far-fetched interpretation of Hoffman would be: an (efficient) vacuum robot cannot map the room; it's always more efficient to simply have reduced-form rules/heuristics for where to move next. Well, it's nice to see how the market has evolved: semi-random moving robots made the start, but it turns out that if you want efficient robots, you make them actually map the territory, hence LiDAR/SLAM is becoming dominant today.
2. Being exposed to a cat, I realize she seems much more Hoffmanesque than us. When she pees on the ground, or smells another weird thing, she does her 'heap earth/sand over it' leg moves, not realizing there's just parquet so her move doesn't actually cover the stink. It's a bit funny, then, that with Hoffman, the species that has overcome reliance on un-understanding instincts in so many (not all) domains is the one that ends up claiming it would never be possible to overcome mere reduced-form instincts in any domain whatsoever.
Trivial disproof by contradiction of Hoffman's claim of having absolutely proven the world could not be the way we think: Assume the world WAS just how it looks to us. Imagine there WAS then the billion-year evolutionary process that we THINK has happened. Is there anything in Hoffman's proofs showing that there could then be only dumans, like humans but perceiving in 2d instead of 3d, or in some other wrong-way-with-no-resemblance-to-reality? Nope, absolutely not. His claims just obviously don't hold up.

Broader discussions that, I think, highlight in part a fraudulent aspect of Hoffman: "The Case For Reality", and the Quora thread "Is Donald Hoffman’s interface theory of perception true?"

In sum: His popularity proves an evolutionary theory of information in which what floats around is not what is shown to be correct but what is appealing, with the voices debunking it entirely ignored. I imagine him laughing about this fact when thinking about his own success: "After all, my claim seems to not be that wrong, they do not perceive reality, mahahaaa". According to Google, there are not merely a million people reading him, but literally millions of webpages featuring him.

Happy to be debunked in my negative reading of him :)

thomas-kwa on An alternative approach to superbabies

A third approach to superbabies: physically stick >10 infant human brains together while they are developing so they form a single individual with >10x as many neocortex neurons as the average human. Forget +7sd, extrapolation would suggest they are >100sd intelligence.

Even better, we could find some way of networking brains together into supercomputers using configurable software. This would reduce potential health problems and also allow us to harvest their waste energy. Though we would have to craft a simulated reality to distract the non-useful conscious parts of the computational substrate, perhaps modeled on the year 1999...

joshc on Anthropic: Three Sketches of ASL-4 Safety Case Components

Some comments I left as a reply to Roger Grosse on Twitter:

I'm glad you put this out. These examples align with my guess as to roughly what arguments will end up being most load bearing for mitigating scheming (white-box monitoring, black-box monitoring, and evidence against scheming being rewarded).
Here are some comments, since you mentioned comments would be helpful. I've bolded the comments that I think would be most useful for you to read.
Safety case sketch #1:
- There are some additional claims that this safety case relies on that could be good to mention explicitly. e.g. "egregious behavior would be reliably identified when developers analyze flagged sessions" and "developers will become aware of and stop unacceptable behavior *before* it leads to harmful outcomes."
- "Given our mitigations, the model cannot strategically condition its apparent capabilities, motivations, or behavioral proclivities on whether it’s in testing or deployment without being caught" <- I'm confused about why this claim is important given that you are monitoring at deployment-time. Even if the model is conditioning its behavior, my understanding is that you are hoping to catch it when it defects.
- Much of the evidence for the effectiveness of your white-box monitoring comes from your model organisms, of which you only have three. Three examples don't give you a lot of statistical power. Naively, I'd think there's still roughly a one third probability that your white-box monitoring completely fails in the actual deployment context even if it succeeds on three models.
Safety case sketch #2:
- You mention external deployments, but I didn't see discussion of jailbreaking.
- "[The safety case sketch] may require expensive pre-deployment testing if one needs to generate a very large number of simulated deployment situations in order to achieve a sufficiently good bound on the probability of defection." This sounds impractical. My guess is that you'll need to continuously sample real deployment trajectories to generate your honeypots (where a honeypot is an inserted dangerous action for your monitoring to flag). Otherwise, it seems unlikely to me that you can verify indistinguishability even with current models.
Safety case sketch #3:
- It's possible that all of the models that you experiment with are opaquely scheming. i.e. instead of comparing reward attained by models that engage in "deceptive reasoning" vs "no deceptive reasoning," you might be comparing models that engage in "CoT deceptive reasoning" vs "opaque deceptive reasoning." This experiment still sheds light on how helpful deceptive reasoning is for obtaining reward, but the argument in this case is different from the way you framed it.
- A lot hinges on the claim "if an environment rewards opaque deceptive reasoning, then it rewards scratchpad-based deceptive reasoning." Here's an idea for validating this. First, construct a distribution of tasks where deceptive reasoning is likely rewarded. For example, select tasks that you *know* can be gamed by explicitly reasoning about the training process. Then, train models to perform well on these tasks with and without additionally training them to reason deceptively in their chain of thought. If models that reason deceptively in CoT do better on average, this provides evidence for your claim that we should see deceptive reasoning in CoT if opaque deceptive reasoning is incentivized.
It's possible some of these comments were addressed in the post and I missed the part where they were addressed. I've erred on the side of inclusion here.
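
The statistical-power point under safety case sketch #1 (three model organisms, zero observed failures) can be made concrete with a rough calculation. The sketch below is not the commenter's method and does not reproduce the "one third" figure. It assumes, purely for illustration, that the three model organisms are independent trials drawn from the same distribution as the deployment model, which the comment does not claim. The function names are hypothetical. It computes a point estimate of the failure rate using Laplace's rule of succession and the largest failure rate still consistent with three clean results at 95% confidence.

```python
from fractions import Fraction


def laplace_failure_estimate(successes: int, failures: int = 0) -> Fraction:
    """Posterior mean failure rate under a uniform prior (rule of succession)."""
    trials = successes + failures
    return Fraction(failures + 1, trials + 2)


def upper_bound_failure_rate(successes: int, confidence: float = 0.95) -> float:
    """Largest failure rate p with (1 - p)^successes >= 1 - confidence.

    This is the one-sided upper bound when every trial succeeded.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / successes)


# Illustrative assumption: three model organisms, monitoring worked on all three.
estimate = laplace_failure_estimate(3)
bound = upper_bound_failure_rate(3)
print(f"point estimate: {estimate} ({float(estimate):.2f})")
print(f"95% upper bound: {bound:.2f}")
```

Under these assumptions the point estimate is 1/5 and the 95% upper bound is about 0.63, so a naive guess of roughly one third sits comfortably inside the range that three successes cannot rule out.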

deepthoughtlife on Scissors Statements for President?

While there are legitimate differences between the sides that matter quite a bit, I believe a large part of the reason candidates are like 'scissors statements' is that the median voter theorem actually kind of works. The parties see the need to move their candidates pretty far toward the current center, but they also know they will lose the extremists to not voting, or to voting third party, if they don't give them something to focus on. So both sides are literally optimizing for the scissors effect to keep their extremists engaged.

seth-herd on Scissors Statements for President?

I think this idea is worth exploring. The first bit seems pretty easy to convey and get people to listen to:

"try exploring the hypothesis that Y-voters are not so much in favor of Y, as against X -- and that you're right about the problems with Y...

But the second bit

... but they might be able to see something that you and almost everyone you talk to is systematically blinded to about X."

sounds like a very bitter pill to swallow, and therefore hard to get people to listen to.

I think motivated reasoning effects turn our attention quickly away from ideas we think are "bad" on an emotional level. These might be thought of as low-level ugh fields around those concepts. Steve Byrnes's excellent work on valence in the brain and mind can be read as an explanation for motivated reasoning and resulting polarization, and I highly recommend doing so. I had reached essentially identical conclusions after some years of studying cognitive biases from the perspective of brain mechanisms, but I haven't yet gotten around to the substantial task of writing it up well enough to be useful. I think it's by far the most important cognitive bias. Scott Alexander says in his review of the scout mindset:

Of the fifty-odd biases discovered by Kahneman, Tversky, and their successors, forty-nine are cute quirks, and one is destroying civilization. This last one is confirmation bias...

I think this is half right; motivated reasoning overlaps highly with confirmation bias, since most (all) of the things we believe currently are things we think are good to believe. But it's subtly different, particularly when we think about how to work around it, either in our own minds or when communicating with others.

For instance, deep canvassing appears to sidestep motivated reasoning by focusing on a personal connection, and to actually work to change minds on political issues (at least acceptance of LGBTQ issues), and according to scant available data it seems like the best known method for actually changing beliefs. It works on an emotional level, presenting no arguments, just a pleasant conversation with someone from a group. It lets people do the work of changing their own minds - as an honest, rational approach should. The specifics of deep canvassing might be limited to opinions about groups, but its success might be a guide to developing other approaches. Not directly asking someone to consider adopting a belief they dislike on an instinctive/unconscious level seems like a sensible starting point.

Applying that to your specific proposal: Perhaps something more like "Y-voters are not so much in favor of Y, as against X ... You probably agree that X can be a problem; they're just estimating it as way worse than you are. Here are some things they're worried about. Maybe they're wrong that those things could easily happen, but you can see why they'd want to prevent those things from happening."

This might work for some people, since they don't actually like the possible consequences, and don't have strong beliefs about the abstract or complex theories of how those very bad outcomes might come to pass.

That might still set off emotional/valence alarms if it brings up the concept of giving ground to one's opponents.

Anyway, I think it's possible to create useful political/cognitive discourse if it's done carefully and with an understanding of the psychological forces involved. I'd be interested in being involved if some LWers want to workshop ideas along these lines.

sharmake-farah on Anthropic: Three Sketches of ASL-4 Safety Case Components

I'm somewhat optimistic on this happening, conditional on considerable effort being invested.

As always, we will need more work on this agenda, and there will be more information about what control techniques work in practice, and which don't.

d0themath on The Median Researcher Problem

The technology's in a sweet spot where a custom statistical analysis needs to be developed, but it's also so important that the best minds will do that analysis and a community norm exists that we defer to them. Example: clinical trial results.

The argument seems to be about this stage, and from what I've heard clinical trials indeed take so much more time than is necessary. But maybe I've only heard about medical clinical trials, and actually academic biomedical clinical trials are incredibly efficient by comparison.

It also sounds like "community norm exists that we defer to [the best minds]" requires the community to identify who the best minds are, which presumably involves critiquing the research outputs of those best minds according to the standards of the median researcher, which often (though I don't know about biomedicine) ends up being something crazy like h-index or number of citations or number of papers or derivatives of such things.

dentin on Scissors Statements for President?

My belief is that it's primarily the voting system that causes this. (Not the electoral college; rather the whole 'first past the post' style of voting.) We see scissors presidents because that's the winning strategy.

I suspect that other more sophisticated voting systems (even just ranked choice!) would do better. No voting system is perfect, but 'first past the post' is particularly pathological.

eggsyntax on LLM Generality is a Timeline Crux

Thanks for sharing, I hadn't seen those yet! I've had too much on my plate since o1-preview came out to really dig into it, in terms of either playing with it or looking for papers on it.

How much does o1-preview update your view? It's much better at Blocksworld for example.

Quite substantially. Substantially enough that I'll add mention of these results to the post. I saw the near-complete failure of LLMs on obfuscated Blocksworld problems as some of the strongest evidence against LLM generality. Even more substantially since one of the papers is from the same team of strong LLM skeptics (Subbarao Kambhampati's) who produced the original results (I am restraining myself with some difficulty from jumping up and down and pointing at the level of goalpost-moving in the new paper).

There's one sense in which it's not an entirely apples-to-apples comparison, since o1-preview is throwing a lot more inference-time compute at the problem (in that way it's more like Ryan's hybrid approach to ARC-AGI). But since the key question here is whether LLMs are capable of general reasoning at all, that doesn't really change my view; certainly there are many problems (like capabilities research) where companies will be perfectly happy to spend a lot on compute to get a better answer.

Here's a first pass on how much this changes my numeric probabilities -- I expect these to be at least a bit different in a week as I continue to think about the implications (original text italicized for clarity):

LLMs continue to do better at block world and ARC as they scale: 75% -> 100%, this is now a thing that has happened (note that o1-preview also showed substantially improved results on ARC-AGI).
LLMs entirely on their own reach the grand prize mark on the ARC prize (solving 85% of problems on the open leaderboard) before hybrid approaches like Ryan's: 10% -> 20%, this still seems quite unlikely to me (especially since hybrid approaches have continued to improve on ARC). Most of my additional credence is on something like 'the full o1 turns out to already be close to the grand prize mark' and the rest on 'OpenAI capabilities researchers manage to use the full o1 to find an improvement to current LLM technique (eg a better prompting approach) that can be easily fixed'.
Scaffolding & tools help a lot, so that the next gen (GPT-5, Claude 4) + Python + a for loop can reach the grand prize mark: 60% -> 75% -- I'm tempted to put it higher, but it wouldn't be that surprising if o1-mark-2 didn't quite get there even with scaffolding/tools, especially since we don't have clear insight into how much harder the full test set is.
Same but for the gen after that (GPT-6, Claude 5): 75% -> 90%? I feel less sure about this one than the others; it sure seems awfully likely that o2 plus scaffolding will be able to do it! But I'm reluctant to go past 90% because progress could level off because of training data requirements, maybe the o1 -> o2 jump doesn't focus on optimizing for general reasoning, etc. It seems very plausible that I'll bump this higher on reflection.
The current architecture, including scaffolding & tools, continues to improve to the point of being able to do original AI research: 65%, with high uncertainty -> 80%. That sure does seem like the world we're living in. It's not clear to me that o1 couldn't already do original AI research with the right scaffolding. Sakana claims to have gotten there with GPT-4o / Sonnet, but their claims seem overblown to me.

Now that I've seen these, I'm going to have to think hard about whether my upcoming research projects in this area (including one I'm scheduled to lead a team on in the spring, uh oh) are still the right thing to pursue. I may write at least a brief follow-up post to this one arguing that we should all update on this question.

Thanks again, I really appreciate you drawing my attention to these.

buck on Anthropic: Three Sketches of ASL-4 Safety Case Components

What do you think of the arguments in this post that it's possible to make safety cases that don't rely on the model being unlikely to be a schemer?