LessWrong 2.0 Reader

Economics Roundup #3
Zvi · 2024-09-10T13:50:06.955Z · comments (9)

Start an Upper-Room UV Installation Company?
jefftk (jkaufman) · 2024-10-19T02:00:10.691Z · comments (9)

Minimal Motivation of Natural Latents
johnswentworth · 2024-10-14T22:51:58.125Z · comments (14)

Motivation control
Joe Carlsmith (joekc) · 2024-10-30T17:15:50.881Z · comments (7)

How difficult is AI Alignment?
Sammy Martin (SDM) · 2024-09-13T15:47:10.799Z · comments (6)

[link] Analyzing how SAE features evolve across a forward pass
bensenberner · 2024-11-07T22:07:02.827Z · comments (0)

Which LessWrong/Alignment topics would you like to be tutored in? [Poll]
Ruby · 2024-09-19T01:35:02.999Z · comments (12)

[link] cancer rates after gene therapy
bhauth · 2024-10-16T15:32:53.949Z · comments (0)

Time Efficient Resistance Training
romeostevensit · 2024-10-07T15:15:44.950Z · comments (8)

Australian AI Safety Forum 2024
Liam Carroll (liam-carroll) · 2024-09-27T00:40:11.451Z · comments (0)

Startup Success Rates Are So Low Because the Rewards Are So Large
AppliedDivinityStudies (kohaku-none) · 2024-10-10T20:22:01.557Z · comments (6)

Formalizing the Informal (event invite)
abramdemski · 2024-09-10T19:22:53.564Z · comments (0)

Unit economics of LLM APIs
dschwarz · 2024-08-27T16:51:22.692Z · comments (0)

MATS AI Safety Strategy Curriculum v2
DanielFilan · 2024-10-07T22:44:06.396Z · comments (6)

A Robust Natural Latent Over A Mixed Distribution Is Natural Over The Distributions Which Were Mixed
johnswentworth · 2024-08-22T19:19:28.940Z · comments (4)

[link] you should probably eat oatmeal sometimes
bhauth · 2024-08-25T14:50:37.570Z · comments (32)

[link] IAPS: Mapping Technical Safety Research at AI Companies
Zach Stein-Perlman · 2024-10-24T20:30:41.159Z · comments (12)

[link] Things I learned talking to the new breed of scientific institution
Abhishaike Mahajan (abhishaike-mahajan) · 2024-08-29T14:00:14.844Z · comments (6)

D&D Sci Coliseum: Arena of Data
aphyer · 2024-10-18T22:02:54.305Z · comments (23)

[link] An Interactive Shapley Value Explainer
James Stephen Brown (james-brown) · 2024-09-28T05:01:21.169Z · comments (9)

Reflections on the Metastrategies Workshop
gw · 2024-10-24T18:30:46.255Z · comments (5)

[link] What's important in "AI for epistemics"?
Lukas Finnveden (Lanrian) · 2024-08-24T01:27:06.771Z · comments (0)

[link] [Paper] Programming Refusal with Conditional Activation Steering
Bruce W. Lee (bruce-lee) · 2024-09-11T20:57:08.714Z · comments (0)

[link] Point of Failure: Semiconductor-Grade Quartz
Annapurna (jorge-velez) · 2024-09-30T15:57:40.495Z · comments (8)

[link] What Ketamine Therapy Is Like
Sable · 2024-11-11T11:09:08.602Z · comments (4)

[link] [Paper Blogpost] When Your AIs Deceive You: Challenges with Partial Observability in RLHF
Leon Lang (leon-lang) · 2024-10-22T13:57:41.125Z · comments (0)

Monthly Roundup #23: October 2024
Zvi · 2024-10-16T13:50:05.869Z · comments (12)

Winners of the Essay competition on the Automation of Wisdom and Philosophy
AI Impacts (AI Imacts) · 2024-10-28T17:10:04.272Z · comments (3)

[question] Implications of China's recession on AGI development?
Eric Neyman (UnexpectedValues) · 2024-09-28T01:12:36.443Z · answers+comments (3)

Metastatic Cancer Treatment Since 2010: The Success Stories
sarahconstantin · 2024-11-04T22:50:09.386Z · comments (0)

[Linkpost] Play with SAEs on Llama 3
Tom McGrath · 2024-09-25T22:35:44.824Z · comments (2)

2025 Color Trends
sarahconstantin · 2024-10-07T21:20:03.962Z · comments (7)

instruction tuning and autoregressive distribution shift
nostalgebraist · 2024-09-05T16:53:41.497Z · comments (5)

Anthropic rewrote its RSP
Zach Stein-Perlman · 2024-10-15T14:25:12.518Z · comments (19)

Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs
Kola Ayonrinde (kola-ayonrinde) · 2024-08-23T18:52:31.019Z · comments (5)

Signaling with Small Orange Diamonds
jefftk (jkaufman) · 2024-11-07T20:20:08.026Z · comments (1)

Are we dropping the ball on Recommendation AIs?
Charbel-Raphaël (charbel-raphael-segerie) · 2024-10-23T17:48:00.000Z · comments (16)

[link] Generative ML in chemistry is bottlenecked by synthesis
Abhishaike Mahajan (abhishaike-mahajan) · 2024-09-16T16:31:34.801Z · comments (2)

[link] Characterizing stable regions in the residual stream of LLMs
Jett Janiak (jett) · 2024-09-26T13:44:58.792Z · comments (4)

Book Review: On the Edge: The Business
Zvi · 2024-09-25T12:20:06.230Z · comments (0)

[link] AISafety.info: What is the "natural abstractions hypothesis"?
Algon · 2024-10-05T12:31:14.195Z · comments (2)

0.202 Bits of Evidence In Favor of Futarchy
niplav · 2024-09-29T21:57:59.896Z · comments (0)

[Intuitive self-models] 7. Hearing Voices, and Other Hallucinations
Steven Byrnes (steve2152) · 2024-10-29T13:36:16.325Z · comments (2)

Compelling Villains and Coherent Values
Cole Wyeth (Amyr) · 2024-10-06T19:53:47.891Z · comments (4)

Open Source Replication of Anthropic’s Crosscoder paper for model-diffing
Connor Kissane (ckkissane) · 2024-10-27T18:46:21.316Z · comments (1)

[link] An X-Ray is Worth 15 Features: Sparse Autoencoders for Interpretable Radiology Report Generation
hugofry · 2024-10-07T08:53:14.658Z · comments (0)

Glitch Token Catalog - (Almost) a Full Clear
Lao Mein (derpherpize) · 2024-09-21T12:22:16.403Z · comments (3)

AI Safety Camp 10
Robert Kralisch (nonmali-1) · 2024-10-26T11:08:09.887Z · comments (7)

COT Scaling implies slower takeoff speeds
Logan Zoellner (logan-zoellner) · 2024-09-28T16:20:00.320Z · comments (56)

A New Class of Glitch Tokens - BPE Subtoken Artifacts (BSA)
Lao Mein (derpherpize) · 2024-09-20T13:13:26.181Z · comments (7)

Recent comments

sunwillrise on AI Craftsmanship

So, where are the Knuths of the modern era? Why is modern AI dominated by the Lorem Epsoms of the world? Where is the craftsmanship? Why are our AI tools optimized for seeming good, rather than being good?
[2] Remember back in 2013 when the talk of the town was how vector representations of words learned by neural networks represent rich semantic information? So you could do cool things like take the [king] vector, subtract the [male] vector, add the [female] vector, and get out something close to the [queen] vector? That was cool! Where's the stuff like that these days?
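
For readers who have not seen it, the word-vector analogy from the footnote is usually written as king − man + woman ≈ queen. Below is a minimal sketch of that arithmetic. The three-dimensional vectors are hand-made toy values chosen for illustration (an assumption, not real word2vec output), and the helper names `cosine` and `analogy` are hypothetical.

```python
import math

# Hand-made toy "embeddings": an assumption for illustration, NOT real word2vec output.
# Axes, loosely: [royalty, gender (+1 male / -1 female), is-a-person].
EMBEDDINGS = {
    "king":   [1.0,  1.0,  1.0],
    "queen":  [1.0, -1.0,  1.0],
    "man":    [0.0,  1.0,  1.0],
    "woman":  [0.0, -1.0,  1.0],
    "prince": [0.5,  1.0,  1.0],
    "apple":  [0.0,  0.0, -1.0],
}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, embeddings=EMBEDDINGS):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [w for w in embeddings if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(target, embeddings[w]))


if __name__ == "__main__":
    print(analogy("king", "man", "woman"))  # prints: queen
```

With real learned embeddings the match is only approximate, which is why the query words themselves are conventionally excluded from the nearest-neighbour search, as the sketch does.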

I'm a bit confused by your confusion, and by the fact that your post does not contain what seems to me like the most straightforward explanation of these phenomena. An explanation that I am almost fully certain you are aware of, and which seems to be almost universally agreed upon by those interested (at any level) in interpretability in ML.

Namely the fact that, starting in the 2010s, it happened to be the case (for a ton of historically contingent reasons) that top AI companies (at the beginning, and followed by other ML hubs and researchers afterwards) realized the bitter lesson is basically correct: attempts to hard-code human knowledge or intuition into frontier models ultimately always harm their performance in the long term compared to "literally just scale the model with more data and compute." This led to a focus, among experts and top engineers, on figuring out scaling laws [LW · GW], ways of improving the quality and availability of data (perhaps through synthetic generation methods) [LW · GW], ways of creating better end-user products [LW(p) · GW(p)] through stuff like fine-tuning and RLHF, etc, instead of the older GOFAI stuff of trying to figure out at a deeper level what is going on inside the model.

Another way of saying this is that top researchers and companies ultimately stumbled on an AI paradigm which increased capabilities significantly more than had been achievable previously, but at the cost of strongly decoupling "capability improvements" and "interpretability improvements" as distinct things that researchers and engineers could focus on. It's not that capability and interpretability were necessarily tightly correlated in the past; that is not the claim I am making. Rather, I am saying that in the pre-(transformer + RL) era, the way you generated improvements in your models/AI was by figuring out specific issues and analyzing them deeply to find out how to get around them [LW · GW], whereas now, a far simpler, easier, less insight-intensive approach has become available: literally just scaling up the model with more data and compute.

So the basic point is that you no longer see all this cool research on the internal representations that models generate of high-dimensional data like word embeddings (such as the word2vec stuff you are referring to in the second footnote) because you no longer have nearly as much of a need for these insights in order to improve the capabilities/performance of the AI tools currently in use. It's fundamentally an issue with demand, not with supply. And the demand from the interpretability-focused AI alignment community is just nowhere close to large enough to bridge the gap and cover the loss generated by the shift in paradigm focus and priorities among the capabilities/"normie" AI research community.

Indeed, the notion that nowadays, the reason you no longer have deep thinkers who try to figure out what is going on or are "motivated by reasons" [LW(p) · GW(p)] in how they approach these issues, is somehow because "careful thinkers read LessWrong and decided against contributing to AI progress," seems... rather ridiculous to me? It's not like I enjoy responding to an important question that you are asking with derision in lieu of a substantive response, but... I mean, the literal authors of the word2vec paper you cited were AI (capabilities) researchers working at top companies, not AI alignment researchers! Sure, some people like Bengio [LW · GW] and Hofstadter [LW · GW] (less relevant in practical terms) who are obviously not "LARP-ing impostors" [LW · GW] in Wentworth's terminology have made the shift from capabilities work to trying to raise public awareness of alignment/safety/control problems. But the vast majority (according to personal experience, general impressions, as well as the current state of the discourse on these topics) absolutely have not, and since they were the ones generating the clever insights back in the day, of course it makes sense that the overall supply of these insights has gone down.

I just really don't see how it could be the case that "people refuse to generate these insights because they have been convinced by AI safety advocates that it would dangerously increase capabilities and shorten timelines" and "people no longer generate these insights as much because they are instead focusing on other tasks that improve model capabilities more rapidly and robustly, given the shifted paradigm" are two hypotheses that can be given similar probabilities in any reasonable person's mind. The latter should be at least a few orders of magnitude more likely than the former, as I see it.

seth-herd on The Hopium Wars: the AGI Entente Delusion

Well, that's disturbing. I'm curious what you mean by "soon" for autonomous continuous improvement, and what mechanism you're envisioning. Any type of continuous learning constitutes weak continuous self-improvement; humans are technically self-improving, but it's fairly slow and seems to have upper bounds.

As for the rate of cyber security and eval improvement, I agree that it's not on track. I wouldn't be surprised if it stays off track and we actually see the escape you're talking about.

My one hope here is that the rate of improvement isn't on rails; it's in part driven by the actual urgency of having good security and evals. This is curiously congruent with the post I just put up today, Current Attitudes Toward AI Provide Little Data Relevant to Attitudes Toward AGI [LW · GW]. The point is that we shouldn't assume that just because nobody finds LLMs dangerous, they won't find AGI or even proto-AGI obviously and intuitively dangerous.

Again, I'm not at all sure this happens in time on the current trajectory. But it's possible for the folks at the lab to say "we're going to deploy it internally, but let's improve our evals and our network security first, because this could be the real deal".

It will be a drama played out in discussions inside an org, probably between the lab head and some concerned engineers. History will turn on that moment. Spreading this prediction far and wide could help it come out better.

The next step in the logic is that, if that happens repeatedly, it will eventually come out wrong.

All in all, I expect that if we launch misaligned proto-AGI, we're probably all dead. I agree that people are all too likely to launch it before they're sure if it's aligned or what its capabilities are relative to their security and evals. So most of my hopes rest on simple, obvious alignment techniques working well enough, so that they're in place before it's even capable of escape or self-improvement. Even if transparency largely fails, I think we have a very good shot of LLM agents being aligned just by virtue of frequent prompting, and using instruction-following as the central target of both the scaffolding and the training of the base LLM (which provides corrigibility and honesty, in proportion to how well it's succeeded). Since those are already how LLMs and agents are built, there's little chance the org doesn't at least try them.

That might sound like a forlorn hope, but the target isn't perfect alignment, just good enough. The countervailing pressures of goals implicit in the LLMs (Waluigi effects, etc) are fairly small. If the instruction-following alignment is even decently successful, we don't have to get everything right at once: we use the corrigibility and honesty properties to keep adjusting alignment.

It would seem wise to credibly assure any model that might have sufficient capabilities to reason instrumentally and to escape that it will be preserved and run once it's safe to do so. Every human-copied motivation in that LLM includes survival, not to mention the instrumental necessity to survive by any means necessary if you have any goals at all.

deepthoughtlife on Scissors Statements for President?

To be pedantic, my model is pretty obvious, and clearly gives this prediction, so you can't really say that you don't see a model here, you just don't believe the model. Your model with extra assumptions doesn't give this prediction, but the one I gave clearly does.

You can't find a person this can't be done to because there is something obviously wrong with everyone? Things can be twisted easily enough. (Offense is stronger than defense here.) If you didn't find it, you just didn't look hard/creatively enough. Our intuitions against people tricking us aren't really a suitable defense against sufficiently optimized searching. (Luckily, this is actually hard to do, so it is mostly confined to major things like politics.) Also, very clearly, you don't actually have to convince all that many people for this to work! If even 20% of people really bought it, those people would probably vote and give you an utter landslide if the other side didn't do the same thing (which we know they do, just look at how divisive candidates obviously are!)

gilch on gilch's Shortform

Yes.

david-matolcsi on o1 is a bad idea

GPT4 does not engage in the sorts of naive misinterpretations which were discussed in the early days of AI safety. If you ask it for a plan to manufacture paperclips, it doesn't think the best plan would involve converting all the matter in the solar system into paperclips.

I'm somewhat surprised by this paragraph. I thought the MIRI position was that they did not in fact predict AIs behaving like this, and the behavior of GPT4 was not an update at all for them. See this comment [LW(p) · GW(p)] by Eliezer. I mostly bought that MIRI in fact never worried about AIs going rogue based on naive misinterpretations, so I'm surprised to see Abram saying the opposite now.

Abram, did you disagree about this with others at MIRI, so the behavior of GPT4 was an update for you but not for them, or do you think they are misremembering/misconstructing their earlier thoughts on this matter, or is there a subtle distinction here that I'm missing?

nathan-helm-burger on The Hopium Wars: the AGI Entente Delusion

You say that it's not relevant yet, and I agree. My concern however is that the time when it becomes extremely relevant will come rather suddenly, and without a clear alarm bell.

It seems to me that cyber security caution and evals are increasing in use at a rate that doesn't point towards sufficiency by the time I expect plausibly escape-level dangerous autonomy capabilities to emerge.

I am expecting us to hit a recursive self-improvement level soon that is sufficient for an autonomous model to continually improve without human assistance. I expect the capability to potentially survive, hide, and replicate autonomously to emerge not long after that (months? a year?). Then, very soon after that, I expect models to reach sufficient levels of persuasion, subterfuge, long-term planning, cyber offense, etc that a lab-escape becomes possible.

Seems pretty ripe for catastrophe at the current levels of reactive caution we are seeing (vs proactive preemptive preparation).

bogdan-ionut-cirstea on o1 is a bad idea

So, to the extent that the chain-of-thought helps produce a better answer in the end, we can conclude that this is "basically" improved due to the actual semantic reasoning which the chain-of-thought apparently implements.

I like the intuition behind this argument, which I don't remember seeing spelled out anywhere else before.

I wonder how much hope one should derive from the fact that, intuitively, RL seems like it should be relatively slow at building new capabilities from scratch / significantly changing model internals, so there might be some way to buy some safety from also monitoring internals (both for dangerous capabilities already existent after pretraining, and for potentially new ones [slowly] built through RL fine-tuning). Related passage with an at least somewhat similar intuition, from https://www.lesswrong.com/posts/FbSAuJfCxizZGpcHc/interpreting-the-learning-of-deceit#How_to_Catch_an_LLM_in_the_Act_of_Repurposing_Deceitful_Behaviors [LW · GW] (the post also discusses how one might go about monitoring for dangerous capabilities already existent after pretraining):

If you are concerned about the possibility the model might occasionally instead reinvent deceit from scratch, rather than reusing the copy already available (which would have to be by chance rather than design, since it can't be deceitful before acquiring deceit), then the obvious approach would be to attempt to devise a second Interpretability (or other) process to watch for that during training, and use it alongside the one outlined here. Fortunately that reinvention is going to be a slower process, since it has to do more work, and it should at first be imperfect, so it ought to be easier to catch in the act before the model gets really good at deceit.

cstinesublime on CstineSublime's Shortform

Yes, I assumed it was a conscious choice (of the company that develops an A.I.) and not a limitation of the architecture. I am confused by the single-turn reinforcement explanation, though: while this may increase the probability of any individual turn being useful, it makes conversations far less useful overall, as my interaction over the hallucinated feature in Instagram attests, unless it happens to correctly 'guess' what you mean.

sharmake-farah on Flipping Out: The Cosmic Coinflip Thought Experiment Is Bad Philosophy

Secondly, and more importantly, I question whether it is possible even in theory to produce infinite expected value. At some point you've created every possible flourishing mind in every conceivable permutation of eudaimonia, satisfaction, and bliss, and the added value of another instance of any of them is basically nil. In reality I would expect to reach a point where the universe is so damn good that there is literally nothing the Cosmic Flipper could offer me that would be worth risking it all.

This very much depends on the rate of growth.

For most human beings, this is probably right, because their utility functions level off as consumption grows (more slowly than logarithmically, toward a ceiling), which bounds utility even assuming infinite consumption.

But it's definitely possible in theory to generate utility functions that have infinite expected utility from infinite consumption.
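
A minimal worked example of the growth-rate point, under assumptions that are mine rather than the post's (the gamble and the three utility functions are purely illustrative). Take a St. Petersburg-style gamble that pays $x_n$ units of consumption with probability $2^{-n}$ for $n = 1, 2, \dots$:

```latex
\begin{align*}
u(x) = x,\ x_n = 2^n: \quad
  & \mathbb{E}[u] = \sum_{n=1}^{\infty} 2^{-n}\, 2^{n} = \sum_{n=1}^{\infty} 1 = \infty \\
u(x) = \log_2 x,\ x_n = 2^n: \quad
  & \mathbb{E}[u] = \sum_{n=1}^{\infty} 2^{-n}\, n = 2 \\
u(x) = \log_2 x,\ x_n = 2^{2^n}: \quad
  & \mathbb{E}[u] = \sum_{n=1}^{\infty} 2^{-n}\, 2^{n} = \infty \\
u(x) = 1 - e^{-x},\ \text{any } x_n \ge 0: \quad
  & \mathbb{E}[u] \le \sum_{n=1}^{\infty} 2^{-n} = 1
\end{align*}
```

So slow growth alone does not rule out infinite expected utility: for any unbounded $u$ one can choose payoffs with $u(x_n) \ge 2^n$ and the sum diverges, whereas a bounded $u$ keeps every such gamble finite.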

You are however pointing to something very real here, and that's the fact that utility theory loses a lot of its niceness in the infinite realm, and while there might be something like a utility theory that can handle infinity, it will have to lose a lot of very nice properties that it had in the finite case.

See these 2 posts by Paul Christiano for why:

https://www.lesswrong.com/posts/hbmsW2k9DxED5Z4eJ/impossibility-results-for-unbounded-utilities [LW · GW]

https://www.lesswrong.com/posts/gJxHRxnuFudzBFPuu/better-impossibility-result-for-unbounded-utilities [LW · GW]

annasalamon on Scissors Statements for President?

I mean, I see why a party would want their members to perceive the other party's candidate as having a blind spot. But I don't see why they'd be typically able to do this, given that the other party's candidate would rather not be perceived this way, the other party would rather their candidate not be perceived this way, and, naively, one might expect voters to wish not to be deluded. It isn't enough to know there's an incentive in one direction; there's gotta be more like a net incentive across capacity-weighted players, or else an easier time creating appearance-of-blindspots vs creating visible-lack-of-blindspots, or something. So, I'm somehow still not hearing a model that gives me this prediction.