LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

Bay Winter Solstice 2024: song leading auditions
tcheasdfjkl · 2024-11-10T23:59:08.199Z · comments (0)

[question] What prevents SB-1047 from triggering on deep fake porn/voice cloning fraud?
ChristianKl · 2024-09-26T09:17:39.088Z · answers+comments (21)

[link] If-Then Commitments for AI Risk Reduction [by Holden Karnofsky]
habryka (habryka4) · 2024-09-13T19:38:53.194Z · comments (0)

[link] Generic advice caveats
Saul Munn (saul-munn) · 2024-10-30T21:03:07.185Z · comments (1)

[link] Evaluating Synthetic Activations composed of SAE Latents in GPT-2
Giorgi Giglemiani (Rakh) · 2024-09-25T20:37:48.227Z · comments (0)

Domain-specific SAEs
jacob_drori (jacobcd52) · 2024-10-07T20:15:38.584Z · comments (0)

European Progress Conference
Martin Sustrik (sustrik) · 2024-10-06T11:10:03.819Z · comments (11)

[link] Predicting Influenza Abundance in Wastewater Metagenomic Sequencing Data
jefftk (jkaufman) · 2024-09-23T17:25:58.380Z · comments (0)

Interpretability of SAE Features Representing Check in ChessGPT
Jonathan Kutasov (jonathan-kutasov) · 2024-10-05T20:43:36.679Z · comments (2)

There aren't enough smart people in biology doing something boring
Abhishaike Mahajan (abhishaike-mahajan) · 2024-10-21T15:52:04.482Z · comments (13)

An AI crash is our best bet for restricting AI
Remmelt (remmelt-ellen) · 2024-10-11T02:12:03.491Z · comments (3)

Distinguishing ways AI can be "concentrated"
Matthew Barnett (matthew-barnett) · 2024-10-21T22:21:13.666Z · comments (2)

Superintelligence Can't Solve the Problem of Deciding What You'll Do
Vladimir_Nesov · 2024-09-15T21:03:28.077Z · comments (11)

Do Sparse Autoencoders (SAEs) transfer across base and finetuned language models?
Taras Kutsyk · 2024-09-29T19:37:30.465Z · comments (7)

SAEs you can See: Applying Sparse Autoencoders to Clustering
Robert_AIZI · 2024-10-28T14:48:16.744Z · comments (0)

[link] A brief history of the automated corporation
owencb · 2024-11-04T14:35:04.906Z · comments (1)

Sleeping on Stage
jefftk (jkaufman) · 2024-10-22T00:50:07.994Z · comments (3)

Standard SAEs Might Be Incoherent: A Choosing Problem & A “Concise” Solution
Kola Ayonrinde (kola-ayonrinde) · 2024-10-30T22:50:45.642Z · comments (0)

[link] Care Doesn't Scale
stavros · 2024-10-28T11:57:38.742Z · comments (1)

[question] Seeking AI Alignment Tutor/Advisor: $100–150/hr
MrThink (ViktorThink) · 2024-10-05T21:28:16.491Z · answers+comments (3)

Option control
Joe Carlsmith (joekc) · 2024-11-04T17:54:03.073Z · comments (0)

Why is there Nothing rather than Something?
Logan Zoellner (logan-zoellner) · 2024-10-26T12:37:50.204Z · comments (3)

SAE features for refusal and sycophancy steering vectors
neverix · 2024-10-12T14:54:48.022Z · comments (4)

Fun With The Tabula Muris (Senis)
sarahconstantin · 2024-09-20T18:20:01.901Z · comments (0)

[link] Introduction to Super Powers (for kids!)
Shoshannah Tekofsky (DarkSym) · 2024-09-20T17:17:27.070Z · comments (0)

AXRP Episode 36 - Adam Shai and Paul Riechers on Computational Mechanics
DanielFilan · 2024-09-29T05:50:02.531Z · comments (0)

The case for more Alignment Target Analysis (ATA)
Chi Nguyen · 2024-09-20T01:14:41.411Z · comments (13)

[question] When can I be numerate?
FinalFormal2 · 2024-09-12T04:05:27.710Z · answers+comments (3)

A Triple Decker for Elfland
jefftk (jkaufman) · 2024-10-11T01:50:02.332Z · comments (0)

[link] Death notes - 7 thoughts on death
Nathan Young · 2024-10-28T15:01:13.532Z · comments (1)

[question] When engaging with a large amount of resources during a literature review, how do you prevent yourself from becoming overwhelmed?
corruptedCatapillar · 2024-11-01T07:29:49.262Z · answers+comments (2)

Linkpost: "Imagining and building wise machines: The centrality of AI metacognition" by Johnson, Karimi, Bengio, et al.
Chris_Leong · 2024-11-11T16:13:26.504Z · comments (6)

You're Playing a Rough Game
jefftk (jkaufman) · 2024-10-17T19:20:06.251Z · comments (2)

[link] UK AISI: Early lessons from evaluating frontier AI systems
Zach Stein-Perlman · 2024-10-25T19:00:21.689Z · comments (0)

How to put California and Texas on the campaign trail!
Yair Halberstadt (yair-halberstadt) · 2024-11-06T06:08:25.673Z · comments (4)

[link] Conventional footnotes considered harmful
dkl9 · 2024-10-01T14:54:01.732Z · comments (16)

[link] SB 1047 gets vetoed
ryan_b · 2024-09-30T15:49:38.609Z · comments (1)

[link] Tokyo AI Safety 2025: Call For Papers
Blaine (blaine-rogers) · 2024-10-21T08:43:38.467Z · comments (0)

Abstractions are not Natural
Alfred Harwood · 2024-11-04T11:10:09.023Z · comments (18)

[question] How should vegans think about Methionine needs?
ChristianKl · 2024-11-10T09:28:47.655Z · answers+comments (1)

Seeking Mechanism Designer for Research into Internalizing Catastrophic Externalities
c.trout (ctrout) · 2024-09-11T15:09:48.019Z · comments (2)

[link] "25 Lessons from 25 Years of Marriage" by honorary rationalist Ferrett Steinmetz
CronoDAS · 2024-10-02T22:42:30.509Z · comments (2)

Drug development costs can range over two orders of magnitude
rossry · 2024-11-03T23:13:17.685Z · comments (0)

AI Safety University Organizing: Early Takeaways from Thirteen Groups
agucova · 2024-10-02T15:14:00.137Z · comments (0)

[link] overengineered air filter shelving
bhauth · 2024-11-08T22:04:39.987Z · comments (2)

The new ruling philosophy regarding AI
Mitchell_Porter · 2024-11-11T13:28:24.476Z · comments (0)

Winning isn't enough
JesseClifton · 2024-11-05T11:37:39.486Z · comments (14)

Improving Model-Written Evals for AI Safety Benchmarking
Sunishchal Dev (sunishchal-dev) · 2024-10-15T18:25:08.179Z · comments (0)

[link] A Defense of Peer Review
Niko_McCarty (niko-2) · 2024-10-22T16:16:49.982Z · comments (1)

Launching Adjacent News
Lucas Kohorst (lucas-kohorst) · 2024-10-16T17:58:10.289Z · comments (0)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

seth-herd on The Hopium Wars: the AGI Entente Delusion

Well, that's disturbing. I'm curious what you mean by "soon"' for autonomous continuous improvement, and what mechanism you're envisioning. Any type of continuous learning constitutes weak continuous self-improvement; humans are technically self-improving, but it's fairly slow and seems to have upper bounds.

As for the rate of cyber security and eval improvement, I agree that it's not on track. I wouldn't be surprised if it's not on track, and we'll actually see the escape you're talking about.

My one hope here is that the rate of improvement isn't on rails; it's in part driven by the actual urgency of having good security and evals. This is curiously congruent with the post I just put up today, Current Attitudes Toward AI Provide Little Data Relevant to Attitudes Toward AGI [LW · GW]. The point is that we shouldn't assume that just because nobody finds LLMs dangerous, they won't find AGI or even proto-AGI obviously and intuitively dangerous.

Again, I'm not at all sure this happens in time on the current trajectory. But it's possible for the folks at the lab to say "we're going to deploy it internally, but let's improve our evals and our network security first, because this could be the real deal".

It will be a drama played out in discussions inside an org, probably between the lab head and some concerned engineers. History will turn on that moment. Spreading this prediction far and wide could help it come out better.

The next step in the logic is that, if that happens repeatedly, it will eventually come out wrong.

All in all, I expect that if we launch misaligned proto-AGI, we're probably all dead. I agree that people are all too likely to launch it before they're sure if it's aligned or what its capabilities are relative to their security and evals. So most of my hopes rest on simple, obvious alignment techniques working well enough, so that they're in place before it's even capable of escape or self-improvement. Even if transparency largely fails, I think we have a very good shot of LLM agents being aligned just by virtue of frequent prompting, and using instruction-following as the central target of both the scaffolding and the training of the base LLM (which provides corrigibility and honesty, in proportion to how well it's succeeded). Since those are already how LLMs and agents are built, there's little chance the org doesn't at least try them.

That might sound like a forlorn hope; but the target isn't perfect alignment, just good-enough. The countervailing pressures of goals implicit in the LLMs (Waluigi effects, etc) are fairly small. If the instruction-following alignment is even decently successful, we don't have to get everything right at once- we use the corrigibility and honesty properties to keep adjusting alignment.

It would seem wise to credibly assure any model that might have sufficient capabilities to reason instrumentally and to escape that it will be preserved and run once it's safe to do so. Every human-copied motivation in that LLM includes survival, not to mention the instrumental necessity to survive by any means necessary if you have any goals at all.

deepthoughtlife on Scissors Statements for President?

To be pedantic, my model is pretty obvious, and clearly gives this prediction, so you can't really say that you don't see a model here, you just don't believe the model. Your model with extra assumptions doesn't give this prediction, but the one I gave clearly does.

You can't find a person this can't be done to because there is something obviously wrong with everyone? Things can be twisted easily enough. (Offense is stronger than defense here.) If you didn't find it, you just didn't look hard/creatively enough. Our intuitions against people tricking us aren't really suitable defense against sufficiently optimized searching. (Luckily, this is actually hard to do so it is pretty confined most of the time to major things like politics.) Also, very clearly, you don't actually have to convince all that many people for this to work! If even 20% of people really bought it, those people would probably vote and give you an utter landslide if the other side didn't do the same thing (which we know they do, just look at how divisive candidates obviously are!)

gilch on gilch's Shortform

Yes.

david-matolcsi on o1 is a bad idea

GPT4 does not engage in the sorts of naive misinterpretations which were discussed in the early days of AI safety. If you ask it for a plan to manufacture paperclips, it doesn't think the best plan would involve converting all the matter in the solar system into paperclips.

I'm somewhat surprised by this paragraph. I thought the MIRI position was that they did not in fact predict AIs behaving like this, and the behavior of GPT4 was not an update at all for them. See this comment [LW(p) · GW(p)] by Eliezer. I mostly bought that MIRI in fact never worried about AIs going rouge based on naive misinterpretations, so I'm surprised to see Abram saying the opposite now.

Abram, did you disagree about this with others at MIRI, so the behavior of GPT4 was an update for you but not for them, or do you think they are misremembering/misconstructing their earlier thoughts on this matter, or is there a subtle distinction here that I'm missing?

nathan-helm-burger on The Hopium Wars: the AGI Entente Delusion

You say that it's not relevant yet, and I agree. My concern however is that the time when it becomes extremely relevant will come rather suddenly, and without a clear alarm bell.

It seems to me that the rate at which cyber security caution and evals are increasing in use is a rate that doesn't seem to point towards sufficiency at the time I expect plausibly escape-level dangerous autonomy capabilities to emerge.

I am expecting us to hit a recursive self-improvement level soon that is sufficient for an autonomous model to continually improve without human assistance. I expect the capability to potentially survive, hide, and replicate autonomously to emerge not long after that (months? a year?). Then, very soon after that, I expect models to reach sufficient levels of persuasion, subterfuge, long-term planning, cyber offense, etc that a lab-escape becomes possible.

Seems pretty ripe for catastrophe at the current levels of reactive caution we are seeing (vs proactive preemptive preparation).

bogdan-ionut-cirstea on o1 is a bad idea

So, to the extent that the chain-of-thought helps produce a better answer in the end, we can conclude that this is "basically" improved due to the actual semantic reasoning which the chain-of-thought apparently implements.

I like the intuition behind this argument, which I don't remember seeing spelled out anywhere else before.

I wonder how much hope one should derive from the fact that, intuitively, RL seems like it should be relatively slow at building new capabilities from scratch / significantly changing model internals, so there might be some way to buy some safety from also monitoring internals (both for dangerous capabilities already existant after pretraining, and for potentially new ones [slowly] built through RL fine-tuning). Related passage with an at least somewhat similar intuition, from https://www.lesswrong.com/posts/FbSAuJfCxizZGpcHc/interpreting-the-learning-of-deceit#How_to_Catch_an_LLM_in_the_Act_of_Repurposing_Deceitful_Behaviors [LW · GW] (the post also discusses how one might go about monitoring for dangerous capabilities already existant after pretraining):

If you are concerned about the possibility the model might occasionally instead reinvent deceit from scratch, rather than reusing the copy already available (which would have to be by chance rather than design, since it can't be deceitful before acquiring deceit), then the obvious approach would be to attempt to devise a second Interpretablity (or other) process to watch for that during training, and use it alongside the one outlined here. Fortunately that reinvention is going to be a slower process, since it has to do more work, and it should at first be imperfect, so it ought to be easier to catch in the act before the model gets really good at deceit.

cstinesublime on CstineSublime's Shortform

Yes I assumed it was a conscious choice (of the company that develops an A.I.) and not a limitation of the architecture. Although I am confused by the single-turn reinforcement explanation as while this may increase the probability of any individual turn being useful, as my interaction over the hallucinated feature in Instagram attests to, it makes conversations far less useful overall unless it happens to correctly 'guess' what you mean.

sharmake-farah on Flipping Out: The Cosmic Coinflip Thought Experiment Is Bad Philosophy

Secondly, and more importantly, I question whether it is possible even in theory to produce infinite expected value. At some point you've created every possible flourishing mind in every conceivable permutation of eudaimonia, satisfaction, and bliss, and the added value of another instance of any of them is basically nil. In reality I would expect to reach a point where the universe is so damn good that there is literally nothing the Cosmic Flipper could offer me that would be worth risking it all.

This very much depends on the rate of growth.

For most human beings, this is probably right, because their values have a function that grows slower than logarithmic, which leads to bounds on the utility even assuming infinite consumption.

But it's definitely possible in theory to generate utility functions that have infinite expected utility from infinite consumption.

You are however pointing to something very real here, and that's the fact that utility theory loses a lot of it's niceness in the infinite realm, and while there might be something like a utility theory that can handle infinity, it will have to lose a lot of very nice properties that it had in the finite case.

See these 2 posts by Paul Christiano for why:

https://www.lesswrong.com/posts/hbmsW2k9DxED5Z4eJ/impossibility-results-for-unbounded-utilities [LW · GW]

https://www.lesswrong.com/posts/gJxHRxnuFudzBFPuu/better-impossibility-result-for-unbounded-utilities [LW · GW]

annasalamon on Scissors Statements for President?

I mean, I see why a party would want their members to perceive the other party's candidate as having a blind spot. But I don't see why they'd be typically able to do this, given that the other party's candidate would rather not be perceived this way, the other party would rather their candidate not be perceived this way, and, naively, one might expect voters to wish not to be deluded. It isn't enough to know there's an incentive in one direction; there's gotta be more like a net incentive across capacity-weighted players, or else an easier time creating appearance-of-blindspots vs creating visible-lack-of-blindspots, or something. So, I'm somehow still not hearing a model that gives me this prediction.

habryka4 on johnswentworth's Shortform

My best guess this is false. As a quick sanity-check, here are some bipartisan and right-leaning organizations historically funded by OP:

FAI leans right. https://www.openphilanthropy.org/grants/foundation-for-american-innovation-ai-safety-policy-advocacy/
Horizon is bipartisan https://www.openphilanthropy.org/grants/open-philanthropy-technology-policy-fellowship-2022/ .
CSET is bipartisan https://www.openphilanthropy.org/grants/georgetown-university-center-for-security-and-emerging-technology/ .
IAPS is bipartisan. https://www.openphilanthropy.org/grants/page/2/?focus-area=potential-risks-advanced-ai&view-list=false, https://www.openphilanthropy.org/grants/institute-for-ai-policy-strategy-general-support/
RAND is bipartisan. https://www.openphilanthropy.org/grants/rand-corporation-emerging-technology-fellowships-and-research-2024/.
Safe AI Forum. https://www.openphilanthropy.org/grants/safe-ai-forum-operating-expenses/
AI Safety Communications Centre. https://www.openphilanthropy.org/grants/effective-ventures-foundation-ai-safety-communications-centre/ seems to lean left.

Of those, I think FAI is the only one at risk of OP being unable to fund them, based on my guess of where things are leaning. I would be quite surprised if they defunded the other ones on bipartisan grounds.

Possibly you meant to say something more narrow like "even if you are trying to be bipartisan, if you lean right, then OP is substantially less likely to fund you" which I do think is likely true, though my guess is you meant the stronger statement, which I think is false.