LessWrong 2.0 Reader
seth-herd on The Problem With the Word ‘Alignment’
I think you're right about these drawbacks of using the term "alignment" so broadly. And I agree that more work and attention should be devoted to specifying how we suppose these concepts relate to each other. In my experience, far too little effort is devoted to placing scientific work within its broader context. We cannot afford wasted effort in work on alignment.
I don't see a better alternative, nor do you suggest one. My preference in terminology is to simply use more specification, rather than trying to get anyone to change the terminology they use. With that in mind, I'll list what I see as the most common existing terminology for each of the sub-problems.
P1: Avoiding takeover from emergent optimization in AI agents
Best term in use: AInotkilleveryoneism. I disagree that alignment is commonly misused for this.
I don't think I've heard these termed alignment, outside of the assumption you mention in the Berkeley model that value alignment (P5) is the only way of avoiding takeover (P1). P1 has been termed "the control problem," which encompasses value alignment. That term is good: P1 does not fit the intuitive definition of alignment. The deliberately clumsy term "AInotkilleveryoneism" seems good for this, in any context you can get away with it. Your statement seems good otherwise.
P2: Ensuring that AI’s information processing (and/or reasoning) is intelligible to us
Best term in use: interpretability
This is more commonly called interpretability, but I agree that it's commonly lumped into "alignment work" without carefully examining just how it fits in. But it does legitimately fit into P1 (which shouldn't be called alignment), as well as (what I think you mean by) P3, P5, and P6, which do fit the intuitive meaning of "alignment." Thus, it does seem to deserve the term "alignment work" as well as its more precise term of interpretability. So this seems about right, with the caveat of wanting more specificity. As it happens, I just now published a post on exactly this.
P3: Ensuring AIs are good at solving problems as specified (by user or designer)
Best term in use: None. AI safety?
I think you mean to include ensuring AIs also do not do things their designers don't want. I suggest changing your description, since that effort is more often called alignment and accused of safety-washing.
This is the biggest offender. The problem is that "alignment" is intuitively appealing. I'd argue that this usage is completely wrong: you can't align a system with goals (humans) with a tool without goals (LLMs). A sword or an axe is not aligned with its wielder; swords and axes certainly lead to more trees cut down and people stabbed, but they do not intend those things, so there's a type error in saying they are aligned with their users' goals.
But this is pedantry that will continue to be ignored. I don't have a good idea for making this terminology clear. The term AGI was at one point used to specify AI with agency and goals, and thus AI that would be alignable with human goals, but it's been watered down. We need a replacement. And we need a better term for "aligning" AIs that are not at all dangerous in the severe way the "alignment problem" terminology was intended to address. Or a different term for the important work of aligning agentic, RSI-capable AGI.
P4: Ensuring AI systems enhance, and don’t erode, human agency
What? I'd drop this and just consider it a subset of P6. Maybe this plays a bigger role and gets the term alignment more than I know? Do you have examples?
P5: Ensuring that advanced AI agents learn a human utility function
Best term in use: value alignment OR technical alignment.
I think these deserve separate categories in your terminology, because they only partly overlap: technical alignment could be limited to making AGIs that follow instructions. I have been thinking about this a lot. I agree with your analysis that this is what people will probably do for economic reasons; but I also think instruction-following is much easier than full value alignment for practical reasons, which will serve as a convenient excuse to align AGI to follow its creators' instructions. I recently wrote up that logic. This conclusion raises another problem that deserves to join the flock of related alignment problems: the societal alignment problem. If some humans have AGIs aligned to their values (likely through their intent/instructions), how can we align society to avoid the resulting disasters from AGI-powered conflict?
P6: Ensuring that AI systems lead to desirable systemic and long term outcomes
Best term in use: I don't think there is one. Any ideas?
yanni-kyriacos on yanni's Shortform
Two jobs in AI Safety Advocacy that AFAICT don't exist, but should and probably will very soon. Will EAs be the first to create them, though? There is a strong first-mover advantage waiting for someone:
1. Volunteer Coordinator - there will soon be a groundswell from the general population wanting to have a positive impact on AI. Most won't know how to. A volunteer coordinator will help capture and direct their efforts positively, for example by having them write emails to politicians.
2. Partnerships Manager - the President of the Voice Actors guild reached out to me recently. We had a surprising amount of crossover in our concerns and potential solutions. Voice Actors are the canary in the coal mine. More unions (etc.) will follow very shortly. I imagine that within a year there will be a formalised group of these different orgs advocating together.
eggsyntax on eggsyntax's Shortform
Anthropic's new paper 'Mapping the Mind of a Large Language Model' is exciting work that really advances the state of the art for the sparse-autoencoder-based dictionary learning approach to interpretability (which switches the unit of analysis in mechanistic interpretability from neurons to features). Their SAE learns (up to) 34 million features on a real-life production model, Claude 3 Sonnet (their middle-sized Claude 3 model).
The paper (which I'm still reading; it's not short) updates me somewhat toward 'SAE-based steering vectors will Just Work for LLM alignment up to human-level intelligence[1].' As I read, I'm trying to think through what I would have to see to be convinced of that hypothesis. I'm not expert here! I'm posting my thoughts mostly to ask for feedback about where I'm wrong and/or what I'm missing.
Remaining gaps I've thought of so far:
Of course LLM alignment isn't necessarily sufficient on its own for safety, since eg scaffolded LLM-based agents introduce risk even if the underlying LLM is well-aligned. But I'm just thinking here about what I'd want to see to feel confident that we could use these techniques to do the LLM alignment portion.
I think I'd be pretty surprised if it kept working much past human-level, although I haven't spent a ton of time thinking that through as yet.
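Since the comment leans on how SAE-based dictionary learning and feature steering work, here is a minimal sketch of the general technique, assuming PyTorch. The class, the `steer` helper, the dimensions, and the sparsity coefficient are all hypothetical illustrations, not Anthropic's actual code or architecture (their SAE is trained on residual-stream activations of Claude 3 Sonnet at vastly larger scale, with many additional training details).

```python
# Minimal sketch: dictionary learning with a sparse autoencoder (SAE),
# plus "steering" by nudging activations along a learned feature direction.
# All names and sizes here are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Re-expresses model activations as sparse combinations of learned features."""
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts: torch.Tensor):
        feats = torch.relu(self.encoder(acts))  # sparse, non-negative feature activations
        recon = self.decoder(feats)             # reconstruction of the original activations
        return recon, feats

def sae_loss(recon, acts, feats, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that pushes most features to zero,
    # so each activation is explained by only a few (hopefully interpretable) features.
    mse = ((recon - acts) ** 2).mean()
    sparsity = feats.abs().sum(dim=-1).mean()
    return mse + l1_coeff * sparsity

def steer(acts: torch.Tensor, sae: SparseAutoencoder, feature_idx: int, scale: float):
    # A "steering vector" in this framing: add (or subtract) one feature's
    # decoder direction to shift the model's activations along that feature.
    direction = sae.decoder.weight[:, feature_idx]  # shape: (d_model,)
    return acts + scale * direction

# Usage sketch: collect activations from one layer of an LLM, train the SAE
# on them, then inspect feature activations or steer generation.
sae = SparseAutoencoder(d_model=4096, d_features=65536)
acts = torch.randn(8, 4096)  # stand-in for real residual-stream activations
recon, feats = sae(acts)
loss = sae_loss(recon, acts, feats)
steered = steer(acts, sae, feature_idx=123, scale=5.0)
```

Whether such steering "Just Works" for alignment is exactly the open question the comment raises; the sketch only shows why it is cheap to try once the feature dictionary exists.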
My original point was about how I shouldn't use information I found online about the other person to try to impress them (e.g., I find out they used to play volleyball, then slip into conversation that I like volleyball). It makes things messy.
What you're talking about is whether one should try to impress their date in general. In that case, my dating heuristic (what would an emotionally healthy person do?) is open to interpretation, but only narrowly. I say 'narrow' because lying to impress someone would be out of scope for the heuristic, while your interpretation (trying to authentically impress and attract someone) would, I think most people would agree, be within scope.
Personally, I try to live a diverse lifestyle and let the other person decide on their own if they find me impressive.
lorxus on jacquesthibs's Shortform
I am very, very vaguely in the Natural Abstractions area of alignment approaches. I'll give this paper a closer read tomorrow (because I promised myself I wouldn't try to get work done today), but my quick take is: it'd be huge if true, though there's not much more there yet. It also offers no argument that, even if representations are converging for now, this will continue to hold; adding a whole bunch more effectively-usable compute might, say, mean the AI no longer has to chunk objectspace into subtypes rather than understanding every individual object directly.
robert-cousineau on What would stop you from paying for an LLM?
I stopped paying for ChatGPT earlier this week, while thinking about the departure of Jan and Daniel.
Before they left, I was able to say to myself, "Well, there are smarter people than me, with worldviews similar to mine, who have far more information about OpenAI than I do, and they think it is not a horrible place, so 20 bucks a month is probably fine." I am no longer able to do that.
They have explicitly sounded the alarm as best they reasonably know how to right now. I should listen!
dweomite on Ilya Sutskever and Jan Leike resign from OpenAI [updated]
I am not a lawyer, and my only knowledge of this agreement comes from the quote above, but... if the onboarding paperwork says you need to sign "a" general release, but doesn't describe the actual terms of that general release, then it's hard for me to see an interpretation that isn't either toothless or crazy:
if the contributor has built something consistently or overall harmful, that is indeed on them
I agree; this is in accord with the dogma. But for AI, overall harm is debatable and currently purely hypothetical, so this doesn't really apply. There is a popular idea that existential risk from AI has little basis in reality, since it's not already here to be observed. Thus contributing to public AI efforts is still seen as fine (and in terms of first-order effects it is perfectly fine right now).
My worry is that this attitude reframes commitments from RSP-like documents, so that people don't see the obvious implication that releasing weights breaks those commitments (absent currently impossible feats of unlearning), and don't see themselves as making a commitment to avoid releasing high-ASL weights even as they commit to such RSPs. If this point isn't written down, some people will only become capable of noticing it if actual catastrophes shift attitudes toward seeing open-weights foundation models as harmful overall (even once we are already higher up in ASLs). That shift doesn't necessarily happen even if there are some catastrophes with a limited blast radius, since those get balanced against positive effects.
habryka4 on Mitigating extreme AI risks amid rapid progress [Linkpost]
Yeah, I agree. I do think it's unlikely he was a major contributor to the paper, so it's more about endorsement. Agreed that if someone did serious work on a paper and then died, they should probably still be included (though IMO with an explicit footnote saying they died during the writing of the paper and might not endorse everything in the final version).