LessWrong 2.0 Reader
Unfortunately I don't have well-formed thoughts on this topic. I wonder if there are people who specialize in AI lab governance and have written about this, but I'm not personally aware of such writings. To brainstorm some ideas:
I wrote up a short post [LW · GW] with a summary of their results. It doesn't really answer any of your questions. I do have thoughts on a couple, even though I'm not an expert on interpretability.
But my main focus is on your footnote: is this going to help much with aligning "real" AGI? (I've been looking for a term; maybe REAL stands for Reflective Entities with Agency and Learning? :) I'm of course primarily thinking of foundation models scaffolded to have goals and cognitive routines, and to incorporate multiple AI systems such as an episodic memory system. I think the answer is that some of the interpretability work will be very valuable even in those systems, while some of it might be a dead end - and we haven't really thought through which is which yet.
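For concreteness, here's a minimal sketch of the kind of scaffolded system I have in mind: a foundation model called in a loop, given an explicit goal and a simple episodic memory. This is purely my own illustration - the `llm` callable and all names are hypothetical placeholders, not any real framework.

```python
# Toy sketch of a "scaffolded" foundation model: an explicit goal, a loop,
# and an episodic memory fed back into each prompt. All names here are
# illustrative placeholders, not a real agent framework or API.
from typing import Callable, List

def scaffolded_agent(llm: Callable[[str], str], goal: str, max_steps: int = 5) -> List[str]:
    episodic_memory: List[str] = []  # past steps the agent can recall later
    for _ in range(max_steps):
        prompt = (
            f"Goal: {goal}\n"
            f"Episodic memory so far: {episodic_memory}\n"
            "Decide on the next action:"
        )
        action = llm(prompt)            # the underlying foundation model call
        episodic_memory.append(action)  # store the step for later recall
    return episodic_memory
```

The interpretability question is whether tools built for the underlying model still tell us much once it's embedded in a loop like this.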
seth-herd on Helping loved ones with their finances: the why and how of an unusually impactful opportunity
It would be more useful with a little more info on what ideas you're offering. Linkposts with more description get more clickthrough. You can edit in a little more info.
molly on Forecasting: the way I think about it
Figure 1 is clumsy, sorry. In the case of a smooth probability distribution over infinite worlds, I think the median and the average world are the same? But in practice, yes, it's an expected value calculation, summing P(world) * P(U|world) over all the worlds you've thought about.
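To make that summation concrete, here is a toy version of the calculation with made-up numbers - purely an illustration, not figures from the post.

```python
# Toy expected-value calculation: sum P(world) * P(U | world) over the
# worlds you've explicitly considered. All numbers are invented.
worlds = [
    ("world_a", 0.5, 0.9),   # (label, P(world), P(U | world))
    ("world_b", 0.3, 0.4),
    ("world_c", 0.2, 0.1),
]
p_u = sum(p_w * p_u_given_w for _, p_w, p_u_given_w in worlds)
print(p_u)  # 0.5*0.9 + 0.3*0.4 + 0.2*0.1 = 0.59
```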
dpiepgrass on "No-one in my org puts money in their pension"
I can't recall another time when someone shared their personal feelings and experiences and someone else declared it "propaganda and alarmism". I haven't seen "zero-risker" types do the same, but I would be curious to hear the tale and, if they share it, I don't think anyone will call it "propaganda and killeveryoneism".
alcherblack on What's Going on With OpenAI's Messaging?
Broadly agree except for this part:
It's in an area that some people (not the OpenAI management) think is unusually high-risk,
I really can't imagine that someone who wrote "Development of superhuman machine intelligence is probably the greatest threat to the continued existence of humanity." in 2015, and who occasionally references extinction as a possibility even when not directly asked about it, doesn't think AGI development is high risk.
I'm not sure how to square this circle. I almost hope Sam is being consciously dishonest and has a 4D chess plan, as opposed to deluding himself that, while it's dangerous, the risks are low or somehow worth it. But it seems that the latter is more likely based on some other things he's said, e.g. "What I lose the most sleep over is the hypothetical idea that we already have done something really bad by launching ChatGPT".
seth-herd on The Problem With the Word ‘Alignment’
I think you're right about these drawbacks of using the term "alignment" so broadly. And I agree that more work and attention should be devoted to specifying how we suppose these concepts relate to each other. In my experience, far too little effort is devoted to placing scientific work within its broader context. We cannot afford to waste effort when working on alignment.
I don't see a better alternative, nor do you suggest one. My preference in terminology is to simply use more specification, rather than trying to get anyone to change the terminology they use. With that in mind, I'll list what I see as the most common existing terminology for each of the sub-problems.
P1: Avoiding takeover from emergent optimization in AI agents
Best term in use: AInotkilleveryoneism. I disagree that alignment is commonly misused for this.
I don't think I've heard these termed alignment, outside of the assumption you mention in the Berkeley model that value alignment (P5) is the only way of avoiding takeover (P1). P1 has been termed "the control problem", which encompasses value alignment; that framing is good. But P1 does not fit the intuitive definition of alignment. The deliberately clumsy term "AInotkilleveryoneism" seems good for this, in any context you can get away with it. Your statement seems good otherwise.
P2: Ensuring that AI’s information processing (and/or reasoning) is intelligible to us
Best term in use: interpretability
This is more commonly called interpretability, but I agree that it's commonly lumped into "alignment work" without carefully examining just how it fits in. But it does legitimately fit into P1 (which shouldn't be called alignment), as well as (what I think you mean by) P3, P5, and P6, which do fit the intuitive meaning of "alignment". Thus, it does seem to deserve the term "alignment work" as well as its more precise term of interpretability. So this seems about right, with the caveat of wanting more specificity. As it happens, I just now published a post on exactly this [LW · GW].
P3: Ensuring AIs are good at solving problems as specified (by user or designer)
Best term in use: None. AI safety?
I think you mean to include ensuring AIs also do not do things their designers don't want. I suggest changing your description, since that effort is more often called alignment and accused of safety-washing.
This is the biggest offender. The problem is that "alignment" is intuitively appealing. I'd argue that this is completely wrong: you can't align a system with goals (humans) with a tool without goals (LLMs). A sword or an axe is not aligned with its wielder; such tools certainly lead to more trees cut down and people stabbed, but they do not intend those things, so there's a type error in saying they are aligned with their users' goals.
But this is pedantry that will continue to be ignored. I don't have a good idea for making this terminology clear. The term AGI was at one point used to specify AI with agency and goals, and thus AI that would be alignable with human goals, but it's been watered down. We need a replacement. And we need a better term for "aligning" AIs that are not at all dangerous in the severe way the "alignment problem" terminology was intended to address. Or a different term for doing the important work of aligning agentic, RSI-capable AGI.
P4: Ensuring AI systems enhance, and don’t erode, human agency
What? I'd drop this and just consider it a subset of P6. Maybe this plays a bigger role and gets the term alignment more than I know? Do you have examples?
P5: Ensuring that advanced AI agents learn a human utility function
Best term in use: value alignment OR technical alignment.
I think these deserve their own categories in your terminology, because they overlap only partially - technical alignment could be limited to making AGIs that follow instructions. I have been thinking about this a lot. I agree with your analysis that this is what people will probably do for economic reasons; but I also think there are powerful practical reasons that instruction-following is much easier than full value alignment, which will be a valuable excuse to align AGI to follow instructions from its creators. I recently wrote up that logic. This conclusion raises another problem that I think deserves to join the flock of related alignment problems: the societal alignment problem. If some humans have AGIs aligned to their values (likely through their intent/instructions), how can we align society to avoid the resulting disasters from AGI-powered conflict?
P6: Ensuring that AI systems lead to desirable systemic and long term outcomes
Best term in use: I don't think there is one. Any ideas?
yanni-kyriacos on yanni's Shortform
Two jobs in AI Safety Advocacy that AFAICT don't exist, but should and probably will very soon. Will EAs be the first to create them though? There is a strong first mover advantage waiting for someone -
1. Volunteer Coordinator - there will soon be a groundswell from the general population wanting to have a positive impact in AI. Most won't know how to. A volunteer manager will help capture and direct their efforts positively, for example, by having them write emails to politicians [EA(p) · GW(p)]
2. Partnerships Manager - the President of the Voice Actors guild reached out to me recently. We had a surprising amount of crossover in concerns and potential solutions. Voice Actors are the canary in the coal mine. More unions (etc.) will follow very shortly. I imagine within a year there will be a formalised group of these different orgs advocating together.
eggsyntax on eggsyntax's Shortform
Anthropic's new paper 'Mapping the Mind of a Large Language Model' is exciting work that really advances the state of the art for the sparse-autoencoder-based dictionary learning approach to interpretability (which switches the unit of analysis in mechanistic interpretability from neurons to features). Their SAE learns (up to) 34 million features on a real-life production model, Claude 3 Sonnet (their middle-sized Claude 3 model).
The paper (which I'm still reading, it's not short) updates me somewhat toward 'SAE-based steering vectors will Just Work for LLM alignment up to human-level intelligence[1].' As I read, I'm trying to think through what I would have to see to be convinced of that hypothesis. I'm not an expert here! I'm posting my thoughts mostly to ask for feedback about where I'm wrong and/or what I'm missing.
Remaining gaps I've thought of so far:
Of course LLM alignment isn't necessarily sufficient on its own for safety, since eg scaffolded LLM-based agents introduce risk even if the underlying LLM is well-aligned. But I'm just thinking here about what I'd want to see to feel confident that we could use these techniques to do the LLM alignment portion.
[1] I think I'd be pretty surprised if it kept working much past human-level, although I haven't spent a ton of time thinking that through as yet.
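For readers who haven't followed the SAE work, here is a minimal sketch of the idea the comment is gesturing at: a sparse autoencoder over model activations, with "steering" done by adding a learned feature direction back into those activations. The module, shapes, and the `steer` helper below are my own illustrative assumptions, not code or hyperparameters from Anthropic's paper.

```python
# Illustrative-only sketch of an SAE over model activations and of steering
# along one learned feature direction. Not Anthropic's actual code.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        # Encoder maps activations into a much larger, sparsely active feature
        # space; the decoder's columns are the learned "dictionary" directions.
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, acts: torch.Tensor):
        feature_acts = torch.relu(self.encoder(acts))  # sparse feature activations
        reconstruction = self.decoder(feature_acts)    # approximate original acts
        return feature_acts, reconstruction

def steer(acts: torch.Tensor, sae: SparseAutoencoder,
          feature_idx: int, scale: float) -> torch.Tensor:
    """Nudge activations along one feature's decoder direction (a 'steering vector')."""
    direction = sae.decoder.weight[:, feature_idx]     # shape: (d_model,)
    return acts + scale * direction
```

The open question the comment raises is whether steering along features found this way stays reliable as model capability scales.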
My original intent was to talk about how I shouldn't use information I found online about the other person to try to impress them (e.g., I find out they used to play volleyball, then slip into conversation that I like volleyball). It makes things messy.
What you're talking about is whether one should try to impress their date in general. In that case, my dating heuristic of doing what an emotionally healthy person would do is open to interpretation, but only narrowly. I say 'narrow' because lying to impress someone would be out of scope for the heuristic, whereas your interpretation (trying to authentically impress and attract someone) would, I think most people would agree, be within scope.
Personally, I try to live a diverse lifestyle and let the other person decide on their own if they find me impressive.