Posts

Soft Prompts for Evaluation: Measuring Conditional Distance of Capabilities 2024-02-02T05:49:11.189Z
FAQ: What the heck is goal agnosticism? 2023-10-08T19:11:50.269Z
A plea for more funding shortfall transparency 2023-08-07T21:33:11.912Z
Using predictors in corrigible systems 2023-07-19T22:29:02.742Z
One path to coherence: conditionalization 2023-06-29T01:08:14.527Z
One implementation of regulatory GPU restrictions 2023-06-04T20:34:37.090Z
porby's Shortform 2023-05-24T21:34:26.211Z
Implied "utilities" of simulators are broad, dense, and shallow 2023-03-01T03:23:22.974Z
Instrumentality makes agents agenty 2023-02-21T04:28:57.190Z
How would you use video gamey tech to help with AI safety? 2023-02-09T00:20:34.152Z
Against Boltzmann mesaoptimizers 2023-01-30T02:55:12.041Z
FFMI Gains: A List of Vitalities 2023-01-12T04:48:04.378Z
Simulators, constraints, and goal agnosticism: porbynotes vol. 1 2022-11-23T04:22:25.748Z
Am I secretly excited for AI getting weird? 2022-10-29T22:16:52.592Z
Why I think strong general AI is coming soon 2022-09-28T05:40:38.395Z
Private alignment research sharing and coordination 2022-09-04T00:01:22.337Z

Comments

Comment by porby on Why I think strong general AI is coming soon · 2024-10-27T01:03:34.176Z · LW · GW

Hey, we met at EAGxToronto : )

🙋‍♂️

So my model of progress has allowed me to observe our prosaic scaling without surprise, but it doesn't allow me to make good predictions since the reason for my lack of surprise has been from Vingean prediction of the form "I don't know what progress will look like and neither do you".

This is indeed a locally valid way to escape one form of the claim—without any particular prediction carrying extra weight, and given that reality has to go some way, there isn't much surprise in finding yourself in any given world.

I do think there's value in another version of the word "surprise" here, though. For example: the cross-entropy loss of the predicted distribution with respect to the observed distribution. Holding to a high-uncertainty model of progress will result in continuously high "surprise" in this sense, because it struggles to narrow to a better distribution generator. It's a sort of overdamped epistemological process.
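
To make that version of "surprise" concrete, here's a toy sketch (hypothetical numbers, plain Python) comparing cumulative cross-entropy surprise for a maximally diffuse predictor versus a somewhat sharper one:

```python
import math

# Hypothetical example: two forecasters assign probabilities to which of three
# coarse "worlds" gets observed each year. The diffuse forecaster spreads
# probability evenly; the sharper one concentrates it and is roughly right.
observations = ["fast_progress", "fast_progress", "moderate_progress"]

diffuse = {"fast_progress": 1/3, "moderate_progress": 1/3, "slow_progress": 1/3}
sharper = {"fast_progress": 0.6, "moderate_progress": 0.3, "slow_progress": 0.1}

def total_surprise(predicted, observed):
    # Cross-entropy-style surprise: sum of -log q(observation).
    return sum(-math.log(predicted[o]) for o in observed)

print(total_surprise(diffuse, observations))  # ~3.30 nats
print(total_surprise(sharper, observations))  # ~2.23 nats
```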

I think we have enough information to make decent gearsy models of progress around AI. As a bit of evidence, some such models have already been exploited to make gobs of money. I'm also feeling pretty good[1] about many of my predictions (like this post) that contributed to me pivoting entirely into AI; there's an underlying model that has a bunch of falsifiable consequences which has so far survived a number of iterations, and that model has implications through the development of extreme capability.

What I have been surprised about has been governmental reaction to AI...

Yup! That was a pretty major (and mostly positive) update for me. I didn't have a strong model of government-level action in the space and I defaulted into something pretty pessimistic. My policy/governance model is still lacking the kind of nuance that you only get by being in the relevant rooms, but I've tried to update here as well. That's also part of the reason why I'm doing what I'm doing now.

In any case, I've been hoping for the last few years I would have time to do my undergrad and start working on the alignment without a misaligned AI going RSI, and I'm still hoping for that. So that's lucky I guess. 🍀🐛

May you have the time to solve everything!

  1. ^

     ... epistemically

Comment by porby on My Advice for Incoming SERI MATS Scholars · 2024-07-27T14:26:00.048Z · LW · GW

I've got a fun suite of weird stuff going on[1], so here's a list of sometimes-very-N=1 data:

  1. Napping: I suck at naps. Despite being very tired, I do not fall asleep easily, and if I do fall asleep, it's probably not going to be for just 5-15 minutes. I also tend to wake up with a lot of sleep inertia, so the net effect of naps on alertness across a day tends to be negative. They also tend to destroy my sleep schedule. 
  2. Melatonin: probably the single most noticeable non-stimulant intervention. While I'm by-default very tired all the time, it's still hard to go to sleep. Without mitigation, this usually meant it was nearly impossible to maintain a 24 hour schedule. Melatonin helps a lot with going to sleep and mostly pauses the forward march (unless I mess up).[2]
  3. Light therapy: subtle, but seems to have an effect. It's more obvious when comparing 'effectively being in a cave' with 'being exposed to a large amount of direct sunlight.' I did notice that, when stacked on everything else, the period where I tried light therapy[3] was the first time I was able to intentionally wake up earlier over the course of several days.
  4. Avoiding excessive light near bed: pretty obviously useful. I've used blue-blocking glasses with some effect, though it's definitely better to just not be exposed to too much light in the first place. I reduce monitor brightness to the minimum if I'm on the computer within 3-5 hours of sleep.
  5. Consistent sleep schedule: high impact, if I can manage it. Having my circadian rhythm fall out of entrainment was a significant contributor[4] to my historical sleeping 10-12 hours a day.[5]
  6. Going to bed earlier: conditioning on waking up with no alarm, sleep duration was not correlated with daytime alertness for me according to my sleep logs. Going to bed early enough such that most of my sleep was at night was correlated.[6]
  7. CPAP: Fiddled with one off-prescription for a while since I had access to one and it was cheaper than testing for sleep apnea otherwise. No effect.[7]
  8. Nose strips: hard to measure impact on sleep quality, but subjectively nice! My nosetubes are on the small side, I guess.
  9. Changing detergents/pillows: I seem to react to some detergents, softeners, dust, and stuff along those lines. It's very obvious when I don't double-rinse my pillowcases; my nose swells up to uselessness.
  10. Sleeping room temperature: 62-66F is nice. 72F is not nice. 80F+ is torture.[8]
  11. Watercooled beds: I tried products like eight sleep for a while. If you don't have the ability to reduce the air temperature and humidity to ideal levels, it's worth it, but there is a comfort penalty. It doesn't feel like lying on a fresh and pleasantly cool sheet; it's weirdly like lying on somewhat damp sheets that never dry.[9] Way better than nothing, but way worse than a good sleeping environment.[10]
  12. Breathable bedding: surprisingly noticeable. I bought some wirecutter-reviewed cotton percale sheets and a latex mattress. I do like the latex mattress, but I think the sheets have a bigger effect. Don't have data on whether it meaningfully changed sleep quality, but it is nice.
  13. Caffeine: pretty standard. Helps a bit. Not as strong as prescription stimulants at reasonable dosages, can't use it every day without the effect diminishing very noticeably. And without tolerance, consuming it much later than immediately after getting out of bed disrupts my sleep the next night. I tend to drink some coffee in the morning on days where I don't take other stimulants to make the mornings suck less.
  14. Protriptyline: sometimes useful, but a very bad time for me. Got pretty much all the side effects, including the "stop taking and talk to your doctor immediately" kind and uncomfortably close to the "go to a hospital" kind.[11]
  15. Modafinil: alas, no significant effect. Maybe slightly clumsier, maybe slightly longer sleep, maybe slightly more tired. Best guess is that it interfered with my sleep a little bit.
  16. Ritalin: Works! I use a low dose (12.5 mg/day) of the immediate release generic. Pretty short half-life, but that's actually nice for being able to go to sleep. I often cut pills in half to manually allocate alertness more smoothly. I can also elect to just not take it before a plane flight or on days where being snoozey isn't a big problem.
  17. Stimulant juggling/off days: very hard to tell if there's an effect on tolerance with N=1 for non-caffeine stimulants at low therapeutic dosages. I usually do ~5 ritalin days and ~2 caffeine days a week, and I can say that ritalin does still obviously work after several years.[12]
  18. Creatine: I don't notice any sleep/alertness effect, though some people report it. I use it primarily for fitness reasons.[13]
  19. Exercise: hard to measure impact on alertness. Probably some long-term benefit, but if I overdo it on any given day, it's easy to ruin myself. I exercise a bit every day to try to avoid getting obliterated.[14]
  20. Cytomel: this is a weird one that I don't think will be useful to anyone reading this. It turns out that, while my TSH and T4 levels are normal, my untreated T3 levels are very low for still-unclear reasons. I had symptoms of hypothyroidism for decades, but it took until my late 20's to figure out why. Hypothyroidism isn't the same thing as a sleep disorder, but stacking fatigue on a sleep disorder isn't fun.[15]
  21. Meal timing: another weird one. I've always had an unusual tendency towards hypoglycemic symptoms.[16] In its milder form, this comes with severe fatigue that can seem a bit like sleepiness if you squint. As of a few weeks ago with the help of a continuous glucose monitor, I finally confirmed I've got some very wonky blood sugar behavior despite a normal A1C; one notable bit is a pattern of reactive hypoglycemia. I can't avoid hypoglycemia during exercise by e.g. drinking chocolate milk beforehand. I've actually managed to induce mild hypoglycemia by eating a cinnamon roll pancake (and not exercising). Exercising without food actually works a bit better, though I do still have to be careful about the intensity * duration.

I'm probably forgetting some stuff.

  1. ^

    "Idiopathic hypersomnia"-with-a-shrug was the sleep doctor's best guess on the sleep side, plus a weirdo kind of hypothyroidism, plus HEDs, plus something strange going on with blood sugar regulation, plus some other miscellaneous and probably autoimmune related nonsense.

  2. ^

    I tend to take 300 mcg about 2-5 hours before my target bedtime to help with entrainment, then another 300-600 mcg closer to bedtime for the sleepiness promoting effect. 

  3. ^

    In the form of luminette glasses. I wouldn't say they have a great user experience; it's easy to get a headache and the nose doohicky broke almost immediately. That's part of why I didn't keep using them, but I may try again.

  4. ^

    But far from sole!

  5. ^

    While still being tired enough during the day to hallucinate on occasion.

  6. ^

    Implementing this and maintaining sleep consistency functionally requires other interventions. Without melatonin etc., my schedule free-runs mercilessly.

  7. ^

    Given that I was doing this independently, I can't guarantee that Proper Doctor-Supervised CPAP Usage wouldn't do something, but I doubt it. I also monitored myself overnight with a camera. I do a lot of acrobatics, but there was no sign of apneas or otherwise distressed breathing.

  8. ^

    When I was younger, I would frequently ask my parents to drop the thermostat down at night because we lived in one of those climates where the air can kill you if you go outside at the wrong time for too long. They were willing to go down to around 73F at night. My room was east-facing, theirs was west-facing. Unbeknownst to me, there was also a gap between the floor and wall that opened directly into the attic. That space was also uninsulated. Great times.

  9. ^

    It wasn't leaking!

  10. ^

     The cooling is most noticeable at pressure points, so there's a very uneven effect. Parts of your body can feel uncomfortably cold while you're still sweating from the air temperature and humidity.

  11. ^

    The "hmm my heart really isn't working right" issues were bad, but it also included some spooky brain-hijacky mental effects. Genuinely not sure I would have survived six months on it even with total awareness that it was entirely caused by the medication and would stop if I stopped taking it. I had spent some years severely depressed when I was younger, but this was the first time I viscerally understood how a person might opt out... despite being perfectly fine 48 hours earlier.

  12. ^

    I'd say it dropped a little in efficacy in the first week or two, maybe, but not by much, and then leveled out. Does the juggling contribute to this efficacy? No idea. Caffeine and ritalin both have dopaminergic effects, so there's probably a little mutual tolerance on that mechanism, but they do have some differences.

  13. ^

    Effect is still subtle, but creatine is one of the only supplements that has strong evidence that it does anything.

  14. ^

    Beyond the usual health/aesthetic reasons for exercising, I also have to compensate for joint loosey-gooseyness related to doctor-suspected hEDS. Even now, I can easily pull my shoulders out of socket, and last week discovered that (with the help of some post-covid-related joint inflammation), my knees still do the thing where they slip out of alignment mid-step and when I put weight back on them, various bits of soft tissue get crushed. Much better than it used to be; when I was ~18, there were many days where walking was uncomfortable or actively painful due to a combination of ankle, knee, hip, and back pain.

  15. ^

    Interesting note: my first ~8 years of exercise before starting cytomel, including deliberate training for the deadlift, saw me plateau at a 1 rep max on deadlift of... around 155 pounds. (I'm a bit-under-6'4" male. This is very low, like "are you sure you're even exercising" low. I was, in fact, exercising, and sometimes at an excessive level of intensity. I blacked out mid-rep once; do not recommend.)

    Upon starting cytomel, my strength increased by around 30% within 3 months. Each subsequent dosage increase was followed by similar strength increases. Cytomel is not an anabolic steroid and does not have anabolic effects in healthy individuals.

    I'm still no professional powerlifter, but I'm now at least above average within the actively-lifting population of my size. The fact that I "wasted" so many years of exercise was... annoying.

  16. ^

    Going too long without food or doing a little too much exercise is a good way for me to enter a mild grinding hypoglycemic state. More severely, when I went a little too far with intense exercise, I ended up on the floor unable to move while barely holding onto consciousness.

Comment by porby on Does reducing the amount of RL for a given capability level make AI safer? · 2024-05-09T03:13:42.305Z · LW · GW

But I disagree that there’s no possible RL system in between those extremes where you can have it both ways.

I don't disagree. For clarity, I would make these claims, and I do not think they are in tension:

  1. Something being called "RL" alone is not the relevant question for risk. It's how much space the optimizer has to roam.
  2. MuZero-like strategies are free to explore more space than something like current applications of RLHF. Improved versions of these systems working in more general environments have the capacity to do surprising things and will tend to be less 'bound' in expectation than RLHF. Because of that extra space, these approaches are more concerning in a fully general and open-ended environment.
  3. MuZero-like strategies remain very distant from a brute-forced policy search, and that difference matters a lot in practice.
  4. Regardless of the category of the technique, safe use requires understanding the scope of its optimization. This is not the same as knowing what specific strategies it will use. For example, despite finding unforeseen strategies, you can reasonably claim that MuZero (in its original form and application) will not be deceptively aligned to its task.
  5. Not all applications of tractable RL-like algorithms are safe or wise.
  6. There do exist safe applications of RL-like algorithms.

Comment by porby on Does reducing the amount of RL for a given capability level make AI safer? · 2024-05-08T02:57:11.036Z · LW · GW

It does still apply, though what 'it' is here is a bit subtle. To be clear, I am not claiming that a technique that is reasonably describable as RL can't reach extreme capability in an open-ended environment.

The precondition I included is important:

in the absence of sufficient environmental structure, reward shaping, or other sources of optimizer guidance, it is nearly impossible for any computationally tractable optimizer to find any implementation for a sparse/distant reward function

In my frame, the potential future techniques you mention are forms of optimizer guidance. Again, that doesn't make them "fake RL," I just mean that they are not doing a truly unconstrained search, and I assert that this matters a lot.

For example, take the earlier example of a hypercomputer that brute forces all bitstrings corresponding to policies and evaluates them to find the optimum with no further guidance required. Compare the solution space for that system to something that incrementally explores in directions guided by, e.g., a strong future LLM or something similar. The RL system guided by a strong future LLM might achieve superhuman capability in open-ended domains, but the solution space is still strongly shaped by the structure available to the optimizer during training, and it is possible to make much better guesses about where the optimizer will go at various points in its training.

It's a spectrum. On one extreme, you have the universal-prior-like hypercomputer enumeration. On the other, stuff like supervised predictive training. In the middle, stuff like MuZero, but I argue MuZero (or its more open-ended future variants) is closer to the supervised side of things than the hypercomputer side of things in terms of how structured the optimizer's search is. The closer a training scheme is to the hypercomputer one in terms of a lack of optimizer guidance, the less likely it is that training will do anything at all in a finite amount of compute.

Comment by porby on Does reducing the amount of RL for a given capability level make AI safer? · 2024-05-07T03:32:54.209Z · LW · GW

Calling MuZero RL makes sense. The scare quotes are not meant to imply that it's not "real" RL, but rather that the category of RL is broad enough that belonging to it does not constrain expectations much in the relevant way. The thing that actually matters is how much the optimizer can roam in ways that are inconsistent with the design intent.

For example, MuZero can explore the superhuman play space during training, but it is guided by the structure of the game and how it is modeled. Because of that structure, we can be quite confident that the optimizer isn't going to wander down a path to general superintelligence with strong preferences about paperclips.

Comment by porby on Does reducing the amount of RL for a given capability level make AI safer? · 2024-05-05T21:50:09.844Z · LW · GW

I do think that if you found a zero-RL path to the same (or better) endpoint, it would often imply that you've grasped something about the problem more deeply, and that would often imply greater safety.

Some applications of RL are also just worse than equivalent options. As a trivial example, using reward sampling to construct a gradient to match a supervised loss gradient is adding a bunch of clearly-pointless intermediate steps.

I suspect there are less trivial cases, like how a decision transformer isn't just learning an optimal policy for its dataset but rather a supertask: what different levels of performance look like on that task. By subsuming an RL-ish task in prediction, the predictor can/must develop a broader understanding of the task, and that understanding can interact with other parts of the greater model. While I can't currently point to strong empirical evidence here, my intuition would be that certain kinds of behavioral collapse would be avoided by the RL-via-predictor because the distribution is far more explicitly maintained during training.[1][2]

But there are often reasons why the more-RL-shaped thing is currently being used. It's not always trivial to swap over to something with some potential theoretical benefits when training at scale. So long as the RL-ish stuff fits within some reasonable bounds, I'm pretty okay with it and would treat it as a sufficiently low probability threat that you would want to be very careful about how you replaced it, because the alternative might be sneakily worse.[3]

  1. ^

    KL divergence penalties are one thing, but it's hard to do better than the loss directly forcing adherence to the distribution.

  2. ^

    You can also make a far more direct argument about model-level goal agnosticism in the context of prediction.

  3. ^

    I don't think this is likely, to be clear. They're just both pretty low probability concerns (provided the optimization space is well-constrained).

Comment by porby on Does reducing the amount of RL for a given capability level make AI safer? · 2024-05-05T18:24:02.860Z · LW · GW

"RL" is a wide umbrella. In principle, you could even train a model with RL such that the gradients match supervised learning. "Avoid RL" is not the most directly specified path to the-thing-we-actually-want.

The source of spookiness

Consider two opposite extremes:

  1. A sparse, distant reward function. A biped must successfully climb a mountain 15 kilometers to the east before getting any reward at all.
  2. A densely shaped reward function. At every step during the climb up the mountain, there is a reward designed to induce gradients that maximize training performance. Every slight mispositioning of a toe is considered.

Clearly, number 2 is going to be easier to train, but it also constrains the solution space for the policy.
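To make the contrast concrete, here's roughly what the two extremes might look like as reward functions. Everything here (names, state fields, coefficients) is made up purely for illustration:

```python
# Hypothetical reward functions for the mountain-climbing biped.
# `state` is assumed to expose an eastward position in meters plus some
# gait-quality diagnostics; all names here are invented for illustration.

GOAL_EAST_METERS = 15_000

def sparse_distant_reward(state) -> float:
    # Extreme 1: nothing at all until the summit, 15 km to the east.
    return 1.0 if state.east_position >= GOAL_EAST_METERS else 0.0

def densely_shaped_reward(state, previous_state) -> float:
    # Extreme 2: graded feedback at every step, down to toe placement.
    progress = state.east_position - previous_state.east_position
    upright_bonus = 1.0 - abs(state.torso_tilt_radians)
    toe_penalty = 0.1 * state.toe_misplacement_error
    return progress + 0.01 * upright_bonus - toe_penalty
```
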

If number 1 somehow successfully trained, what's the probability that the solution it found would look like number 2's imitation data? What's the probability it would look anything like a bipedal gait? What's the probability it just exploits the physics simulation to launch itself across the world?

If you condition on a sparse, distant reward function training successfully, you should expect the implementation found by the optimizer to sample from a wide distribution of possible implementations that are compatible with the training environment.

It is sometimes difficult to predict what implementations are compatible with the environment. The more degrees of freedom exist in the environment, the more room the optimizer has to roam. That's where the spookiness comes from.

Is RL therefore spooky?

RL appears to make this spookiness more accessible. It's difficult to use (un)supervised learning in a way that gives a model great freedom of implementation; it's usually learning from a large suite of examples.

But there's a major constraint on RL: in the absence of sufficient environmental structure, reward shaping, or other sources of optimizer guidance, it is nearly impossible for any computationally tractable optimizer to find any implementation for a sparse/distant reward function. It simply won't sample the reward often enough to produce useful gradients.[1]

In other words, practical applications of RL are computationally bounded to a pretty limited degree of reward sparsity/distance. All the examples of "RL" doing interesting things that look like they involve sparse/distant reward involve enormous amounts of implicit structure of various kinds, like powerful world models.[2] 

Given these limitations, the added implementation-uncertainty of RL is usually not so massive that it's worth entirely banning it. Do be careful about what you're actually reinforcing, just as you must be careful with prompts or anything else, and if you somehow figure out a way to make from-scratch sparse/distant rewards work better without a hypercomputer, uh, be careful?

A note on offline versus online RL

The above implicitly assumes online RL, where the policy is able to learn from new data generated by the policy as it interacts with the environment.

Offline RL that learns from an immutable set of data does not allow the optimizer as much room to explore, and many of the apparent risks of RL are far less accessible.

Usage in practice

The important thing is that the artifact produced by a given optimization process falls within some acceptable bounds. Those bounds might arise from the environment, computability, or something else, but they're often available.

RL-as-it-can-actually-be-applied isn't that special here. The one suggestion I'd have is to try to use it in a principled way. For example: doing pretraining but inserting an additional RL-derived gradient to incentivize particular behaviors works, but it's just arbitrarily shoving a bias/precondition into the training. The result will be at some equilibrium between the pretraining influence and the RL influence. Perhaps the weighting could be chosen in an intentional way, but most such approaches are just ad hoc.

For comparison, you could elicit similar behavior by including a condition metatoken in the prompt (see decision transformers for an example). With that structure, you can be more explicit about what exactly the condition token is supposed to represent, and you can do fancy interpretability techniques to see what the condition is actually causing mechanistically.[3]
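
A rough sketch of what that looks like at the data level; the token names and bucketing are made up for illustration rather than taken from any particular codebase:

```python
# Hypothetical data formatting for metatoken conditioning, decision-transformer
# style: prepend a token stating the performance/property level the trajectory
# actually exhibited, then train with ordinary prediction loss.
def format_training_sequence(trajectory_tokens, observed_score):
    # Bucket the observed outcome into a discrete condition metatoken.
    if observed_score > 0.9:
        condition = "<expert_play>"
    elif observed_score > 0.5:
        condition = "<decent_play>"
    else:
        condition = "<weak_play>"
    return [condition] + trajectory_tokens

# At inference time, pick the condition you want and let the predictor
# fill in behavior consistent with it:
prompt = ["<expert_play>", "<obs_0>", "<act_0>", "<obs_1>"]
```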

  1. ^

    If you could enumerate all possible policies with a hypercomputer and choose the one that performs the best on the specified reward function, that would train, and it would also cause infinite cosmic horror. If you have a hypercomputer, don't do that.

  2. ^

    Or in the case of RLHF on LLMs, the fine-tuning process is effectively just etching a precondition into the predictor, not building complex new functions. Current LLMs, being approximators of probabilistic inference to start with, have lots of very accessible machinery for this kind of conditioning process.

  3. ^

    There are other options here, but I find this implementation intuitive.

Comment by porby on List your AI X-Risk cruxes! · 2024-04-28T21:19:47.593Z · LW · GW

Stated as claims that I'd endorse with pretty high, but not certain, confidence:

  1. There exist architectures/training paradigms within 3-5 incremental insights of current ones that directly address most incapabilities observed in LLM-like systems. (85%; if false, my median strong AI estimate would jump by a few years, p(doom) effect would vary depending on how it was falsified)
  2. It is not an accident that the strongest artificial reasoners we have arose from something like predictive pretraining. In complex and high dimensional problem spaces like general reasoning, successful training will continue to depend on schemes with densely informative gradients that can constrain the expected shape of the training artifact. In those problem spaces, training that is roughly equivalent to sparse/distant reward in naive from-scratch RL will continue to mostly fail.[1] (90%; if false, my p(doom) would jump a lot)
  3. Related to, and partially downstream of, #2: the strongest models at the frontier of AGI will continue to be remarkably corrigible (in the intuitive colloquial use of the word, but not strictly MIRI's use). That is, the artifact produced by pretraining and non-malicious fine tuning will not be autonomously doomseeking even if it has the capability. (A bit less than 90%; this being false would also jump by p(doom) by a lot)
  4. Creating agents out of these models is easy and will get easier. Most of the failures in current agentic applications are not fundamental, and many are related to #1. There are no good ways to stop a weights-available model from, in principle, being used as a potentially dangerous agent, and outcome variance will increase as capabilities increase. (95%; I'm not even sure what the shape of this being false would be, but if there was a solution, it'd drop my current p(doom) by at least half)
  5. Scale is sufficient to bypass the need for some insights. While a total lack of insights would make true ASI difficult to reach in the next few years, the hardware and scale of 2040 is very likely enough to do it the dumb way, and physics won't get in the way soon enough. (92%; falsification would make the tail of my timelines longer. #1 and #5 being falsified together could jump my median by 10+ years.)
  6. We don't have good plans for how to handle a transition period involving widely available high-capability systems, even assuming that those high-capability systems are only dangerous when intentionally aimed in a dangerous direction.[2] It looks an awful lot like we're stuck with usually-reactive muddling, and maybe some pretty scary sounding defensive superintelligence propositions. (75%; I'm quite ignorant of governance and how international coordination could actually work here, but it sure seems hard. If this ends up being easy, it would also drop my p(doom) a lot.)

  1. ^

    Note that this is not a claim that something like RLHF is somehow impossible. RLHF, and other RL-adjacent techniques that have reward-equivalents that would never realistically train from scratch, get to select from the capabilities already induced by pretraining. Note that many 'strong' RL-adjacent techniques involve some form of big world model, operate in some constrained environment, or otherwise have some structure to work with that makes it possible for the optimizer to take useful incremental steps.

  2. ^

    One simple story of many, many possible stories:

    1. It's 20XY. Country has no nukes but wants second strike capacity.

    2. Nukes are kinda hard to get. Open-weights superintelligences can be downloaded.

    3. Country fine-tunes a superintelligence to be an existential threat to everyone else that is activated upon Country being destroyed.

    4. Coordination failures occur; Country gets nuked or invaded in a manner sufficient to trigger second strike.

    5. There's a malign superintelligence actively trying to kill everyone, and no technical alignment failures occurred. Everything AI-related worked exactly as its human designers intended.

Comment by porby on porby's Shortform · 2024-02-09T18:29:45.391Z · LW · GW

Yup, exactly the same experience here.

Comment by porby on porby's Shortform · 2024-02-06T22:53:42.556Z · LW · GW

Has there been any work on the scaling laws of out-of-distribution capability/behavior decay?

A simple example:

  1. Simultaneously train task A and task B for N steps.
  2. Stop training task B, but continue to evaluate the performance of both A and B.
  3. Observe how rapidly task B performance degrades.

Repeat across scale and regularization strategies.
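
In code, the skeleton would look something like this, with `model`, `train_step`, `evaluate`, and `batch_from` standing in for whatever actual training setup gets used:

```python
# Sketch of the experiment: train A and B jointly for N steps, then drop B
# from training while continuing to evaluate both. All helpers here are
# placeholders for a real training/eval setup.
def capability_decay_experiment(model, task_a, task_b, joint_steps, decay_steps):
    history = []
    for _ in range(joint_steps):
        train_step(model, batch_from(task_a))
        train_step(model, batch_from(task_b))
    for step in range(decay_steps):
        train_step(model, batch_from(task_a))   # task B is no longer trained
        if step % 1000 == 0:
            history.append({
                "step": joint_steps + step,
                "task_a_loss": evaluate(model, task_a),
                "task_b_loss": evaluate(model, task_b),  # watch this curve decay
            })
    return history
```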

Would be nice to also investigate different task types. For example, tasks with varying degrees of implied overlap in underlying mechanisms (like #2).

I've previously done some of these experiments privately, but not with nearly the compute necessary for an interesting result.

The sleeper agents paper reminded me of it. I would love to see what happens on a closer-to-frontier model that's intentionally backdoored, and then subjected to continued pretraining. Can a backdoor persist for another trillion tokens of nonadversarial-but-extremely-broad training? Does that vary across scale etc?

I'd also like to intentionally find the circumstances that maximize the persistence of out of distribution capabilities not implied by the current training distribution.

Seems like identifying a robust trend here would have pretty important Implications, whichever direction it points.

Comment by porby on porby's Shortform · 2024-02-02T19:37:22.778Z · LW · GW

A further extension and elaboration on one of the experiments in the linkpost:
Pitting execution fine-tuning against input fine-tuning also provides a path to measuring the strength of soft prompts in eliciting target behaviors. If execution fine-tuning "wins" and manages to produce a behavior in some part of input space that soft prompts cannot elicit, it would be a major blow to the idea that soft prompts are useful for dangerous evaluations.

On the flip side, if ensembles of large soft prompts with some hyperparameter tuning always win (e.g. execution fine-tuning cannot introduce any behaviors accessible by any region of input space without soft prompts also eliciting them), then they're a more trustworthy evaluation in practice.

Comment by porby on porby's Shortform · 2024-02-02T19:31:35.874Z · LW · GW

Having escaped infinite overtime associated with getting the paper done, I'm now going back and catching up on some stuff I couldn't dive into before.

Going through the sleeper agents paper, it appears that one path—adversarially eliciting candidate backdoor behavior—is hampered by the weakness of the elicitation process. Or in other words, there exist easily accessible input conditions that trigger unwanted behavior that LLM-driven adversarial training can't identify.

I alluded to this in the paper linkpost, but soft prompts are a very simple and very strong option for this. There remains a difficulty in figuring out what unwanted behavior to adversarially elicit, but this is an area that has a lot of low hanging fruit.

I'd also be interested in how more brute force interventions, like autoregressively detuning a backdoored model with a large soft prompt for a very large dataset (or an adversarially chosen anti-backdoor dataset), compare to the other SFT/RL interventions. Activation steering, too; I'm currently guessing activation-based interventions are the cheapest for this sort of thing.

Comment by porby on Soft Prompts for Evaluation: Measuring Conditional Distance of Capabilities · 2024-02-02T05:51:30.408Z · LW · GW

By the way: I just got into San Francisco for EAG, so if anyone's around and wants to chat, feel free to get in touch on swapcard (or if you're not in the conference, perhaps a DM)! I fly out on the 8th.

Comment by porby on Why I think strong general AI is coming soon · 2023-12-16T23:21:49.918Z · LW · GW

It's been over a year since the original post and 7 months since the openphil revision.

A top level summary:

  1. My estimates for timelines are pretty much the same as they were.
  2. My P(doom) has gone down overall (to about 30%), and the nature of the doom has shifted (misuse, broadly construed, dominates).

And, while I don't think this is the most surprising outcome nor the most critical detail, it's probably worth pointing out some context. From NVIDIA:

In two quarters, from Q1 FY24 to Q3 FY24, datacenter revenues went from $4.28B to $14.51B.

From the post:

In 3 years, if NVIDIA's production increases another 5x ...

Revenue isn't a perfect proxy for shipped compute, but I think it's safe to say we've entered a period of extreme interest in compute acquisition. "5x" in 3 years seems conservative.[1] I doubt the B100 is going to slow this curve down, and competitors aren't idle: AMD's MI300X is within striking distance, and even Intel's Gaudi 2 has promising results.

Chip manufacturing remains a bottleneck, but it's a bottleneck that's widening as fast as it can to catch up to absurd demand. It may still be bottlenecked in 5 years, but not at the same level of production.

On the difficulty of intelligence

I'm torn about the "too much intelligence within bounds" stuff. On one hand, I think it points towards the most important batch of insights in the post, but on the other hand, it ends with an unsatisfying "there's more important stuff here! I can't talk about it but trust me bro!"

I'm not sure what to do about this. The best arguments and evidence are things that fall into the bucket of "probably don't talk about this in public out of an abundance of caution." It's not one weird trick to explode the world, but it's not completely benign either.

Continued research and private conversations haven't made me less concerned. I do know there are some other people who are worried about similar things, but it's unclear how widely understood it is, or whether someone has a strong argument against it that I don't know about.

So, while unsatisfying, I'd still assert that there are highly accessible paths to broadly superhuman capability on short timescales. Little of my forecast's variance arises from uncertainty on this point; it's mostly a question of when certain things are invented, adopted, and then deployed at sufficient scale. Sequential human effort is a big chunk; there are video games that took less time to build than the gap between this post's original publication date and its median estimate of 2030.

On doom

When originally writing this, my model of how capabilities would develop was far less defined, and my doom-model was necessarily more generic.

A brief summary would be:

  1. We have a means of reaching extreme levels of capability without necessarily exhibiting preferences over external world states. You can elicit such preferences, but a random output sequence from the pretrained version of GPT-N (assuming the requisite architectural similarities) has no realistic chance of being a strong optimizer with respect to world states. The model itself remains a strong optimizer, just for something that doesn't route through the world.
  2. It's remarkably easy to elicit this form of extreme capability to guide itself. This isn't some incidental detail; it arises from the core process that the model learned to implement.
  3. That core process is learned reliably because the training process that yielded it leaves no room for anything else. It's not a sparse/distant reward target; it is a profoundly constraining and informative target.

I've written more on the nice properties of some of these architectures elsewhere. I'm in the process of writing up a complementary post on why I think these properties (and using them properly) are an attractor in capabilities, and further, why I think some of the x-riskiest forms of optimization process are actively repulsive for capabilities. This requires some justification, but alas, the post will have to wait some number of weeks in the queue behind a research project.

The source of the doom-update is the correction of some hidden assumptions in my doom model. My original model was downstream of agent foundations-y models, but naive. It followed a process: set up a framework, make internally coherent arguments within that framework, observe highly concerning results, then neglect to notice where the framework didn't apply.

Specifically, some of the arguments feeding into my doom model were covertly replacing instances of optimizers with hypercomputer-based optimizers[2], because hey, once you've got an optimizer and you don't know any bounds on it, you probably shouldn't assume it'll just turn out convenient for you, and hypercomputer-optimizers are the least convenient.

For example, this part:

Is that enough to start deeply modeling internal agents and other phenomena concerning for safety?

And this part:

AGI probably isn't going to suffer from these issues as much. Building an oracle is probably still worth it to a company even if it takes 10 seconds for it to respond, and it's still worth it if you have to double check its answers (up until oops dead, anyway).

With no justification, I imported deceptive mesaoptimizers and other "unbound" threats. Under the earlier model, this seemed natural.

I now think there are bounds on pretty much all relevant optimizing processes up and down the stack from the structure of learned mesaoptimizers to the whole capability-seeking industry. Those bounds necessarily chop off large chunks of optimizer-derived doom; many outcomes that previously seemed convergent to me now seem extremely hard to access.

As a result, "technical safety failure causes existential catastrophe" dropped in probability by around 75-90%, down to something like 5%-ish.[3]

I'm still not sure how to navigate a world with lots of extremely strong AIs. As capability increases, outcome variance increases. With no mitigations, more and more organizations (or, eventually, individuals) will have access to destabilizing systems, and they would amplify any hostile competitive dynamics.[4] The "pivotal act" frame gets imported even if none of the systems are independently dangerous.

I've got hope that my expected path of capabilities opens the door for more incremental interventions, but there's a reason my total P(doom) hasn't yet dropped much below 30%.

  1. ^

    The reason why this isn't an update for me is that I was being deliberately conservative at the time.

  2. ^

    A hypercomputer-empowered optimizer can jump to the global optimum with brute force. There isn't some mild greedy search to be incrementally shaped; if your specification is even slightly wrong in a sufficiently complex space, the natural and default result of a hypercomputer-optimizer is infinite cosmic horror.

  3. ^

    It's sometimes tricky to draw a line between "oh this was a technical alignment failure that yielded an AI-derived catastrophe, as opposed to someone using it wrong," so it's hard to pin down the constituent probabilities.

  4. ^

    While strong AI introduces all sorts of new threats, its generality amplifies "conventional" threats like war, nukes, and biorisk, too. This could create civilizational problems even before a single AI could, in principle, disempower humanity.

Comment by porby on AI Views Snapshots · 2023-12-13T23:41:22.618Z · LW · GW

Mine:

My answer to "If AI wipes out humanity and colonizes the universe itself, the future will go about as well as if humanity had survived (or better)" is pretty much defined by how the question is interpreted. It could swing pretty wildly, but the obvious interpretation seems ~tautologically bad.

Comment by porby on porby's Shortform · 2023-12-13T20:49:11.875Z · LW · GW

I sometimes post experiment ideas on my shortform. If you see one that seems exciting and you want to try it, great! Please send me a message so we can coordinate and avoid doing redundant work.

Comment by porby on Suggestions for net positive LLM research · 2023-12-13T20:45:52.295Z · LW · GW

I'm accumulating a to-do list of experiments much faster than my ability to complete them:

  1. Characterizing fine-tuning effects with feature dictionaries
  2. Toy-scale automated neural network decompilation (difficult to scale)
  3. Trying to understand evolution of internal representational features across blocks by throwing constraints at it 
  4. Using soft prompts as a proxy measure of informational distance between models/conditions and behaviors (see note below)
  5. Prompt retrodiction for interpreting fine tuning, with more difficult extension for activation matching
  6. Miscellaneous bunch of experiments

If you wanted to take one of these and run with it or a variant, I wouldn't mind!

The unifying theme behind many of these is goal agnosticism: understanding it, verifying it, maintaining it, and using it.

Note: I've already started some of these experiments, and I will very likely start others soon. If you (or anyone reading this, for that matter) see something you'd like to try, we should chat to avoid doing redundant work. I currently expect to focus on #4 for the next handful of weeks, so that one is probably at the highest risk of redundancy.

Further note: I haven't done a deep dive on all relevant literature; it could be that some of these have already been done somewhere!  (If anyone happens to know of prior art for any of these, please let me know.)

Comment by porby on porby's Shortform · 2023-12-11T02:55:35.597Z · LW · GW

Retrodicting prompts can be useful for interpretability when dealing with conditions that aren't natively human readable (like implicit conditions induced by activation steering, or optimized conditions from soft prompts). Take an observed completion and generate the prompt that created it.

What does a prompt retrodictor look like?

Generating a large training set of soft prompts to directly reverse would be expensive. Fortunately, there's nothing special in principle about soft prompts with regard to their impact on conditioning predictions.

Just take large traditional text datasets. Feed the model a chunk of the string. Train on the prediction of tokens before the chunk.

Two obvious approaches:

  1. Special case of infilling. Stick to a purely autoregressive training mode, but train the model to fill a gap autoregressively. In other words, the sequence would be: 
    [Prefix token][Prefix sequence][Suffix token][Suffix sequence][Middle token][Middle sequence][Termination token]
    Or, as the paper points out: 
    [Suffix token][Suffix sequence][Prefix token][Prefix sequence][Middle sequence][Termination token]
    Nothing stops the prefix sequence from having zero length.
  2. Could also specialize training for just previous prediction: 
    [Prompt chunk]["Now predict the previous" token][Predicted previous chunk, in reverse]

But we don't just want some plausible previous prompts, we want the ones that most precisely match the effect on the suffix's activations.

This is trickier. Specifying the optimization target is easy enough: retrodict a prompt that minimizes MSE((activations | sourcePrompt), (activations | retrodictedPrompt)), where (activations | sourcePrompt) are provided. Transforming that into a reward for RL is one option. Collapsing the output distribution into a token is a problem; there's no way to directly propagate the gradient through that collapse and into the original distribution. Without that differentiable connection, analytically computing gradients for the other token options becomes expensive and turns into a question of sampling strategies. Maybe there's something clever floating around.
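
For the objective itself, a sketch (assuming a Hugging Face-style causal LM that returns hidden states; treat the details as illustrative rather than tested):

```python
import torch

def activation_match_score(model, source_ids, candidate_ids, suffix_ids):
    # Reward for a retrodicted prompt: negative MSE between the suffix's hidden
    # states under the original (source) prompt and under the candidate
    # retrodicted prompt. Assumes a model that accepts input_ids and returns
    # hidden_states when called with output_hidden_states=True.
    def suffix_hidden_states(prefix_ids):
        ids = torch.cat([prefix_ids, suffix_ids], dim=-1).unsqueeze(0)
        out = model(ids, output_hidden_states=True)
        # Keep only the positions corresponding to the suffix.
        return torch.stack(out.hidden_states)[:, :, -suffix_ids.shape[-1]:, :]

    with torch.no_grad():
        target = suffix_hidden_states(source_ids)
        candidate = suffix_hidden_states(candidate_ids)
    return -torch.mean((target - candidate) ** 2)  # higher reward is better
```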

Note that retrodicting with an activation objective has some downsides:

  1. If the retrodictor's the same model as the predictor, there are some weird feedback loops. The activations become a moving target.
  2. Targeting activations makes the retrodictor model-specific. Without targeting activations, the retrodictor could work for any model in principle.
  3. While the outputs remain constrained to token distributions, the natural endpoint for retrodiction on activations is not necessarily coherent natural language. Adversarially optimizing for tokens which produce a particular activation may go weird places. It'll likely still have some kind of interpretable "vibe," assuming the model isn't too aggressively exploitable.

This class of experiment is expensive for natural language models. I'm not sure how interesting it is at scales realistically trainable on a couple of 4090s.

Comment by porby on porby's Shortform · 2023-12-11T00:04:41.219Z · LW · GW

Another potentially useful metric in the space of "fragility," expanding on #4 above:

The degree to which small perturbations in soft prompt embeddings yield large changes in behavior can be quantified. Perturbations combined with sampling the gradient with respect to some behavioral loss suffices.
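
As a sketch of the measurement (assuming you have the soft prompt embeddings in hand and a behavioral loss that maps them to a scalar, with everything else held fixed):

```python
import torch

def representational_fragility(behavioral_loss, soft_prompt, n_samples=32, sigma=1e-2):
    # Mean gradient norm of a behavioral loss at small random perturbations of
    # the soft prompt embeddings. Larger values suggest small nudges in the
    # representation can blow up intent. `behavioral_loss` is assumed to take a
    # soft prompt tensor and return a scalar loss.
    norms = []
    for _ in range(n_samples):
        perturbed = soft_prompt.detach() + sigma * torch.randn_like(soft_prompt)
        perturbed.requires_grad_(True)
        (grad,) = torch.autograd.grad(behavioral_loss(perturbed), perturbed)
        norms.append(grad.norm().item())
    return sum(norms) / len(norms)
```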

This can be thought of as a kind of internal representational fragility. High internal representational fragility would imply that small nudges in the representation can blow up intent.

Does internal representational fragility correlate with other notions of "fragility," like the information-required-to-induce-behavior "fragility" in the other subthread about #6? In other words, does requiring very little information to induce a behavior correlate with the perturbed gradients with respect to behavioral loss being large for that input?

Given an assumption that the information content of the soft prompts has been optimized into a local minimum, sampling the gradient directly at the soft prompt should show small gradients. In order for this correlation to hold, there would need to be a steeply bounded valley in the loss landscape. Or to phrase it another way, for this correlation to exist, behaviors which are extremely well-compressed by the model and have informationally trivial pointers would need to correlate with fragile internal representations.

If anything, I'd expect anticorrelation; well-learned regions probably have enough training constraints that they've been shaped into more reliable, generalizing formats that can representationally interpolate to adjacent similar concepts.

That'd still be an interesting thing to observe and confirm, and there are other notions of fragility that could be considered.

Comment by porby on porby's Shortform · 2023-12-10T22:47:36.252Z · LW · GW

A further extension: While relatively obvious in context, this also serves as a great way to automate adversarial jailbreak attempts (broadly construed), and to quantify how resistant a given model or prompting strategy is to jailbreaks.

Set up your protections, then let SGD try to jailbreak it. The strength of the protections can be measured by the amount of information required to overcome the defenses to achieve some adversarial goal.

In principle, a model could be perfectly resistant and there would be no quantity of information sufficient to break it. That'd be good to know!

This kind of adversarial prompt automation could also be trivially included in an evaluations program.

I can't imagine that this hasn't been done before. If anyone has seen something like this, please let me know.

Comment by porby on porby's Shortform · 2023-12-10T22:38:47.691Z · LW · GW

Expanding on #6 from above more explicitly, since it seems potentially valuable:

From the goal agnosticism FAQ:

The definition as stated does not put a requirement on how "hard" it needs to be to specify a dangerous agent as a subset of the goal agnostic system's behavior. It just says that if you roll the dice in a fully blind way, the chances are extremely low. Systems will vary in how easy they make it to specify bad agents.

From an earlier experiment post:

Figure out how to think about the "fragility" of goal agnostic systems. Conditioning a predictor can easily yield an agent that is not goal agnostic; this is expected and not inherently problematic. But what if it is trivial to accidentally condition a strong model into being a worldeater, rather than a passive Q&A bot? There's clearly a spectrum here in terms of how "chaotic" a model is—the degree to which small perturbations can yield massive consequences—but it remains conceptually fuzzy.

This can be phrased as "what's the amount of information required to push a model into behavior X?"

Given a frozen model, optimizing prompt tokens gives us a direct way of answering a relevant proxy for this question:

"What is the amount of information (accessible to SGD through soft prompting) required to push a model into behavior X?"

In practice, this seems like it should be a really good proxy, and (provided some compute) it gives you a trivially quantifiable answer:

Try different soft prompt token counts and observe performance on the task that the soft prompts were targeting. The resulting token count versus performance curve characterizes the information/performance tradeoff for that behavior, given that model.
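
The loop itself is short; `train_soft_prompt` and `evaluate_behavior` below are placeholders for an actual soft prompt tuning setup:

```python
# Sketch: characterize the information/performance curve for a target behavior.
# `train_soft_prompt` and `evaluate_behavior` are placeholders for a real
# soft prompt tuning and evaluation pipeline.
def information_performance_curve(model, target_behavior_data):
    curve = []
    for token_count in [1, 2, 4, 8, 16, 64, 256]:
        soft_prompt = train_soft_prompt(model, target_behavior_data, token_count)
        score = evaluate_behavior(model, soft_prompt, target_behavior_data)
        curve.append((token_count, score))
    # A behavior reachable with very few tokens needs very little information
    # to elicit; alarming if the behavior in question is dangerous.
    return curve
```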

This seems like... it's... an extremely good answer to the "fragility" question? It's trivial to incorporate this into an evaluations scheme. Just have a bunch of proxy tasks that would be alarming if they were accessible by trivial differences in prompting.

Conceptually, it's a quantification of the number of information theoretic mistakes you'd need to make to get bad behavior from the model.

Comment by porby on porby's Shortform · 2023-12-10T22:23:02.514Z · LW · GW

Soft prompts are another form of prompt automation that should naturally preserve all the nice properties of goal agnostic architectures.

Does training the model to recognize properties (e.g. 'niceness') explicitly as metatokens via classification make soft prompts better at capturing those properties?

You could test for that explicitly:

  1. Pretrain model A with metatokens with a classifier.
  2. Pretrain model B without metatokens.
  3. Train soft prompts on model A with the same classifier.
  4. Train soft prompts on model B with the same classifier.
  5. Compare performance of soft prompts in A and B using the classifier.

Notes and extensions:

  1. The results of the research are very likely scale sensitive. As the model gets larger, many classifier-relevant distinctions that could be missed by small models lacking metatoken training may naturally get included. In the limit, the metatoken training contribution may become negligible. Is this observable across ~pythia scales? Could do SFT on pythia to get a "model A."
  2. The above description leaves out some complexity. Ideally, the classifier could give scalar scores. This requires scalarized input tokens for the model that pretrains with metatokens.
  3. How does soft prompting work when tokens are forced to be smaller? For example, if each token is a character, it'll likely have a smaller residual dedicated to it compared to tokens that span ~4 characters, to equalize total compute.
  4. To what degree does soft prompting verge on a kind of "adversarial" optimization? Does it find fragile representations where small perturbations could produce wildly different results? If so, what kinds of regularization are necessary to push back on that, and what is the net effect of that regularization?
  5. There's no restriction on the nature of the prompt. In principle, the "classifier" could be an RL-style scoring mechanism for any reward. How many tokens does it take to push a given model into particular kinds of "agentic" behavior? For example, how many tokens does it take to encode the prompt corresponding to "maximize the accuracy of the token prediction at index 32 in the sequence"?
  6. More generally: the number of tokens required to specify a behavior could be used as a metric for the degree to which a model "bakes in" a particular functionality. More tokens required to specify behavior successfully -> more information required in that model to specify that behavior.

Comment by porby on porby's Shortform · 2023-12-09T20:42:53.589Z · LW · GW

Quarter-baked experiment:

  1. Stick a sparse autoencoder on the residual stream in each block.
  2. Share weights across autoencoder instances across all blocks.
  3. Train autoencoder during model pretraining.
  4. Allow the gradients from autoencoder loss to flow into the rest of the model. (A toy sketch of this wiring follows below.)
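
A toy sketch of the wiring, not from any existing codebase; tiny MLP-only "blocks" stand in for real transformer blocks:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSAE(nn.Module):
    # One sparse autoencoder whose weights are shared across every block's
    # residual stream.
    def __init__(self, d_model, d_dict):
        super().__init__()
        self.encode = nn.Linear(d_model, d_dict)
        self.decode = nn.Linear(d_dict, d_model)

    def loss(self, residual, l1_coeff=1e-3):
        features = torch.relu(self.encode(residual))
        recon = self.decode(features)
        return ((recon - residual) ** 2).mean() + l1_coeff * features.abs().mean()

class ToyBlock(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        return x + self.mlp(x)  # residual stream update

d_model, d_dict, n_blocks, vocab = 64, 512, 4, 100
embed = nn.Embedding(vocab, d_model)
blocks = nn.ModuleList([ToyBlock(d_model) for _ in range(n_blocks)])
unembed = nn.Linear(d_model, vocab)
sae = SharedSAE(d_model, d_dict)

tokens = torch.randint(0, vocab, (8, 16))
x = embed(tokens)
sae_loss = 0.0
for block in blocks:
    x = block(x)
    sae_loss = sae_loss + sae.loss(x)   # same SAE weights on every block

logits = unembed(x)
lm_loss = F.cross_entropy(logits[:, :-1].reshape(-1, vocab),
                          tokens[:, 1:].reshape(-1))
total_loss = lm_loss + 0.1 * sae_loss   # SAE gradients flow into the model
total_loss.backward()
```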

Why? With shared autoencoder weights, every block is pushed toward sharing a representation. Questions:

  1. Do the meanings of features remain consistent over multiple blocks? What does it mean for an earlier block's feature to "mean" the same thing as a later block's same feature when they're at different parts of execution?
  2. How much does a shared representation across all blocks harm performance? Getting the comparison right is subtle; it would be quite surprising if there is no slowdown on predictive training when combined with the autoencoder training since they're not necessarily aligned. Could try training very small models to convergence to see if they have different plateaus.
  3. If forcing a shared representation doesn't harm performance, why not? In principle, blocks can execute different sorts of programs with different IO. Forcing the residual stream to obey a format that works for all blocks without loss would suggest that there were sufficient representational degrees of freedom remaining (e.g. via superposition) to "waste" some when the block doesn't need it. Or the shared "features" mean something completely different at different points in execution.
  4. Compare the size of the dictionary required to achieve a particular specificity of feature between the shared autoencoder and a per-block autoencoder. How much larger is the shared autoencoder? In the limit, it could just be BlockCount times larger with some piece of the residual stream acting as a lookup. It'd be a little surprising if there was effectively no sharing.
  5. Compare post-trained per-block autoencoders against per-block autoencoders embedded in pretraining that allow gradients to flow into the rest of the model. Are there any interesting differences in representation? Maybe in terms of size of dictionary relative to feature specificity? In other words, does pretraining the feature autoencoder encourage a more decodable native representation?
  6. Take a look at the decoded features across blocks. Can you find a pattern for what features are relevant to what blocks? (This doesn't technically require having a shared autoencoder, but having a single shared dictionary makes it easier to point out when the blocks are acting on the same feature, rather than doing an investigation, squinting, and saying "yeah, that sure looks similar.")

Comment by porby on How to Control an LLM's Behavior (why my P(DOOM) went down) · 2023-11-30T18:44:20.907Z · LW · GW

I think that'd be great!

Some of this stuff technically accelerates capabilities (or more specifically, the elicitation of existing capabilities), but I think it also belongs to a more fundamentally reliable path on the tech tree. The sooner the industry embraces it, the less time they spend in other parts of the tech tree that are more prone to misoptimization failures, and the less likely it is that someone figures out how to make those misoptimization failures way more efficient.

I suspect there's a crux about the path of capabilities development in there for a lot of people; I should probably get around to writing a post about the details at some point. 

Comment by porby on How to Control an LLM's Behavior (why my P(DOOM) went down) · 2023-11-29T19:18:37.592Z · LW · GW

What I'm calling a simulator (following Janus's terminology) you call a predictor

Yup; I use the terms almost interchangeably. I tend to use "simulator" when referring to predictors used for a simulator-y use case, and "predictor" when I'm referring to how they're trained and things directly related to that.

I also like your metatoken concept: that's functionally what I'm suggesting for the tags in my proposal, except I follow the suggestion of this paper to embed them via pretraining.

Yup again—to be clear, all the metatoken stuff I was talking about would also fit in pretraining. Pretty much exactly the same thing. There are versions of it that might get some efficiency boosts by not requiring them to be present for the full duration of pretraining, but still similar in concept. (If we can show an equivalence between trained conditioning and representational interventions, and build representational interventions out of conditions, that could be many orders of magnitude faster.) 

Comment by porby on How to Control an LLM's Behavior (why my P(DOOM) went down) · 2023-11-29T19:00:34.191Z · LW · GW

Alas, nope! To my knowledge it hasn't actually been tried at any notable scale; it's just one of those super simple things that would definitely work if you were willing to spend the compute to distill the behavior.

Comment by porby on How to Control an LLM's Behavior (why my P(DOOM) went down) · 2023-11-29T04:06:27.588Z · LW · GW

Signal boosted! This is one of those papers that seems less known than it should be. It's part of the reason why I'm optimistic about dramatic increases in the quality of "prosaic" alignment (in the sense of avoiding jailbreaks and generally behaving as expected) compared to RLHF, and I think it's part of a path that's robust enough to scale.

You can compress huge prompts into metatokens, too (just run inference with the prompt to generate the training data). And nest and remix metatokens together.
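
A hedged sketch of that compression loop, assuming a Hugging Face-style model and tokenizer; `generate`, `finetune`, and `question_pool` are illustrative stand-ins for a sampling helper, a standard causal-LM fine-tuning loop, and a prompt dataset:

```python
def compress_prompt_into_metatoken(model, tokenizer, long_prompt, question_pool,
                                   generate, finetune, metatoken="<compressed_prompt>"):
    """Distill the behavior induced by `long_prompt` into a single new metatoken."""
    # Step 1: generate training data by running inference with the full prompt.
    completions = [generate(model, tokenizer, long_prompt + "\n" + q) for q in question_pool]

    # Step 2: add the metatoken and train on (metatoken + question -> completion) pairs,
    # so one token comes to stand in for whatever behavior the long prompt induced.
    tokenizer.add_tokens([metatoken])
    model.resize_token_embeddings(len(tokenizer))
    pairs = [(metatoken + q, c) for q, c in zip(question_pool, completions)]
    finetune(model, tokenizer, pairs)
```

Nesting and remixing then amounts to including previously trained metatokens inside the prompts being distilled.
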

It's also interesting in that it can preserve the constraints on learnable values during predictive training, unlike approaches equivalent to RL with sparse/distant rewards.

The fact that the distinctions it learns about the metatokens become better and better as more optimization pressure is applied is an interesting inversion of the usual doom-by-optimization story. Taking such a model to the extreme of optimization just makes it exceedingly good at distinguishing subtle details of what constitutes <nice> versus <authoritative_tone> versus <correct>. It's an axis of progress in alignment that generalizes as the capability does; the capability is the alignment. I'm pretty certain that a model that has very thoroughly learned what "nice" means at the human level can meaningfully generalize it to contexts where it hasn't seen it directly applied.[1]

I'm also reasonably confident in finding some other paths to extremely similar effects on internal representations. I wouldn't be surprised if we can decompose conditions into representational features to learn about what they mean at the learned feature level, then cobble together new inference-time conditions via representational intervention that would have equivalent effects to training new metatokens. 

  1. ^

    After all, ChatGPT4/DALLE3 can generate an image of a vacuum cleaner that "embodies the aspirational human trait of being kind to one another." That seems like more of a reach than a hypothetical superintelligence figuring out that humans wouldn't be okay with, say, a superscience plan that would blow up 25% of the earth's crust.

    [Image generated by DALL·E]

Comment by porby on Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense · 2023-11-29T02:10:28.495Z · LW · GW

I claim we are many scientific insights away from being able to talk about these questions at the level of precision necessary to make predictions like this.

Hm, I'm sufficiently surprised at this claim that I'm not sure that I understand what you mean. I'll attempt a response on the assumption that I do understand; apologies if I don't:

I think of tools as agents with oddly shaped utility functions. They tend to be conditional in nature.

A common form is to be a mapping between inputs and outputs that isn't swayed by anything outside of the context of that mapping (which I'll term "external world states"). You can view a calculator as a coherent agent, but you can't usefully describe the calculator as a coherent agent with a utility function regarding world states that are external to the calculator's process.

You could use a calculator within a larger system that is describable as a maximizer over a utility function that includes unconditional terms for external world states, but that doesn't change the nature of the calculator. Draw the box around the calculator within the system? Pretty obviously a tool. Draw the box around the whole system? Not a tool.

I've been using the following two requirements to point at a maximally[1] tool-like set of agents. This composes what I've been calling goal agnosticism:

  1. The agent cannot be usefully described[2] as having unconditional preferences about external world states.
  2. Any uniformly random sampling of behavior from the agent has a negligible probability of being a strong and incorrigible optimizer.   

Note that this isn't the same thing as a definition for "tool." An idle rock uselessly obeys this definition; tools tend to be useful for something. This definition is meant to capture the distinction between things that feel like tools and those that feel like "proper" agents.

To phrase it another way, the intuitive degree of "toolness" is a spectrum of how much the agent exhibits unconditional preferences about external world states through instrumental behavior.

Notably, most pretrained LLMs with the usual autoregressive predictive loss and a diverse training set are heavily constrained into fitting this definition. Anything equivalent to RL agents trained with sparse/distant rewards is not. RLHF bakes a condition of peculiar shape into the model. I wouldn't be surprised if it no longer strictly obeys the definition, but it's close enough along the spectrum that it still feels intuitive to call it a tool.

Further, just like in the case of the calculator, you can easily build a system around a goal agnostic "tool" LLM that is not, itself, goal agnostic. Even prompting is enough to elicit a new agent-in-effect that is not necessarily goal agnostic. The ability for a goal agnostic agent to yield non-goal agnostic agents does not break the underlying agent's properties.[3]

  1. ^

    For one critical axis in the toolishness basis, anyway.

  2. ^

    Tricky stuff like having a bunch of terms regarding external world states that just so happen to always cancel don't count.

  3. ^

    This does indeed sound kind of useless, but I promise the distinction does actually end up mattering quite a lot! That discussion goes beyond the scope of this post. The FAQ goes into more depth.

Comment by porby on Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense · 2023-11-29T01:05:43.508Z · LW · GW

While this probably isn't the comment section for me to dump screeds about goal agnosticism, in the spirit of making my model more legible:

I think that if it is easy and obvious how to make a goal-agnostic AI into a goal-having AI, and also it seems like doing so will grant tremendous power/wealth/status to anyone who does so, then it will get done. And I do think that these things are the case.

Yup! The value I assign to goal agnosticism—particularly as implemented in a subset of predictors—is in its usefulness as a foundation to build strong non-goal agnostic systems that aren't autodoomy. The transition out of goal agnosticism is not something I expect to avoid, nor something that I think should be avoided.

I think that a mish-mash of companies and individual researchers acting with little effective oversight will almost certainly fall off the path, and that even having most people adhering to the path won't be enough to stop catastrophe once someone has defected.

I'd be more worried about this if I thought the path was something that required Virtuous Sacrifice to maintain. In practice, the reason I'm as optimistic (nonmaximally pessimistic?) as I am is that I think there are pretty strong convergent pressures to stay on something close enough to the non-autodoom path.

In other words, if my model of capability progress is roughly correct, then there isn't a notably rewarding option to "defect" architecturally/technologically that yields greater autodoom.

With regard to other kinds of defection:

I also think that misuse can lead more directly to catastrophe, through e.g. terrorists using a potent goal-agnostic AI to design novel weapons of mass destruction. So in a world with increasingly potent and unregulated AI, I don't see how to have much hope for humanity.

Yup! Goal agnosticism doesn't directly solve misuse (broadly construed), which is part of why misuse is ~80%-ish of my p(doom).

And I also don't see any easy way to do the necessary level of regulation and enforcement. That seems like a really hard problem. How do we prevent ALL of humanity from defecting when defection becomes cheap, easy-to-hide, and incredibly tempting?

If we muddle along deeply enough into a critical risk period slathered in capability overhangs that TurboDemon.AI v8.5 is accessible to every local death cult and we haven't yet figured out how to constrain their activity, yup, that's real bad.

Given my model of capability development, I think there are many incremental messy opportunities to act that could sufficiently secure the future over time. Given the nature of the risk and how it can proliferate, I view it as much harder to handle than nukes or biorisk, but not impossible.

Comment by porby on porby's Shortform · 2023-11-28T00:41:51.558Z · LW · GW

Another experiment:

  1. Train model M.
  2. Train sparse autoencoder feature extractor for activations in M.
  3. FT = FineTune(M), for some form of fine-tuning function FineTune.
  4. For input x, fineTuningBias(x) = FT(x) - M(x)
  5. Build a loss function on top of the fineTuningBias function. Obvious options are MSE or dot product with bias vector.
  6. Backpropagate the loss through M(x) into the feature dictionaries.
  7. Identify responsible features by large gradients.
  8. Identify what those features represent (manually or AI-assisted).
  9. To what degree do those identified features line up with the original FineTune function's intent?
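
A rough sketch of steps 3-7, assuming the sparse autoencoder reads one particular activation site and that `get_resid` / `run_from_resid` are hook helpers for reading that site and finishing M's forward pass from a replacement (both are assumptions, as is the MSE choice in step 5):

```python
import torch

def responsible_features(M, FT, sae, x, get_resid, run_from_resid, top_k=20):
    """Attribute the fine-tuning bias FT(x) - M(x) back onto dictionary features."""
    with torch.no_grad():
        ft_out = FT(x)                                     # fine-tuned output, held constant

    resid = get_resid(M, x)                                # activations at the autoencoder's site
    feats = torch.relu(sae.enc(resid)).detach().requires_grad_(True)
    m_out = run_from_resid(M, sae.dec(feats), x)           # M's output with the reconstruction swapped in

    loss = ((ft_out - m_out) ** 2).mean()                  # step 5: MSE built on the fine-tuning bias
    loss.backward()                                        # step 6: gradients flow into the features

    per_feature = feats.grad.abs().flatten(0, -2).mean(0)  # step 7: rank features by gradient magnitude
    return per_feature.topk(top_k).indices
```
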

Extensions:

  1. The features above are in the context of a single input. Check for larger scopes by sampling more inputs, backpropagating, and averaging the observed feature activations. Check for ~unconditional shifts induced by FineTune by averaging over an extremely broad sampling of inputs.
  2. Can check path dependence during RLHF-like fine tuning. Do the features modified across multiple RLHF runs remain similar? Note that this does not require interpreting what features represent, just that they differ. That makes things easier! (Also, note that this doesn't technically require a feature dictionary; the sparse autoencoder bit just makes it easier to reason about the resulting direction.)
  3. Can compare representations learned by decision transformers versus PPO-driven RLHF. Any difference between the features affected? Any difference in the degree of path dependence?
  4. Can compare other forms of conditioning. Think [2302.08582] Pretraining Language Models with Human Preferences (arxiv.org). In this case, there wouldn't really be a fine-tuning training stage; rather, the existence of the condition would serve as the runtime FineTune function. Compare the features between the conditioned and unconditioned cases. Presence of the conditions in pretraining could change the expressed features, but that's not a huge problem. 
  5. Any way to meaningfully compare against activation steering? Given that the analysis is based directly on the activations to begin with, it would just be a question of where the steering vector came from. The feature dictionary could be used to build a steering vector, in principle.
  6. Does RLHF change the feature dictionary? On one hand, conditioning-equivalent RL (with KL divergence penalty) shouldn't find new sorts of capability-relevant distinctions, but it's very possible that it collapses some features that are no longer variable in the fine-tuned model. This is trickier to evaluate; could try to train a linear map on the activations of model B before feeding it to an autoencoder trained on model A's activations.  
Comment by porby on porby's Shortform · 2023-11-27T22:42:01.970Z · LW · GW

Some experimental directions I recently wrote up; might as well be public:

  1. Some attempts to demonstrate how goal agnosticism breaks with modifications to the architecture and training type. Trying to make clear the relationship between sparsity/distance of the implicit reward function and unpredictability of results.
  2. A continuation and refinement of my earlier (as of yet unpublished) experiments about out of distribution capability decay. Goal agnosticism is achieved by bounding the development of capabilities into a shape incompatible with internally motivated instrumental behavior across the training distribution; if it's possible for any nontrivial capability to persist out of distribution at toy scales, even with significant contrivance to train it into existence in the first place, that would be extremely concerning for the potential persistence of deceptive mesaoptimizers at scale.

    Ideally, the experiment would examine the difference between OOD capabilities with varying levels of overlap with the training distribution. For example, contrast four cases:
    A: A model is trained on ten different "languages" with zero translation tasks between them. These "languages" would not be human languages, but rather trivial types of sequences that do not share any obvious form or underlying structure. One language could be the sequence generated by f(x) = 2x + 1; another might be to endlessly repeat "brink bronk poot toot." (A small data-generation sketch follows this list.)
    B: A model is trained on ten different languages with significantly different form, but a shared underlying structure. For example, all the languages might involve solving trivial arithmetic, but one language is "3 + 4 = 7" and another language is "three plus four equals seven."
    C: Same as B, but now give the model translation tasks.
    D: Same as C, but leave one language pair's translation tasks unspecified. Any successful translation for that pair would necessarily arise from a generalizing implementation.

    For each model, drop parts of the training distribution but continue to perform test evaluations on that discontinued part. Do models with more apparent shared implementation decay more slowly? How does the decay vary with hyperparameters?

    Some circuit-level analysis might be helpful here to identify whether capability is lost via trivial gating versus catastrophic scrambling, but it's probably best to punt that to a separate experiment.
  3. I suspect there is an equivalence between conditioning and representational intervention, like activation steering. They may be different interfaces to the same effect. I'd like to poke around metatoken-like approaches (like Pretraining Language Models with Human Preferences) and see if I can find anything compelling from a representational perspective.
  4. Assuming goal agnosticism is actually achieved and maintained, it broadens the scope for what kinds of interpretability can be useful by ruling out internal representational adversaries. There may be room for more experiments around motivational interpretability. (Some other work has already been published on special cases.)
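
A small sketch of the kind of toy "language" generators that cases A and B gesture at (the specific generators and formatting are illustrative):

```python
import random

def lang_progression(length):
    """Case A-style: the sequence generated by f(x) = 2x + 1."""
    return " ".join(str(2 * x + 1) for x in range(length))

def lang_chant(length):
    """Case A-style: endlessly repeat a fixed nonsense phrase."""
    return " ".join(["brink bronk poot toot"] * length)

def lang_symbolic_arithmetic(length):
    """Case B-style: trivial arithmetic in symbolic form, e.g. '3 + 4 = 7'."""
    problems = []
    for _ in range(length):
        a, b = random.randint(0, 9), random.randint(0, 9)
        problems.append(f"{a} + {b} = {a + b}")
    return " ; ".join(problems)

LANGUAGES = [lang_progression, lang_chant, lang_symbolic_arithmetic]

def sample_training_doc(length=16):
    """Each document comes from one language; dropping a language from this sampler while
    continuing to evaluate on it gives the out-of-distribution decay measurement."""
    return random.choice(LANGUAGES)(length)
```
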


Less concretely, I'd also like to:

  1. Figure out how to think about the "fragility" of goal agnostic systems. Conditioning a predictor can easily yield an agent that is not goal agnostic; this is expected and not inherently problematic. But what if it is trivial to accidentally condition a strong model into being a worldeater, rather than a passive Q&A bot? There's clearly a spectrum here in terms of how "chaotic" a model is—the degree to which small perturbations can yield massive consequences—but it remains conceptually fuzzy.
  2. More fully ground "Responsible Scaling Policy"-style approaches on a goal agnostic foundation. If a lab can demonstrate that a model is incapable of learning preferences over external world states, and that their method of aiming the model isn't "fragile" in the above sense, then it's a good candidate for incremental experimentation.
  3. Come up with other ways to connect this research path with policy more generally.
Comment by porby on Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense · 2023-11-27T18:22:45.419Z · LW · GW

In retrospect, the example I used was poorly specified. It wouldn't surprise me if the result of the literal interpretation was "the AI refuses to play chess" rather than any kind of worldeating. The intent was to pick a sparse/distant reward that doesn't significantly constrain the kind of strategies that could develop, and then run an extreme optimization process on it. In other words, while intermediate optimization may result in improvements to chess playing, being better at chess isn't actually the most reliable accessible strategy to "never lose at chess" for that broader type of system and I'd expect superior strategies to be found in the limit of optimization.

Comment by porby on Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense · 2023-11-27T18:11:01.432Z · LW · GW

But the point is that in this scenario the LM doesn't want anything in the behaviorist sense, yet is a perfectly adequate tool for solving long-horizon tasks. This is not the form of wanting you need for AI risk arguments.

My attempt at an ITT-response:

Drawing a box around a goal agnostic LM and analyzing the inputs and outputs of that box would not reveal any concerning wanting in principle. In contrast, drawing a box around a combined system—e.g. an agentic scaffold that incrementally asks a strong inner goal agnostic LM to advance the agent's process—could still be well-described by a concerning kind of wanting.

Trivially, being better at achieving goals makes achieving goals easier, so there's pressure to make systems-as-agents which are better at removing wrenches. As the problems become more complicated, the system needs to be more responsible for removing wrenches to be efficient, yielding further pressure to give the system-as-agent more ability to act. Repeat this process a sufficient and unknown number of times and, potentially without ever training a neural network describable as having goals with respect to external world states, there's a system with dangerous optimization power.

(Disclaimer: I think there are strong repellers that prevent this convergent death spiral, I think there are lots of also-attractive-for-capabilities offramps along the worst path, and I think LM-like systems make these offramps particularly accessible. I don't know if I'm reproducing opposing arguments faithfully and part of the reason I'm trying is to see if someone can correct/improve on them.)

Comment by porby on Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense · 2023-11-25T18:05:02.916Z · LW · GW

Trying to respond in what I think the original intended frame was:

A chess AI's training bounds what the chess AI can know and learn to value. Given the inputs and outputs it has, it isn't clear there is an amount of optimization pressure accessible to SGD which can yield situational awareness and so forth; nothing about the trained mapping incentivizes that. This form of chess AI can be described in the behaviorist sense as "wanting" to win within the boundaries of the space that it operates.

In contrast, suppose you have a strong and knowledgeable multimodal predictor trained on all data humanity has available to it that can output arbitrary strings. Then apply extreme optimization pressure for never losing at chess. Now, the boundaries of the space in which the AI operates are much broader, and the kinds of behaviorist "values" the AI can have are far less constrained. It has the ability to route through the world, and with extreme optimization, it seems likely that it will.

(For background, I think it's relatively easy to relocate where the optimization squeezing is happening to avoid this sort of worldeating outcome, but it remains true that optimization for targets with ill-defined bounds is spooky and to be avoided.)

Comment by porby on FAQ: What the heck is goal agnosticism? · 2023-11-25T00:22:16.102Z · LW · GW

you mention « restrictive », my understanding is that you want this expression to refer specifically to pure predictors. Correct?

Goal agnosticism can, in principle, apply to things which are not pure predictors, and there are things which could reasonably be called predictors which are not goal agnostic.

A subset of predictors are indeed the most powerful known goal agnostic systems. I can't currently point you toward another competitive goal agnostic system (rocks are uselessly goal agnostic), but the properties of goal agnosticism do, in concept, extend beyond predictors, so I leave the door open.

Also, by using the term "goal agnosticism" I try to highlight the value that arises directly from the goal-related properties, like statistical passivity and the lack of instrumental representational obfuscation. I could just try to use the more limited and implementation specific "ideal predictors" I've used before, but in order to properly specify what I mean by an "ideal" predictor, I'd need to specify goal agnosticism.

Comment by porby on Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense · 2023-11-24T23:45:51.681Z · LW · GW

I'm not sure if I fall into the bucket of people you'd consider this to be an answer to. I do think there's something important in the region of LLMs that, by vibes if not explicit statements of contradiction, seems incompletely propagated in the agent-y discourse even though it fits fully within it. I think I at least have a set of intuitions that overlap heavily with some of the people you are trying to answer.

In case it's informative, here's how I'd respond to this:

Well, I claim that these are more-or-less the same fact. It's no surprise that the AI falls down on various long-horizon tasks and that it doesn't seem all that well-modeled as having "wants/desires"; these are two sides of the same coin.

Mostly agreed, with the capability-related asterisk.

Because the way to achieve long-horizon targets in a large, unobserved, surprising world that keeps throwing wrenches into one's plans, is probably to become a robust generalist wrench-remover that keeps stubbornly reorienting towards some particular target no matter what wrench reality throws into its plans.

Agreed in the spirit that I think this was meant, but I'd rephrase this: a robust generalist wrench-remover that keeps stubbornly reorienting towards some particular target will tend to be better at reaching that target than a system that doesn't.

That's subtly different from individual systems having convergent internal reasons for taking the same path. This distinction mostly disappears in some contexts, e.g. selection in evolution, but it is meaningful in others.

If an AI causes some particular outcome across a wide array of starting setups and despite a wide variety of obstacles, then I'll say it "wants" that outcome “in the behaviorist sense”.

I think this frame is reasonable, and I use it.

it's a little hard to imagine that you don't contain some reasonably strong optimization that strategically steers the world into particular states.

Agreed.

that the wanting-like behavior required to pursue a particular training target X, does not need to involve the AI wanting X in particular.

Agreed.

“AIs need to be robustly pursuing some targets to perform well on long-horizon tasks”, but it does not say that those targets have to be the ones that the AI was trained on (or asked for). Indeed, I think the actual behaviorist-goal is very unlikely to be the exact goal the programmers intended, rather than (e.g.) a tangled web of correlates.

Agreed for a large subset of architectures. Any training involving the equivalent of extreme optimization for sparse/distant reward in a high dimensional complex context seems to effectively guarantee this outcome.

 So, maybe don't make those generalized wrench-removers just yet, until we do know how to load proper targets in there.

Agreed, don't make the runaway misaligned optimizer.

I think there remains a disagreement hiding within that last point, though. I think the real update from LLMs is:

  1. We have a means of reaching extreme levels of capability without necessarily exhibiting preferences over external world states. You can elicit such preferences, but a random output sequence from the pretrained version of GPT-N (assuming the requisite architectural similarities) has no realistic chance of being a strong optimizer with respect to world states. The model itself remains a strong optimizer, just for something that doesn't route through the world.
  2. It's remarkably easy to elicit this form of extreme capability to guide itself. This isn't some incidental detail; it arises from the core process that the model learned to implement.
  3. That core process is learned reliably because the training process that yielded it leaves no room for anything else. It's not a sparse/distant reward target; it is a profoundly constraining and informative target.

In other words, a big part of the update for me was in having a real foothold on loading the full complexity of "proper targets."

I don't think what we have so far constitutes a perfect and complete solution, the nice properties could be broken, paradigms could shift and blow up the golden path, it doesn't rule out doom, and so on, but diving deeply into this has made many convergent-doom paths appear dramatically less likely to Late2023!porby compared to Mid2022!porby.

Comment by porby on What's the evidence that LLMs will scale up efficiently beyond GPT4? i.e. couldn't GPT5, etc., be very inefficient? · 2023-11-24T21:23:33.124Z · LW · GW

This isn't directly evidence, but I think it's worth flagging: by the nature the topic, much of the most compelling evidence is potentially hazardous. This will bias the kinds of answers you can get.

(This isn't hypothetical. I don't have some One Weird Trick To Blow Up The World, but there's a bunch of stuff that falls under the policy "probably don't mention this without good reason out of an abundance of caution.")

Comment by porby on TurnTrout's shortform feed · 2023-11-23T21:25:32.943Z · LW · GW

For what it's worth, I've had to drop from python to C# on occasion for some bottlenecks. In one case, my C# implementation was 418,000 times faster than the python version. That's a comparison between a poor python implementation and a vectorized C# implementation, but... yeah.

Comment by porby on FAQ: What the heck is goal agnosticism? · 2023-11-14T02:41:08.545Z · LW · GW

…but I thought the criterion was unconditional preference? The idea of nausea is precisely because agents can decide to act despite nausea, they’d just rather find a better solution (if their intelligence is up to the task).

Right; a preference being conditionally overwhelmed by other preferences does not make the presence of the overwhelmed preference conditional.

Or to phrase it another way, suppose I don't like eating bread[1] (-1 utilons), but I do like eating cheese (100 utilons) and garlic (1000 utilons).

You ask me to choose between garlic bread (1000 - 1 = 999 utilons) and cheese (100 utilons); I pick the garlic bread.

The fact that I don't like bread isn't erased by the fact that I chose to eat garlic bread in this context.

It also seems to cover security (if we’re dead it won’t know), health (if we’re incapacitated it won’t know) and prosperity (if we’re under economical constraints that impacts our free will). But I’m interested to consider possible failure modes.

This is aiming at a different problem than goal agnosticism; it's trying to come up with an agent that is reasonably safe in other ways.

In order for these kinds of bounds (curiosity, nausea) to work, they need to incorporate enough of the human intent behind the concepts.

So perhaps there is an interpretation of those words that is helpful, but there remains the question "how do you get the AI to obey that interpretation," and even then, that interpretation doesn't fit the restrictive definition of goal agnosticism.

The usefulness of strong goal agnostic systems (like ideal predictors) is that, while they do not have properties like those by default, they make it possible to incrementally implement those properties.

  1. ^

    utterly false for the record

Comment by porby on FAQ: What the heck is goal agnosticism? · 2023-11-10T00:01:04.548Z · LW · GW

For example, a system that avoids experimenting on humans—even when prompted to do so—is expressing a preference about whether it itself experiments on humans.

Being meaningfully curious will also come along with some behavioral shift. If you tried to induce that behavior in a goal agnostic predictor through conditioning for being curious in that way and embedded it in an agentic scaffold, it wouldn't be terribly surprising for it to, say, set up low-interference observation mechanisms.

Not all violations of goal agnosticism necessarily yield doom, but even prosocial deviations from goal agnosticism are still deviations.

Comment by porby on TurnTrout's shortform feed · 2023-11-09T23:43:35.301Z · LW · GW

I think what we're discussing requires approaching the problem with a mindset entirely foreign to the mainstream one. Consider how many words it took us to get to this point in the conversation, despite the fact that, as it turns out, we basically agree on everything. The inferential distance between the standard frameworks in which AI researchers think, and here, is pretty vast.

True!

I expect that if the mainstream AI researchers do make strides in the direction you're envisioning, they'll only do it by coincidence. Then probably they won't even realize what they've stumbled upon, do some RLHF on it, be dissatisfied with the result, and keep trying to make it have agency out of the box. (That's basically what already happened with GPT-4, to @janus' dismay.)

Yup—this is part of the reason why I'm optimistic, oddly enough. Before GPT-likes became dominant in language models, there were all kinds of flailing that often pointed in more agenty-by-default directions. That flailing then found GPT because it was easily accessible and strong. 

Now, the set of architectural pieces subject to similar flailing is much smaller, and I'm guessing it will only take one more round of at-scale benchmarks from a major lab before the flailing shrinks dramatically further.

In other words, I think the necessary work to make this path take off is small and the benefits will be greedily visible. I suspect one well-positioned researcher could probably swing it.

That said, you're making some high-quality novel predictions here, and I'll keep them in mind when analyzing AI advancements going forward.

Thanks, and thanks for engaging!

Come to think of it, I've got a chunk of mana lying around for subsidy. Maybe I'll see if I can come up with some decent resolution criteria for a market.

Comment by porby on TurnTrout's shortform feed · 2023-11-08T21:54:51.615Z · LW · GW

I assume that by "lower-level constraints" you mean correlations that correctly capture the ground truth of reality, not just the quirks of the training process. Things like "2+2=4",  "gravity exists", and "people value other people"

That's closer to what I mean, but these constraints are even lower level than that. Stuff like understanding "gravity exists" is a natural internal implementation that meets some constraints, but "gravity exists" is not itself the constraint.

In a predictor, the constraints serve as extremely dense information about what predictions are valid in what contexts. In a subset of predictions, the awareness that gravity exists helps predict. In other predictions, that knowledge isn't relevant, or is even misleading (e.g. cartoon physics). The constraints imposed by the training distribution tightly bound the contextual validity of outputs.

But since they're not, at the onset, categorized differently at the level of cognitive algorithms, a nascent AGI would experiment with slipping both types of constraints.

I'd agree that, if you already have an AGI of that shape, then yes, it'll do that. I'd argue that the relevant subset of predictive training practically rules out the development of that sort of implementation, and even if it managed to develop, its influence would be bounded into irrelevance.

Even in the absence of a nascent AGI, these constraints are tested constantly during training through noise and error. The result is a densely informative gradient pushing the implementation back towards a contextually valid state.

Throughout the training process prior to developing strong capability and situational awareness internally, these constraints are both informing and bounding what kind of machinery makes sense in context. A nascent AGI must have served the extreme constraints of the training distribution to show up in the first place; its shape is bound by its development, and any part of that shape that "tests" constraints in a way that worsens loss is directly reshaped.

Even if a nascent internal AGI of this type develops, if it isn't yet strong enough to pull off complete deception with respect to the loss, the gradients will illuminate the machinery of that proto-optimizer and it will not survive in that shape.

Further, even if we suppose a strong internal AGI develops that is situationally aware and is sufficiently capable and motivated to try deception, there remains the added dependency on actually executing that deception while never being penalized by gradients. This remains incredibly hard. It must transition into an implementation that satisfies the oppressive requirements of training while adding an additional task of deception without even suffering a detectable complexity penalty.

These sorts of deceptive mesaoptimizer outcomes are far more likely when the optimizer has room to roam. I agree that you could easily observe this kind of testing and slipping when the constraints under consideration are far looser, but the kind of machine that is required by these tighter constraints doesn't even bother with trying to slip constraints. It's just not that kind of machine, and there isn't a convergent path for it to become that kind of machine under this training mechanism.

And despite that lack of an internal motivation to explore and exploit with respect to any external world states, it still has capabilities (in principle) which, when elicited, make it more than enough to eat the universe.

Does that align with what you're envisioning? If yes, then our views on the issue are surprisingly close. I think it's one of our best chances at producing an aligned AI, and it's one of the prospective targets of my own research agenda.

Yup!

I don't think the current mainstream research directions are poised to result in this. AI Labs have been very clear in their intent to produce an agent-like AGI, not a superhuman forecasting tool. I expect them to prioritize research into whatever tweaks to the training schemes would result in homunculi; not whatever research would result in perfect predictors + our ability to precisely query them.

I agree that they're focused on inducing agentiness for usefulness reasons, but I'd argue the easiest and most effective way to get to useful agentiness actually routes through this kind of approach.

This is the weaker leg of my argument; I could be proven wrong by some new paradigm. But if we stay on something like the current path, it seems likely that the industry will just do the easy thing that works rather than the inexplicable thing that often doesn't work.

What are the "other paths" you're speaking of? As you'd pointed out, prompts are a weak and awkward way to run custom queries on the AI's world-model. What alternatives are you envisioning?

I'm pretty optimistic about members of a broad class that are (or likely are) equivalent to conditioning, since these paths tend to preserve the foundational training constraints.

A simple example is [2302.08582] Pretraining Language Models with Human Preferences (arxiv.org). Having a "good" and "bad" token, or a scalarized goodness token, still pulls in many of the weaknesses of the RLHF's strangely shaped reward function, but there are trivial/naive extensions to this which I would anticipate being major improvements over the state of the art. For example, just have more (scalarized) metatokens representing more concepts such that the model must learn a distinction between being correct and sounding correct, because the training process split those into different tokens. There's no limit on how many such metatokens you could have; throw a few hundred fine-grained classifications into the mix. You could also bake complex metatoken prompts into single tokens with arbitrary levels of nesting or bake the combined result into the weights (though I suspect weight-baking would come with some potential failure modes).[1]
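
A hedged sketch of that naive extension, assuming each pretraining document already has scores from a set of fine-grained classifiers; the tag names and bucketing scheme are illustrative:

```python
def metatoken_prefix(scores, n_buckets=5):
    """Map each classifier score in [0, 1] to a discrete metatoken, e.g. <correct_4> or
    <sounds_correct_1>, so the model must learn the distinctions between them."""
    tags = []
    for name, value in scores.items():     # e.g. {"correct": 0.92, "sounds_correct": 0.31, ...}
        bucket = min(int(value * n_buckets), n_buckets - 1)
        tags.append(f"<{name}_{bucket}>")
    return "".join(tags)

def annotate_document(doc, scores):
    """Pretraining proceeds on the annotated text; at inference time, conditioning is just
    prepending the desired combination of metatokens."""
    return metatoken_prefix(scores) + doc
```
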

Another more recent path is observing the effect that conditions have on activations and dynamically applying the activation diffs to steer behavior. At the moment, I don't know how to make this quite as strong as the previous conditioning scheme, but I bet people will figure out a lot more soon and that it leads somewhere similar.
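
A rough sketch of that activation-diff path, assuming access to a residual-stream hook; `residual_at`, the chosen `block`, and `alpha` are illustrative, and the hooked module is assumed to return the residual tensor directly:

```python
import torch

def condition_diff(model, tokenizer, residual_at, layer, conditioned, unconditioned):
    """Mean residual-stream difference between a conditioned and an unconditioned prompt."""
    with torch.no_grad():
        a = residual_at(model, tokenizer, conditioned, layer)    # assumed hook helper, [T, d_model]
        b = residual_at(model, tokenizer, unconditioned, layer)
    return a.mean(0) - b.mean(0)

def steered_generate(model, tokenizer, block, prompt, diff, alpha=4.0, **gen_kwargs):
    """Add alpha * diff to the block's output during generation."""
    def hook(module, inputs, output):
        return output + alpha * diff
    handle = block.register_forward_hook(hook)
    try:
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        out = model.generate(ids, **gen_kwargs)
    finally:
        handle.remove()
    return tokenizer.decode(out[0], skip_special_tokens=True)
```
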

  1. ^

    There should exist some reward signal which could achieve a similar result in principle, but that goes back to the whole "we suck at designing rewards that result in what we want" issue. This kind of structure, as ad hoc as it is, is giving us an easier API to lever the model's own capability to guide its behavior. I bet we can come up with even better implementations, too.

Comment by porby on FAQ: What the heck is goal agnosticism? · 2023-11-08T21:00:02.662Z · LW · GW

Probably not? It's tough to come up with an interpretation of those properties that wouldn't result in the kind of unconditional preferences that break goal agnosticism.

Comment by porby on TurnTrout's shortform feed · 2023-11-07T19:28:25.624Z · LW · GW

I'm using it as "an optimization constraint on actions/plans that correlated well with good performance on the training dataset; a useful heuristic".

Alright, this is pretty much the same concept then, but the ones I'm referring to operate at a much lower and tighter level than thumbs-downing murder-proneness.

So...

Such constraints are, for example, the reason our LLMs are able to produce coherent speech at all, rather than just babbling gibberish.

Agreed.

... and yet this would still get in the way of qualitatively more powerful capabilities down the line, and a mind that can't somehow slip these constraints won't be a general intelligence.

While I agree these claims probably hold for the concrete example of thumbs-downing an example of murderproneness, I don't see how they hold for the lower-level constraints that imply the structure of its capability. Slipping those constraints looks more like babbling gibberish.

By default, those would be constrained to be executed the way humans execute them, the way the AI was shown to do it during the training. But the whole point of an AGI is that it should be able to invent better solutions than ours. More efficient ways of thinking, weird super-technological replacements for our construction techniques, etc.

While it's true that an AI probably isn't going to learn true things which are utterly divorced from and unimplied by the training distribution, I'd argue that the low-level constraints I'm talking about both leave freedom for learning wildly superhuman internal representations and directly incentivize it during extreme optimization. An "ideal predictor" wouldn't automatically start applying these capabilities towards any particular goal involving external world states by default, but it remains possible to elicit those capabilities incrementally.

Making the claim more concise: it seems effectively guaranteed that the natural optimization endpoint of one of these architectures would be plenty general to eat the universe if it were aimed in that direction. That process wouldn't need to involve slipping any of the low-level constraints.

I'm guessing the disconnect between our models is where the aiming happens. I'm proposing that the aiming is best (and convergently) handled outside the scope of wildly unpredictable and unconstrained optimization processes. Instead, it takes place at a level where a system of extreme capability infers the gaps in specifications and applies conditions robustly. The obvious and trivial version of this is conditioning through prompts, but this is a weak and annoying interface. There are other paths that I suspect bottom out at equivalent power/safety yet should be far easier to use in a general way. These paths allow incremental refinement by virtue of not automatically summoning up incorrigible maximizers by default.

If the result of refinement isn't an incorrigible maximizer, then slipping the higher level "constraints" of this aiming process isn't convergent (or likely), and further, the nature of these higher-level constraints would be far more thorough than anything we could achieve with RLHF.

In fact, my model says there's no fundamental typological difference between "a practical heuristic on how to do a thing" and "a value" at the level of algorithmic implementation. It's only in the cognitive labels we-the-general-intelligences assign them.

That's pretty close to how I'm using the word "value" as well. Phrased differently, it's a question of how the agent's utilities are best described (with some asterisks around the non-uniqueness of utility functions and whatnot), and observable behavior may arise from many different implementation strategies—values, heuristics, or whatever.

Comment by porby on TurnTrout's shortform feed · 2023-11-07T04:32:06.225Z · LW · GW

I think we're using the word "constraint" differently, or at least in different contexts.

Sure! Human values are not arbitrary either; they, too, are very heavily constrained by our instincts. And yet, humans still sometimes become omnicidal maniacs, Hell-worshipers, or sociopathic power-maximizers. How come?

In terms of the type and scale of optimization constraint I'm talking about, humans are extremely unconstrained. The optimization process represented by our evolution is way out there in terms of sparsity and distance. Not maximally so—there are all sorts of complicated feedback loops in our massive multiagent environment—but it's nothing like the value constraints on the subset of predictors I'm talking about.

To be clear, I'm not suggesting "language models are tuned to be fairly close to our values." I'm making a much stronger claim that the relevant subset of systems I'm referring to cannot express unconditional values over external world states across anything resembling the training distribution, and that developing such values out of distribution in a coherent goal directed way practically requires the active intervention of a strong adversary. In other words:

A homunculus needs to be able to nudge these constraints somehow, for it to be useful, and its power grows the more it's able to disregard them.

...

These constraints do not generalize as fast as a homunculus' understanding goes.

I see no practical path for a homunculus of the right kind, by itself, to develop and bypass the kinds of constraints I'm talking about without some severe errors being made in the design of the system.

Further, this type of constraint isn't the same thing as a limitation of capability. In this context, with respect to the training process, bypassing these kinds of constraints is kind of like a car bypassing having-a-functioning-engine. Every training sample is a constraint on what can be expressed locally, but it's also information about what should be expressed. They are what the machine of Bayesian inference is built out of.

In other words, the hard optimization process is contained to a space where we can actually have reasonable confidence that inner alignment with the loss is the default. If this holds up, turning up the optimization on this part doesn't increase the risk of value drift or surprises, it just increases foundational capability.

The ability to use that capability to aim itself is how the foundation becomes useful. The result of this process need not result in a coherent maximizer over external world states, nor does it necessarily suffer from coherence death spirals driving it towards being a maximizer. It allows incremental progress.

(That said: this is not a claim that all of alignment is solved. These nice properties can be broken, and even if they aren't, the system can be pointed in catastrophic directions. An extremely strong goal agnostic system like this could be used to build a dangerous coherent maximizer (in a nontrivial sense); doing so is just not convergent or particularly useful.)

Comment by porby on TurnTrout's shortform feed · 2023-11-06T20:31:59.810Z · LW · GW

My model says that general intelligence[1] is just inextricable from "true-goal-ness". It's not that I think homunculi will coincidentally appear as some side-effect of capability advancement — it's that the capabilities the AI Labs want necessarily route through somehow incentivizing NNs to form homunculi. The homunculi will appear inasmuch as the labs are good at their jobs.

I've got strong doubts about the details of this. At the high level, I'd agree that strong/useful systems that get built will express preferences over world states like those that could arise from such homunculi, but I expect that implementations that focus on inducing a homunculus directly through (techniques similar to) RL training with sparse rewards will underperform more default-controllable alternatives.

My reasoning would be that we're bad at using techniques like RL with a sparse reward to reliably induce any particular behavior. We can get it to work sometimes with denser reward (e.g. reward shaping) or by relying on a beefy pre-existing world model, but the default outcome is that sparse and distant rewards in a high dimensional space just don't produce the thing we want. When this kind of optimization is pushed too far, it's not merely dangerous; it's useless.

I don't think this is temporary ignorance about how to do RL (or things with similar training dynamics). It's fundamental:

  1. Sparse and distant reward functions in high dimensional spaces give the optimizer an extremely large space to roam. Without bounds, the optimizer is effectively guaranteed to find something weird.
  2. For almost any nontrivial task we care about, a satisfactory reward function takes a dependency on large chunks of human values. The huge mess of implicit assumptions, common sense, and desires of humans are necessary bounds during optimization. This comes into play even at low levels of capability like ChatGPT.

Conspicuously, the source of the strongest general capabilities we have arises from training models with an extremely constraining optimization target. The "values" that can be expressed in pretrained predictors are forced into conditionalization as a direct and necessary part of training; for a reasonably diverse dataset, the resulting model can't express unconditional preferences regarding external world states. While it's conceivable that some form of "homunculi" could arise, their ability to reach out of their appropriate conditional context is directly and thoroughly trained against.

In other words, the core capabilities of the system arise from a form of training that is both densely informative and blocks the development of unconditional values regarding external world states in the foundational model.

Better forms of fine-tuning, conditioning, and activation interventions (the best versions of each, I suspect, will have deep equivalences) are all built on the capability of that foundational system, and can be directly used to aim that same capability. Learning the huge mess of human values is a necessary part of its training, and its training makes eliciting the relevant part of those values easier—that necessarily falls out of being a machine strongly approximating Bayesian inference across a large dataset.

The final result of this process (both pretraining and conditioning or equivalent tuning) is still an agent that can be described as having unconditional preferences about external world states, but the path to get there strikes me as dramatically more robust both for safety and capability.

Summarizing a bit: I don't think it's required to directly incentivize NNs to form value-laden homunculi, and many of the most concerning paths to forming such homunculi seem worse for capabilities.

Comment by porby on Parametrically retargetable decision-makers tend to seek power · 2023-11-03T22:16:09.715Z · LW · GW

If LLMs end up being useful, how do they get around these theorems? Can we get some result where if RLHF has a capabilities component and a power-averseness component, the capabilities component can cause the agent to be power-seeking on net?

Intuitively, eliciting that kind of failure seems like it would be pretty easy, but it doesn't seem to be a blocker for the usefulness of the generalized form of LLMs. My mental model goes something like:

  1. Foundational goal agnosticism evades optimizer-induced automatic doom, and 
  2. Models implementing a strong approximation of Bayesian inference are, not surprisingly, really good at extracting and applying conditions, so
  3. They open the door to incrementally building a system that holds the entirety of a safe wish.

Things like "caring about means," or otherwise incorporating the vast implicit complexity of human intent and values, can arise in this path, while I'm not sure the same can be said for any implementation that tries to get around the need for that complexity.

It seems like the paths which try to avoid importing the full complexity while sticking to crisp formulations will necessarily be constrained in their applicability. In other words, any simple expression of values subject to optimization is only safe within a bounded region. I bet there are cases where you could define those bounded regions and deploy the simpler version safely, but I also bet the restriction will make the system mostly useless.

Biting the bullet and incorporating more of the necessary complexity expands the bounded region. LLMs, and their more general counterparts, have the nice property that turning the screws of optimization on the foundation model actually makes this safe region larger. Making use of this safe region correctly, however, is still not guaranteed😊

Comment by porby on FAQ: What the heck is goal agnosticism? · 2023-11-02T22:53:24.748Z · LW · GW

In my view, if we’d feed a good enough maximizer with the goal of learning to look as if they were a unified goal agnostic agent, then I’d expect the behavior of the resulting algorithm to handle the paradox well enough it’ll make sense.

If you successfully gave a strong maximizer the goal of maximizing a goal agnostic utility function, yes, you could then draw a box around the resulting system and correctly call it goal agnostic.

In my view our volitions look as if from a set of internal thermostats that impulse our behaviors, like the generalization to low n of the spontaneous fighting dance of two thermostats. If the latter can be described as goal agnostic, I don't think the former shall not (hence my examples of environmental constraints that could let someone use your or my personality as a certified subprogram).

Composing multiple goal agnostic systems into a new system, or just giving a single goal agnostic system some trivial scaffolding, does not necessarily yield goal agnosticism in the new system. It won't necessarily eliminate it, either; it depends on what the resulting system is.

Yes, but shall we also agree that non-goal agnostic agents can produce goal agnostic agents?

Yes; during training, a non-goal agnostic optimizer can produce a goal agnostic predictor.

Comment by porby on Symbol/Referent Confusions in Language Model Alignment Experiments · 2023-10-27T19:44:36.257Z · LW · GW

I agree with the specific claims in this post in context, but the way they're presented makes me wonder if there's a piece missing which generated that presentation.

And the key question for corrigibility is what actions the model would take in response to that observation, which is just a totally different question from how it responds to some user’s natural-language query about being turned off.

It is correct to say that, if you know nothing about the nature of the system's execution, this kind of natural language query is very little information. A deceptive system could output exactly the same thing. It's stronger evidence that the system isn't an agent that's aggressively open with its incorrigibility, but that's pretty useless.

If you somehow knew that, by construction of the underlying language model, there was a strong correlation between these sorts of natural language queries and the actions taken by a candidate corrigible system built on the language model, then this sort of query is much stronger evidence. I still wouldn't call it strong compared to a more direct evaluation, but in this case, guessing that the maybeCorrigibleBot will behave more like the sample query implies is reasonable.

In other words:

Me: Yet more symbol-referent confusion! In fact, this one is a special case of symbol-referent confusion which we usually call “gullibility”, in which one confuses someone’s claim of X (the symbol) as actually implying X (the referent).

If you intentionally build a system where the two are actually close enough to the same thing, this is no longer a confusion.

If my understanding of your position is correct: you wouldn't disagree with that claim, but you would doubt there's a good path to a strong corrigible agent of that approximate form built atop something like modern architecture language models but scaled up in capability. You would expect many simple test cases with current systems like RLHF'd GPT4 in an AutoGPT-ish scaffold with a real shutdown button to work but would consider that extremely weak evidence about the safety properties of a similar system built around GPT-N in the same scaffold.

If I had to guess where we might disagree, it would be in the degree to which language models with architectures similar-ish to current examples could yield a system with properties that permit corrigibility. I'm pretty optimistic about this in principle; I think there is a subset of predictive training that yields high capability with an extremely constrained profile of "values" that make the system goal agnostic by default. I think there's a plausible and convergent path to capabilities that routes through corrigible-ish systems by necessity and permits incremental progress on real safety.

I've proven pretty bad at phrasing the justifications concisely, but if I were to try again: the relevant optimization pressures during the kinds of predictive training I'm referring to directly oppose the development of unconditional preferences over external world states, and evading these constraints carries a major complexity penalty. The result of extreme optimization can be well-described by a coherent utility function, but one representing only a conditionalized mapping from input to output. (This does not imply or require cognitive or perceptual myopia. This also does not imply that an agent produced by conditioning a predictor remains goal agnostic.)
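As a minimal sketch of what "a conditionalized mapping from input to output" means here, using nothing beyond the standard predictive loss: the training objective is

$$\mathcal{L}(\theta) = \mathbb{E}_{x \sim D}\!\left[-\sum_{y} p(y \mid x)\,\log q_\theta(y \mid x)\right],$$

which is minimized, for each context $x$ separately, by $q_\theta(\cdot \mid x) = p(\cdot \mid x)$. Nothing in this objective rewards the model for influencing which $x$ it receives, so the "utility function" that describes the extreme optimum is just fidelity of the conditional map $x \mapsto p(\cdot \mid x)$, not a preference over external world states.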

A second major piece would be that this subset of predictors also gets superhumanly good at "just getting what you mean" (in a particular sense of the phrase) because it's core to the process of Bayesian inference that they implement. They squeeze an enormous amount of information out of every available source of conditions, and stronger such models do so even more. This doesn't mean that the base system will just do what you mean, but it is the foundation on which you can more easily build useful systems.
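A toy illustration of that squeezing (toy numbers of my own; real models do this implicitly over vastly more hypotheses): every hint in the prompt reweights the latent hypotheses about what process generated it, and the prediction is the posterior-weighted mixture.

```python
# Toy posterior over latent "what kind of text is this?" hypotheses,
# weighted by how well each explains the prompt so far (all numbers assumed).
prior = {"helpful_assistant": 0.5, "fiction": 0.3, "spam": 0.2}
likelihood_of_prompt = {"helpful_assistant": 0.9, "fiction": 0.2, "spam": 0.01}

unnormalized = {h: prior[h] * likelihood_of_prompt[h] for h in prior}
z = sum(unnormalized.values())
posterior = {h: round(v / z, 3) for h, v in unnormalized.items()}
print(posterior)  # {'helpful_assistant': 0.879, 'fiction': 0.117, 'spam': 0.004}

# The next-token distribution is the posterior-weighted mixture of each
# hypothesis's continuation distribution; every extra scrap of conditioning
# (tone, formatting, stated task) sharpens this posterior further.
```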

There are a lot more details that go into this that can be found in other walls of text.

On a meta level:

That conversation we just had about symbol/referent confusions in interpreting language model experiments? That was not what I would call an advanced topic, by alignment standards. This is really basic stuff. (Which is not to say that most people get it right, but rather that it's very early on the tech-tree.) Like, if someone has a gearsy model at all, and actually thinks through the gears of their experiment, I expect they'll notice this sort of symbol/referent confusion.

I've had the occasional conversation that, vibes-wise, went in this direction (not with John).

It's sometimes difficult to escape that mental bucket after someone pattern matches you into it, and it's not uncommon for the heuristic to result in one half of the conversation sounding like this post. There have been times when the other person went into teacher-mode and tried e.g. a socratic dialogue to get me to realize an error they thought I was making, only to discover some minutes later that the claim I was making was unrelated to, and not in contradiction with, the point they were making.

This isn't to say "and therefore you should put enormous effort into reading the manifesto of every individual who happens to speak with you and never use any conversational heuristics," but I worry there's a version of this heuristic happening at the field level with respect to things that could sound like "language models solve corrigibility and alignment."

Comment by porby on FAQ: What the heck is goal agnosticism? · 2023-10-26T16:37:48.470Z · LW · GW

In this very sense, one cannot want an external world state that is already in place, correct?

An agent can have unconditional preferences over world states that are already fulfilled. A maximizer doesn't stop being a maximizer just because the thing it's maximizing is already maximized.

Let’s say we want to maximize the number of digits of pi we explicitly know.

That's definitely a goal, and I'd describe an agent with that goal as both "wanting" in the previous sense and not goal agnostic.

Also, what about the thermostat question above?

If the thermostat is describable as goal agnostic, then I wouldn't say it's "wanting" by my previous definition. If the question is whether the thermostat's full system is goal agnostic, I suppose it is, but in an uninteresting way.

(Note that if we draw the agent-box around 'thermostat with temperature set to 72' rather than just 'thermostat' alone, it is not goal agnostic anymore. Conditioning a goal agnostic agent can produce non-goal agnostic agents.)
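A minimal sketch of that box-drawing point (hypothetical toy code, of course):

```python
from functools import partial

def thermostat(reading: float, setpoint: float) -> str:
    """Pure input->output mapping: no unconditional preference over world states."""
    if reading < setpoint - 0.5:
        return "heat"
    if reading > setpoint + 0.5:
        return "cool"
    return "idle"

# Draw the box around "thermostat with setpoint 72" instead, and the resulting
# system persistently pushes the room toward 72 degrees: conditioning the goal
# agnostic mapping yields something describable as having a goal.
thermostat_at_72 = partial(thermostat, setpoint=72.0)
print(thermostat_at_72(65.0))  # "heat"
print(thermostat_at_72(80.0))  # "cool"
```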