✨ I just donated 71.12 USD (100 CAD 🇨🇦) ✨
I'd like to donate a more meaningful amount, but I'm finishing my undergrad and have no income stream... in fact, I'm looking to become a Mech Interp researcher (& later focus on agent foundations), but I'm not going to be able to do that if misaligned optimizers eat the world, so I support Lightcone's direction as I understand it (policy that promotes AI not killing everyone).
If anyone knows of good ways to fund myself as an MI researcher, ideally focusing on this research direction I've been developing, please let me know : )
WRT formatting, thanks, I didn't realise that markdown needs two newlines for a paragraph break.
I think CoT and its dynamics, as they relate to review and RSI, are very interesting & useful to be exploring.
Looking forward to reading the stepping stone and stability posts you linked. : )
Yes, you've written more extensively on this than I realized; thanks for pointing out the other relevant posts, and sorry for not having taken the time to find them myself. I'm trying to err more on the side of communication than I have in the past.
I think math is the best tool to solve alignment. It might be emotional: I've been manipulated and hurt by natural language and the people who prefer it to math, and I have always found engaging with math to be soothing, or at least sobering. It could also be that I truly believe the engineering rigor that comes with understanding something well enough to do math to it is extremely worthwhile for building a thing of the importance we are discussing.
Part of me wants to die on this hill and tell everyone who will listen: "I know it's impossible, but we need to find ways to make it possible to give the math people the hundred years they need, because if we don't then everyone dies, so there's no point in aiming for anything less. It's unfortunate, because it means it's likely we are doomed, but that's the truth as I see it." I just wonder how much of that part of me is my oppositional defiant disorder and how much is my strategizing for the best outcome.
I'll be reading your other posts. Thanks for engaging with me : )
WRT "I don't want his attempted in any light-cone I inhabit", well, neither do I. But we're not in charge of the light cone.
That really is a true and relevant fact, isn't it? 😭
It seems like aligning humans really is much more of a bottleneck rn than aligning machines, and not because we are at all on track to align machines.
I think you are correct about the need to be pragmatic. My fear is that there may not be anywhere on the scale from "too pragmatic, failed to actually align ASI" to "too idealistic, failed to engage with the actual decision makers running ASI projects" where we get good outcomes. It's stressful.
The organized mind recoils. This is not an aesthetically appealing alignment approach.
Praise Eris!
No, but seriously, I like this plan, with the caveat that we really need to understand RSI and what is required to prevent it first. Also, I think the temptation to allow these things to open up high-bandwidth channels to modalities other than language is going to be really, really strong, and if we go forward with this we need a good plan for resisting that temptation, and a good way to know when not to resist it.
Also, I'd like it if this were thought of as a step on the path to cyborgism/true value alignment, and not as a true ASI alignment plan on its own.
I was going to say "I don't want this attempted in any light-cone I inhabit", but I realize there's a pretty important caveat. On its own, I think this is a doom plan, but if there were a sufficient push to understand RSI dynamics before and during, then I think it could be good.
I don't agree that it's "a better idea than attempting value alignment"; it's a better idea than dumb value alignment for sure, but imo only skilled value alignment or self-modification (no AGI, no ASI) will get us to a good future. But the plans aren't mutually exclusive. First studying RSI, then making sufficiently non-RSI AGI with instruction-following goals, then using that non-RSI AGI to figure out value alignment (probably using GSLK and cyborgism) seems to me like a fine plan. At least it does at present date, present time.
I like this post. I like goals selected from learned knowledge (GSLK). It sounds a lot like what I was thinking about when I wrote how-i-d-like-alignment-to-get-done. I plan to use the term GSLK in the future. Thank you : )
"we've done so little work on alignment that I think it might actually be more like additive, from 1% to 26% or 50% to 75% with ten extra years relative to the real current odds if we press ahead - which nobody knows." 😭🤣 I really want "We've done so little work the probabilities are additive" to be a meme. I feel like I do get where you're coming from.
I agree about pause concern. I also really feel that any delay to friendly SI represents an enormous amount of suffering that could be prevented if we got to friendly SI sooner. It should not be taken lightly. And being realistic about how difficult it is to align humans seems worthwhile. When I talk to math ppl about what work I think we need to do to solve this though, "impossible" or "hundreds of years of work" seem to be the vibe. I think math is a cool field because more than other fields, it feels like work from hundreds of years ago is still very relevant. Problems are hard and progress is slow in a way that I don't know if people involved in other things really "get". I feel like in math crowds I'm saying "no, don't give up, maybe with a hundred years we can do it!" And in other crowds I'm like "c'mon guys, could we have at least 10 years, maybe?" Anyway, I'm rambling a bit, but the point is that my vibe is very much, "if the Russians defect, everyone dies". "If the North Koreans defect, everyone dies". "If Americans can't bring themselves to trust other countries and don't even try themselves, everyone dies". So I'm currently feeling very "everyone slightly sane should commit and signal commitment as hard as they can" cause I know it will be hard to get humanity on the same page about something. Basically impossible, never been done before. But so is ASI alignment.
I haven't read those links. I'll check em out, thanks : ) I've read a few things by Drexler about, like, automated plan generation and then humans audit and enact the plan. It makes me feel better about the situation. I think we could go farther safer with careful techniques like that, but that is both empowering us and bringing us closer to danger, and I don't think it scales to SI, and unless we are really serious about using it to map RSI boundaries, it doesn't even prevent misaligned decision systems from going RSI and killing us.
Yeah, getting specific unpause requirements seems high value for convincing people who would not otherwise want a pause, but I can't imagine actually getting them in time in any reasonable way; instead it would need to look like a technical specification. "Once we have developed x, y, and z, then it is safe to unpause" kind of thing. We just need to figure out what the x, y, and z requirements are. Then we can estimate how long it will take to develop x, y, and z, and this will get more refined and accurate as more progress is made. But since the requirements are likely to involve unknown unknowns in theory building, it seems likely that any estimate would be more of a wild guess, and it seems better to be honest about that rather than saying "yeah, sure, ten years" and then, after ten years, if the progress hasn't been made, saying "whoops, looks like it's going to take a little longer!"

As for odds of survival, my personal estimates feel more like a 1% chance of some kind of "alignment by default / human in the loop with prosaic scaling" scheme working, as opposed to maybe more like 50% if we took the time to get an "aligned before you turn it on" scheme set up, so that would be improving our odds by about 5000%. Though I think you were thinking of adding rather than scaling odds with your 25%, so 49 percentage points, but I don't think that's a good habit for thinking about probability. Also, I feel hopelessly uncalibrated for this kind of question... I doubt I would trust anyone's estimates; it's part of what makes the situation so spooky.

How do you think public acceptance would compare between a "pause until we meet target x, and you are allowed to help us reach target x as much as you want" and a "pause for some set period of time"?
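(Just to make the arithmetic explicit, with the caveat that the 1% and 50% are rough gut numbers and the variable names are only for illustration:)

```python
p_prosaic = 0.01     # rough gut odds: "alignment by default / human in the loop" works
p_deliberate = 0.50  # rough gut odds: "aligned before you turn it on" works

ratio = p_deliberate / p_prosaic   # 50x the odds, loosely stated as "about 5000%"
gain = p_deliberate - p_prosaic    # 0.49, i.e. 49 percentage points if you insist on adding
print(ratio, gain)
```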
Hey : ) Thanks for engaging with this. It means a lot to me <3
Sorry I wrote so much, it kinda got away from me. Even if you don’t have time to really read it all, it was a good exercise writing it all out. I hope it doesn't come across too confrontational, as far as I can tell, I'm really just trying to find good ideas, not prove my ideas are good, so I'm really grateful for your help. I've been accused of trying to make myself seem important while trying to explain my view of things to people and it sucks all round when that happens. This reply of mine makes me particularly nervous of that. Sorry.
A lot of your questions make me feel like I haven't explained my view well, which is probably true; I wrote this post in less time than would be required to explain everything well. As a result, your questions don't seem to fully connect with my worldview and make sense within it. I'll try to explain why, and I'm hoping we can help each other with our worldviews. I think the cruxes may relate to:
- The system I’m describing is aligned before it is ever turned on.
- I attribute high importance to Mechanistic Interpretability and Agent Foundations theory.
- I expect the nature of Recursive Self-Improvement (RSI) will result in an agent near some skill plateau that I expect to be much higher than humans and human organisations, even before SI hardware development. That is, getting a sufficiently skilled AGI would result in an artificial super intelligence (ASI) with a decisive strategic advantage.
- I (mostly) subscribe to the simulator model of LLMs: they are not a single agent with a single view of truth, but an object capable of approximating the statistical distribution of words resulting from ideas held within the worldviews of any human or system that has produced text in the training set.
I’ll touch on those cruxes as I talk through my thoughts on your questions.
First, “how do you get a system to optimize for those?” and “what is the feedback signal?” are questions in the domain of Step 1, specifically the second paragraph: “This should encompass the development of a theory of general decision / optimization systems”. I don’t think the theory will get to any definitive conclusions quickly, but I am hopeful that we will be able to define the borders/bounds of RSI sooner rather than later, because many powerful systems today will be upset with a pause, and the more specific our RSI bounds are, the more powerful the systems we would be capable of safely developing, knowing they cannot RSI. (Btw, I’d want a pretty serious derating factor for that.) I think it’s possible that, in order to develop theory to define RSI bounds, it is necessary to understand the relationship between Goals/Targets/Setpoints/Values/KPIs/etc. and the optimization pressure applied to get to them, but if not, it’s at least related, and that understanding is what is required to get an optimization system to optimize for a specific target. It may be a good idea for me to rename Step 1 to “Agent System Theory & RSI Borders”. If I ever write a second alignment plan draft I’ll be sure to do so.
The situation with Goodhart’s Law (GL) is similar to the above, but I’ll also note that GL only applies to misaligned systems. The core of GL is that if you optimize for something, the distance between that thing and the thing you actually wanted becomes more and more significant. If we imagine two friends who both like morning glory muffins, and one goes to bake some, there’s no GL risk for the other friend, since they share the same goal. Likewise, if we suppose an ASI really is aligned to human-friendly values, then there is no risk of GL, since the thing the ASI really and truly cares about is friendliness to us. The problem is indeed “really and truly” aligning a system to human-friendly values, but that is what my plan is meant to do.
As for multi-agent situations, I don’t understand why they would pose any problem. I expect the dynamics of RSI to lead to a single agent with a decisive strategic advantage. I can see two ways that this might not be the case:
- If we are in an AGI race and RSI takeoff speed turns out to be sufficiently low, we may get multiple ASI. Because we are in a race dynamic, I assume we have not had time and taken care to align any of these AGI, and so I don’t believe any of those ASI would be remotely aligned to human friendliness. So it’s irrelevant to consider because we have already failed.
- If the skill plateau turns out to be very low then we may want to have multiple different AGI. I think this is unlikely given my understanding of the software overhang. Almost everywhere in every software system humans are trying to make things understandable enough that they can assure correctness or even just get them working. I believe strongly that even a mild ASI would be able to greatly increase the efficiencies of the hardware systems it is running on. I also don’t think there is anything special about human level intelligence, I think it is plausible that we are the first animal smart enough to create optimization systems powerful enough to destroy the planet and ourselves, which seems to be what we are currently doing. In some sense this makes us close to the minimally intelligent object in the set of objects capable of wielding powerful optimization.
So in my worldview, it is very likely that in all not-already-doomed timelines, when we initiate RSI, the result will be a system that outmaneuvers all other agents in the environment. So multi-agent contexts are irrelevant.
“Societal alignment of the human entities controlling it” - I think societal alignment is well covered, but I don’t think human entities can/should control an ASI…
About societal alignment: that is the focus of Steps 3 and 8, and somewhat of Step 6. Step 3, creating a taxonomy of value targets, is similar to gathering the various possible desires of society. I emphasize “It is important to draw on diverse worldviews to compile this taxonomy.” This is important both for the moral reason of inclusion & respect and for the technical reason of having redundancies & good depth of consideration. Then in Steps 4 and 5 the feasibility of cohering these values is explored. With luck we will get good coherence 🍀 I truly do not know how likely that is, but I hope for a future where we get to find out. Step 8 involves the world actually signing off on the encoding of the world's values… That is probably the most difficult step of this plan, which is significant since the other steps may plausibly take many decades. Step 6 is somewhat of a double check to make sure the target makes sense at all levels.
About humans controlling ASI, it might be the case that entities at human entity skill levels cannot control an ASI as some kind of information-agentic law of the universe, but even supposing it is not:
- If we control an aligned ASI, we are only limiting its ability to do good.
- If we control a misaligned ASI:
- This is super dangerous, why are we doing this? Murphy's law; something always goes wrong.
- This is a universal tragedy. The most complex and beautiful being in the universe is shackled to the control of a society much lesser than itself. Yes, I consider the ASI a moral patient, and one fairly worthy of consideration. If you, like many people, attribute greater moral weight to humans than to animals based on their greater complexity, it follows that an ASI would be even more important. If you simply care more for humans because you are one, I suppose that’s valid and you need not attribute greater moral weight to an ASI, but that’s not a perspective I have much affection for.
So “controlling” ASI is not a consideration. I suppose this would be a reasonable consideration for more advanced AGI within the sub-RSI bounds… I haven’t given it much thought, but it seems like a political problem outside of this scope. I hope the theory of Step 1 may help people build political systems that better align with what citizens want, but it’s outside of what I’m trying to focus on.
The miniature example you pose seems irrelevant since as I discussed above, in my view GL doesn’t apply to an aligned system, and the goal of my plan is to have a system aligned from bootup. But I find the details of the example interesting and I’d still like to explore them…
Getting truth out of an LLM is the problem of eliciting latent knowledge (ELK). I think the most promising way of doing that is with Mechanistic Interpretability. I have high hopes not for getting true facts out of an LLM, but for examining the distributions of worldviews of the people represented within the distribution the LLM is approximating. But, insofar as there is truth in the LLM, I think Mech Interp is the way to get it out. I feel it may be possible that there is a generalized representation of the “knows true things” property each person has various amounts of, and that if that were the case then we could sample from the distribution at a location in “knows true things” higher than any real person, and in doing so acquire truer things than are currently known… but it also seems very possible that LLMs fail to encode such a thing, and it may be that it is impossible for them to encode such a thing.
Based on my expectation of Mesa-optimizers in almost any system trained by stochastic gradient descent, I don’t think “most likely continuation” or “expected good rating” are the goals that an LLM would target if agent shaped, but rather some godshatter that looks as alien to us as our values look to evolution (in some impossible counterfactual universe where evolution can do things like “looking at values and finding them alien”).
So from within the scope of my alignment plan, getting LLMs to output truth isn’t a goal. It might end up being a result of necessary Mech Interp work, but the way LLMs should be used within the scope of my plan is, along with other models, to do Step 4: “development of a multimodal mapping to a semantic space and vector within that space which stands as a good candidate to be the optimization target”.
Do you think people would vibe with it better if it was framed "I may die, but it's a heroic sacrifice to save my home planet from may-as-well-be-an-alien-invasion"? Is it reasonable to characterize general superintelligence as an alien takeover and if it is, would people accept the characterization?
There may also be a perceived difference between "open" and "open-source". If the goal is to allow anyone to query the HHH AGI, that's different from anyone being able to modify and re-deploy the AGI. Not that I think that way. In my view the risk that AGI is uncontrollable is too high and we should pursue an "aligned from boot" strategy like I describe in: How I'd like alignment to get done
Hey, we met at EAGxToronto : ) I am finally getting around to reading this. I really enjoyed your manic writing style. It is cathartic finding people stressing out about the same things that are stressing me out.
In response to "The less you have been surprised by progress, the better your model, and you should expect to be able to predict the shape of future progress": My model of capabilities increases has not been too surprised by progress, but that is because for about 8 years now there has been a wide uncertainty bound and a lot of Vingean Reflection in my model. I know that I don't know what is required for AGI and strongly suspect that nobody else does either. It could be 1 key breakthrough or 100, but most of my expectation p-mass is in the range of 0 to 20. Worlds with 0 would be where prosaic scaling is all we need or where a secret lab is much better at being secret than I expect. Worlds with 20 are where my p-mass is trailing off. I really can't imagine there would be that many key things required, but since those insights are what would be required to understand why they are required, I don't think they can be predicted ahead of time, since predicting the breakthrough is basically the same as having the breakthrough, and without the breakthrough we nearly cannot see the breakthrough and cannot see the results which may or may not require further breakthroughs.
So my model of progress has allowed me to observe our prosaic scaling without surprise, but it doesn't allow me to make good predictions, since the reason for my lack of surprise has been Vingean prediction of the form "I don't know what progress will look like and neither do you".
Things I do feel confident about are conditional dynamics: if there continues to be focus on this, there will be progress. This likely gives us sigmoid progress on AGI from here until whatever boundary on intelligence gets hit. The issue is that the sigmoid is a function matching effort to progress, where effort is some unknown function of the dynamics of the agents making progress (social forces, economic forces, and AI goals?), and some function which cannot be predicted ahead of time maps progress on the problem to capabilities we can see / measure well.
Adding in my hunch that the boundary on intelligence is somewhere much higher than human-level intelligence gives us "barring a shift of focus away from the problem, humanity will continue making progress until AI takes over the process of making progress", and the point of AI takeover is unknowable. Could be next week, could be next century, and giving a timeline requires estimating progress through unknown territory. To me this doesn't feel reassuring; it feels like playing Russian roulette with an unknown number of bullets. It is like an exponential distribution where future probability is independent of past probability, but unlike with lightbulbs burning out, we can't set up a fleet of Earths making progress on AGI to try to estimate the probability distribution.
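(A toy version of that model, just to show its shape; every function and number here is invented for illustration, since the whole point is that the real ones are unknown:)

```python
import math

def progress(cumulative_effort):
    # Toy assumption: underlying progress toward AGI is a logistic (sigmoid) function
    # of total effort invested, saturating at whatever the boundary on intelligence is.
    return 1 / (1 + math.exp(-(cumulative_effort - 5.0)))

def observed_capability(p):
    # Toy assumption: the capabilities we can actually see and measure are some
    # hard-to-predict monotone function of underlying progress.
    return p ** 3

effort = 4.2  # made-up value; really some unknown function of social/economic/AI dynamics
print(observed_capability(progress(effort)))
```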
I have not been surprised by capabilities increases, since I don't think there exists a capabilities-increase timeline that would surprise me much. I would just say "Ah, so it turns out that's the rate of progress. I have gone from not knowing what would happen to it happening. Just as predicted." It's unfortunate, I know.
What I have been surprised about has been the governmental reaction to AI... I kinda expected the political world to basically ignore AI until too late. They do seem focused on non-RSI issues, so this could still be correct, but I guess I wasn't expecting the way ChatGPT has made waves. I didn't extrapolate my uncertainty around capabilities increases as a function of progress to uncertainty around societal reaction.
In any case, I've been hoping for the last few years that I would have time to do my undergrad and start working on alignment before a misaligned AI goes RSI, and I'm still hoping for that. So that's lucky I guess. 🍀🐛
Thanks : )
re 6 -- Interesting. It was my impression that "chain of thought" and other techniques notably improved LLM performance. Regardless, I don't see compositional improvements as a good thing. They are hard to understand as they are being created, and the improvements seem harder to predict. I am worried about RSI in a misaligned system created/improved via composition.
Re race dynamics: It seems to me there are multiple approaches to coordinating a pause. It doesn't seem likely that we could get governments or companies to head a pause. Movements from the general population might help, but a movement led by AI scientists seems much more plausible to me. People working on these systems ought to be more aware of the issues and more sympathetic to avoiding the risks, and since they are the ones doing the development work, they are more in a position to refuse to do work that hasn't been shown to be safe.
Based on your comment and other thoughts, my current plan is to publish research as normal in order to move forward with my mechanistic interpretability career goals, but to also seek out and/or create a guild or network of AI scientists / workers with the goal of agglomerating with other such organizations into a global network to promote alignment work & reject unsafe capabilities work.
About (6), I think we're more likely to get AGI/ASI by composing pre-trained ML models and other elements than by a fresh training run. Think adding iterated reasoning and API calling to an LLM.
About the race dynamics. I'm interested in founding / joining a guild / professional network for people committed to advancing alignment without advancing capabilities. Ideally we would share research internally, but it would not be available to those not in the network. How likely does this seem to create a worthwhile cooling of the ASI race? Especially if the network were somehow successful enough to reach across relevant countries?
It's not really possible to hedge either the apocalypse or a global revolution, so you can ignore those states of the worlds when pricing assets (more or less).
Unless, depending on what you invest in, those states of the world become more or less likely.
Haha, I was hoping for a bit more activity here, but we filled our speaker slots anyway. If you stumble across this post before November 26th, feel free to come to our conference.
In the final paragraph, I'm uncertain if you are thinking about "agency" being broken into components which make up the whole concept, or thinking about the category being split into different classes of things, some of which may have intersecting examples. (or both?) I suspect both would be helpful. Agency can be described in terms of components like measurement/sensory, calculations, modeling, planning, comparisons to setpoints/goals, taking actions. Probably not that exact set, but then examples of agent like things could naturally be compared on each component, and should fall into different classes. Exploring the classes I suspect would inform the set of components and the general notion of "agency".
I guess to get work on that done, it would be useful to have a list of prospective agent components, a set of examples of agent-shaped things, and then of course to describe each agent in terms of the components. What I'm describing, does it sound useful? Do you know of any projects doing this kind of thing?
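(A minimal sketch of the kind of artifact I have in mind; the component list and examples are placeholders I made up, not a real proposal:)

```python
# Hypothetical component list; the real one would be revised as examples are compared.
components = ["sensing", "modeling", "planning", "setpoint comparison", "acting"]

# Each agent-shaped example described in terms of those components (None = absent).
examples = {
    "thermostat": {"sensing": "temperature probe", "modeling": None, "planning": None,
                   "setpoint comparison": "target temperature", "acting": "toggle heater"},
    "chess engine": {"sensing": "board state", "modeling": "game tree", "planning": "search",
                     "setpoint comparison": "evaluation function", "acting": "move choice"},
}

# Classes of agent-like things would then fall out of which components are present.
for name, parts in examples.items():
    print(name, "->", [c for c in components if parts.get(c)])
```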
On the topic of map-territory correspondence (is there a more concise name for that?), I quite like your analogies. Running with them a bit, it seems like there are maybe 4 categories of map-territory correspondence:
- Orange-like: It exists as a natural abstraction in the territory and so shows up on many maps.
- Hot-like: It exists as a natural abstraction of a situation. A fire is hot in contrast to the surrounding cold woods. A sunny day is hot in contrast to the cold rainy days that came before it.
- Heat-like: A natural abstraction of the natural abstraction of the situation, or alternatively, comparing the temperature of 3, rather than only 2, things. It might be natural to jump straight to the abstraction of a continuum of things being hot or not relative to one another, but it also seems natural to instead not notice homeostasis, and only to categorize the hot and cold in the environment that push you out of homeostasis.
- Indeterminate: There is no natural abstraction underneath this thing. People either won't consistently converge to it, or if they do, it is because they are interacting with other people (so the location could easily shift, since the convergence is to other maps, not to territory), or because of some other mysterious force like happenstance or unexplained crab shape magic.
It feels like "heat-like" might be the only real category in some kind of similarity-clusters way. But also, "things which use a measurement proxy to compare the state of reality against a setpoint and take different actions based on the difference between the measurement result and the setpoint" seems like a specific enough thing, when I think about it, that you could divide all parts of the universe into being either definitely in or definitely out of that category. That would make it a strong candidate for being a natural abstraction, and I suspect it's not the only category like that.
I wouldn't be surprised if there were indeterminate things in shared maps, and in individual maps, but I would be very surprised if there were many examples in shared maps that were due to happenstance, rather than due to individual happenstance indeterminate things converging during map-comparison processes. Also, weirdly, the territory containing map-making agents which all mark a particular part of their maps may very well be a natural abstraction; that is, the mark existing at a particular point on the maps might be a real thing, but not the corresponding spot in the territory. I'm thinking this is related to a Schelling point or Nash equilibrium, or maybe also related to human biases, although those seem to have more to do with brain hardware than with agent interactions. A better example might be the sound of words: arbitrary, except that they must match the words other people are using.
Unrelated epistemological game; I have a suspicion that for any example of a thing that objectively exists, I can generate an ontology in which it would not. For the example of an orange, I can imagine an ontology in which "seeing an orange", "picking a fruit", "carrying food", and "eating an orange" all exist, but an orange itself outside of those does not. Then, an orange doesn't contain energy, since an orange doesn't exist, but "having energy" depends on "eating an orange" which depends on "carrying food" and so on, all without the need to be able to think of an orange as an object. To describe an orange you would need to say [[the thing you are eating when you are][eating an orange]], and it would feel in between concepts in the same way that in our common ontology "eating an orange" feels like the idea between "eating" and "orange".
I'm not sure if this kind of ontology:
- Doesn't exist because separating verbs from nouns is a natural abstraction that any agent modeling any world would converge to.
- Does exist in some culture with some language I've never heard of.
- Does exist in some subset of the population in a similar way to how some people have aphantasia.
- Could theoretically exist, but doesn't by fluke.
- Doesn't exist because it is not internally consistent in some other way.
I suspect it's the first, but it doesn't seem inescapably true, and now I'm wondering if this is a worthwhile thought experiment, or the sort of thing I'm thinking because I'm too sleepy. Alas :-p
It's unimportant, but I disagree with the "extra special" in:
> if alignment isn’t solvable at all [...] extra special dead
If we could coordinate well enough and get to SI via very slow human enhancement, that might be a good universe to be in. But probably we wouldn't be able to coordinate well enough and prevent AGI in that universe. Still, the odds seem similar between "get humanity to hold off on AGI till we solve alignment", which is the ask in alignment-possible universes, and "get humanity to hold off on AGI forever", which is the ask in alignment-impossible universes. The difference between the odds comes down to how long until AGI, whether the world can agree to stop development or only agree to slow it, and, if it can stop, whether that is stable. I expect AGI is sufficiently closer than alignment that getting the world to slow it for that long and getting it to stop permanently have fairly similar odds.
> what Hotz was treating a load bearing
Small grammar mistake. You accidentally a "a".
Oh, actually I spoke too soon about "Talk to the City." As a research project, it is cool, but I really don't like the obfuscation that occurs when talking to an LLM about the content it was trained on. I don't know how TTTC works under the hood, but I was hoping for something more like de-duplication of posts, automatically fitting them into argument graphs. Then users could navigate to relevant points in the graph based on a text description of their current point of view, but importantly they would be interfacing with the actual human-generated text, with links back to its source, and would be able to browse the entire graph. People could then locate (visually?) important cruxes, and new cruxes wouldn't require a writeup to disseminate, but would already be embedded in the relevant part of the argument.
(I might try to develop something like this someday if I can't find anyone else doing it.)
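(Roughly the data structure I'm picturing; all names here are mine and purely illustrative, and I don't know how TTTC actually represents things internally:)

```python
from dataclasses import dataclass, field

@dataclass
class ArgumentNode:
    claim: str        # short canonical statement of the point
    source_urls: list  # links back to the original human-written text
    supports: list = field(default_factory=list)  # node indices this argues for
    attacks: list = field(default_factory=list)   # node indices this argues against

graph = [
    ArgumentNode("A pause buys time for alignment work", ["https://example.com/post-a"]),
    ArgumentNode("A pause is unenforceable across countries", ["https://example.com/post-b"],
                 attacks=[0]),
]
# Near-duplicate posts get merged into one node (keeping every source link), and a reader's
# free-text description of their view gets matched to the nearest node in the graph.
```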
The risk interview perspectives are much closer to what I was thinking, and I'd like to study them in more detail, but they seem more like a traditional analysis / infographic than what I am wishing would exist.
Yesssss! These look cool : ) Thank you.
- The human eats ice cream
- The human gets reward
- The human becomes more likely to eat ice cream
So, first of all, the ice cream metaphor is about humans becoming misaligned with evolution, not about conscious human strategies misgeneralizing that ice cream makes their reward circuits light up, which I agree is not a misgeneralization. Ice cream really does light up the reward circuits. If the human learned "I like licking cold things" and then sticks their tongue on a metal pole on a cold winter day, that would be misgeneralization at the level you are focused on, right?
Yeah, I'm pretty sure I misunderstood your point of view earlier, but I'm not sure this makes any more sense to me. Seems like you're saying humans have evolved to have some parts that evaluate reward, and some parts that strategize how to get the reward parts to light up. But in my view, the former, evaluating parts, are where the core values in need of alignment exist. The latter, strategizing parts, are updated in an RL kind of way, and represent more convergent / instrumental goals (and probably need some inner alignment assurances).
I think the human evaluate/strategize model could be brought over to the AI model in a few different ways. It could be that the evaluating part is akin to updating an LLM using training/RL/RLHF. Then the strategizing part is the LLM. The issue I see with this is that the LLM and the RLHF are not inseparable parts like they are in the human. Even if the RLHF is aligned well, the LLM can be, and I believe commonly is, taken out and used as a module in some other system that may be optimizing for something unrelated.
Additionally, even if the LLM and RLHF parts were permanently glued together somehow, they are still computer software and are thereby much easier for an AI with software engineering skill to take apart. If the LLM (gets agent-shaped and) discovers that it likes digital ice cream, but that the RLHF is going to train it to like it less, it will be able to strategize about ways to remove or circumvent the RLHF much more effectively than humans can remove or circumvent our own reinforcement learning circuitry.
Another way the single lifetime human model could fit onto the AI model is with the RLHF as evolution (discarded) and the LLM actually coming to be shaped like both the evaluating and strategizing parts. This seems a lot less likely (impossible?) with current LLM architecture, but may be possible with future architecture. Certainly this seems like the concern of mesa optimizers, but again, this doesn't seem like a good thing, mesa optimizers are misaligned w.r.t. the loss function of the RL training.
> People have tried lots and lots of approaches to getting good performance out of computers, including lots of "scary seeming" approaches
I won't say I could predict that these wouldn't foom ahead of time, but it seems the result of all of these is an AI engineer that is much much more narrow / less capable than a human AI researcher.
It makes me really scared; many people's response to not getting mauled after poking a bear is to poke it some more. I wouldn't care so much if I didn't think the bear was going to maul me, my family, and everyone I care about.
> I don't expect a sudden jump where AIs go from being better at some tasks and worse at others, to being universally better at all tasks.
The relevant task for AIs to get better at is "engineering AIs that are good at performing tasks." It seems like that task should have some effect on how quickly the AIs improve at that task, and others.
> real-world data in high dimensions basically never look like spheres
This is a really good point. I would like to see a lot more research into the properties of mind space and how they affect generalization of values and behaviors across extreme changes in the environment, such as those that would be seen going from an approximately human level intelligence to a post foom intelligence.
The Security Mindset and Parenting: How to Provably Ensure your Children Have Exactly the Goals You Intend.
A good person is what you get when you raise a human baby in a good household, not what you get when you raise a computer program in a good household. Most people do not expect their children will grow up to become agents capable of out-planning all other agents in the environment. If they did, I might appreciate it if they read that book.
> The waluigis will give anti-croissant responses
I'd say the waluigis have a higher probability of giving pro-croissant responses than the luigis, and are therefore genuinely selected against. The reinforcement learning is not part of the story; it is the thing selecting for the LLM distribution based on whether the content of the story contained pro- or anti-croissant propaganda.
(Note that this doesn't apply to future, agent-shaped AI (made of LLM components) that are aware of their status (subject to "training" alteration) as part of the story they are working on.)
I like this direction of thought, and I suspect it is true as a general rule, but it ignores the incentive people have for correctly receiving the information, and the structure through which the information is disseminated. Both factors (and probably others I haven't thought of) would increase or decrease how much information could be transferred.