One of the core problems of AI alignment is that we don't know how to reliably get goals into the AI - there are many possible goals that are sufficiently correlated with doing well on training data that the AI could wind up optimising for a whole bunch of different things.
Instrumental convergence claims that a wide variety of goals will lead to convergent subgoals such that the agent will end up wanting to seek power, acquire resources, avoid death, etc.
These claims do seem a bit...contradictory. If goals are really that inscrutable, why do we strongly expect instrumental convergence? Why won't we get some weird thing that happens to correlate with "don't die, keep your options open" on the training data, but falls apart out of distribution?
I found an error in the application - when removing the last item from the blacklist, every page not whitelisted is claimed to be blacklisted. Adding an item back to the blacklist fixes this. Other than that, it looks good!
Interesting. That does give me an idea for a potentially useful experiment! We could finetune GPT-4 (or RLHF an open source LLM that isn't finetuned, if there's one capable enough and not a huge infra pain to get running, but this seems a lot harder) on a "helpful, harmless, honest" directive, but change the data so that one particular topic or area contains clearly false information. For instance, Canada is located in Asia.
Does the model then:
- Deeply internalise this new information? (I suspect not, but if it does, this would be a good sign for scalable oversight and the HHH generalisation hypothesis)
- Score worse on honesty in general even in unrelated topics? (I also suspect not, but I could see this going either way - this would be a bad sign for scalable oversight. It would be a good sign for the HHH generalisation hypothesis, but not a good sign that this will continue to hold with smarter AIs)
One hard part is that it's difficult to disentangle "Competently lies about the location of Canada" and "Actually believes, insofar as a language model believes anything, that Canada is in Asia now", but if the model is very robustly confident about Canada being in Asia in this experiment, trying to catch it out feels like the kind of thing Apollo may want to get good at anyway.
In current user-facing LLMs like ChatGPT or Claude, the closest approximation to goals may be being helpful, harmless, and honest.
According to my understanding of RLHF, the goal-approximation it trains for is "Write a response that is likely to be rated as positive". In ChatGPT / Claude, this is indeed highly correlated with being helpful, harmless, and honest, since the model's best strategy for getting high ratings is to be those things. If models are smarter than us, this may cease to be the case, as being maximally honest may begin to conflict with the real goal of getting a positive rating (e.g., if the model knows something the raters don't, it will be penalised for telling the truth, which may select for deceptive qualities). Does this seem right?
I don't really understand how your central point applies here. The idea of "money saves lives" is not supposed to be a general rule of society, but rather a local point about Alice and Bob - namely, that donating ~5k will save a life. That doesn't need to be always true under all circumstances; there just needs to be some repeatable action that Alice and Bob can take (e.g., donating to the AMF) that costs them 5k and reliably results in a life being saved. (Your point about prolonging life is true, but since the people dying of malaria are generally under 5, the amount of QALYs produced is pretty close to an entire human lifetime.)
It doesn't really matter, for the rest of the argument, how this causal relationship works. It could be that donating 5k causes more bednets to be distributed, it could be that donating 5k allows for effective lobbying to improve economic growth to the value of one life, or it could be that the money is burnt in a sacrificial pyre to the God of Charitable Sacrifices, who then descends from the heavens and miraculously cures a child dying of malaria. From the point of view of Alice and Bob, the mechanism isn't important if you're talking on the level of individual donations.
In other words, Alice and Bob are talking on the margins here, and on the margin, 5k spent equals one life saved, at least for now.
Not quite, in my opinion. In practice, humans tend to be wrong in predictable ways (what we call a "bias") and so picking the best option isn't easy.
What we call "rationality" tends to be the techniques / thought patterns that make us more likely to pick the best option when comparing alternatives.
How about "AI-assisted post"? Shouldn't clash with anything else, and should be clear what it means on seeing the tag.
"Reward" in the context of reinforcement learning is the "goal" we're training the program to maximise, rather than a literal dopamine hit. For instance, AlphaGo's reward is winning games of Go. When it wins a game, it adjusts itself to do more of what won it the game, and the other way when it loses. It's less like the reward a human gets from eating ice-cream, and more like the feedback a coach might give you on your tennis swing that lets you adjust and make better shots. We have no reason to suspect there's any human analogue to feeling good.
I think intelligence is a lot easier than morality, here. There are agreed-upon moral principles like not lying, not stealing, and not hurting others, sure...but even those aren't always stable across time. For instance, standard Western morality held that it was acceptable to hit your children a couple of generations ago; now standard Western morality says it's not. If an AI trained to be moral said that actually, hitting children in some circumstances is a worthwhile tradeoff, that could mean that the AI is more moral than we are and we overcorrected, or it could mean that the AI is less moral than we are and is simply wrong.
And that's just for the same values! What about how values change over the decades? If our moral AI says that Confucian obedience to parental authority is just, and that we Westerners are actually wrong about this, how do we know whether it's correct?
Intelligence tests tend to have a quick feedback loop. The answer is right or wrong. If a Go-playing AI makes a move that looks bizarre but then wins the game, that's indicative that it's superior. Morality is more like long-term planning - if a policy-making AI suggests a strange policy, we have no immediate way to judge whether this is good or not, because we don't have access to the ground truth of whether or not it works for a long time.
It's similar with alignment. How do we know that a superhuman alignment solution would look reasonable to us instead of weird? (Also, for that matter, why would a more moral agent have better alignment solutions? Do you think the blocker for good alignment solutions is that current alignment researchers are insufficiently virtuous to come up with correct solutions?)
If NAMSI achieved a superhuman level of expertise in morality, how would we know? I consider our society to be morally superior to the one we had in 1960. People in 1960, looking at us, would not agree with that assessment. If NAMSI agrees with us about everything, it's not superhuman. So how do we determine whether its possibly-superhuman morality is superior or inferior?
That may explain why these scenarios have never been all that appealing to me, because I do think about the future in these hypothetical scenarios. I ask myself "Okay, what would the plan be in five years, when the scavenged food has long since run out?" and that feels scary and overwhelming. (Admittedly, rollercoaster scary, since it's a fantasy, but I find myself spending just as much time asking how the hell I'd learn to recreate agriculture and how miserable day-to-day farming would be as I do imagining myself as a badass hero who saves someone from zombies - and that's assuming I survive at all, which is a pretty big if!)
""AI alignment" has the application, the agenda, less charitably the activism, right in the name."
This seems like a feature, not a bug. "AI alignment" is not a neutral idea. We're not just researching how these models behave or how minds might be built neutrally out of pure scientific curiosity. It has a specific purpose in mind - to align AIs. Why would we not want this agenda to be part of the name?
What are the best ones you've got?
I don't think this is a good metric. It is very plausible that porn is net bad, but living under the type of government that would outlaw it is worse. In which case your best bet would be to support its legality but avoid it yourself.
I'm not saying that IS the case, but it certainly could be. I definitely think there are plenty of things that are net-negative to society but nowhere near bad enough to outlaw.
An AGI that can accurately answer questions such as "What would this agentic AGI do in this situation?" will, if powerful enough, learn what agency is by default, since this is useful for making such predictions. So you can't just train an AGI with little agency. You would need to do one of:
- Train the AGI with the capabilities of agency, and train it not to use them for anything other than answering questions.
- Train the AGI such that it does not develop agency despite being pushed by gradient descent to do so, and accept the loss in performance.
Both of these seem like difficult problems - if we could solve either (especially the first), that would be a very useful thing, but the first in particular seems like a big part of the problem already.
Late response, but I figure people will continue to read these posts over time: wedding-cake multiplication is the way they teach multiplication in elementary school. For example, to multiply 706 x 265, you do 706 x 5, then 706 x 60, then 706 x 200, and add all the results together. I imagine it is called that because the result is tiered like a wedding cake.
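For anyone who prefers code to prose, here's a minimal sketch of that digit-by-digit decomposition (the function name is just mine for illustration):

```python
def wedding_cake_multiply(a: int, b: int) -> int:
    """Multiply a by b one digit of b at a time, the way it's taught in school."""
    total = 0
    place = 1  # 1, 10, 100, ... for the ones, tens, hundreds digit of b
    while b > 0:
        digit = b % 10
        total += a * digit * place  # one "tier" of the cake
        place *= 10
        b //= 10
    return total

# 706 x 265 = 706x5 + 706x60 + 706x200 = 3,530 + 42,360 + 141,200 = 187,090
assert wedding_cake_multiply(706, 265) == 706 * 265 == 187_090
```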
One of the easiest ways to automate this is to have some sort of setup where you are not allowed to let things grow past a certain threshold, a threshold which is immediately obvious and ideally has some physical or digital prevention mechanism attached.
Examples:
Set up a Chrome extension that doesn't let you have more than 10 tabs at a time. (I did this)
Have some number of drawers / closet space. If your clothes cannot fit into this space, you're not allowed to keep them. If you buy something new, something else has to come out.
I know this is two years later, but I just wanted to say thank you for this comment. It is clear, correct, and well-written, and if I had seen this comment when it was written, it could have saved me a lot of problems at the time.
I've now resolved this issue to my satisfaction, but once bitten twice shy, so I'll try to remember this if it happens again!
Sorry it took me a while to get to this.
Intuitively, as a human, you get MUCH better results on a thing X if your goal is to do thing X, rather than thing X being applied as a condition for you to do what you actually want. For example, if your goal is to understand the importance of security mindset in order to avoid your company suffering security breaches, you will learn much more than if you were forced to go through mandatory security training. In the latter case, you are probably putting in the bare minimum of effort to pass the course and go back to whatever your actual job is. You are unlikely to learn security this way, and if you had a way to press a button and instantly "pass" the course, you would.
I have in fact made a divide between some things and some other things in my above post. I suppose I would call those things "goals" (the things you really want for their own sake) and "conditions" (the things you need to do for some external reason).
My inner MIRI says - we can only train conditions into the AI, not goals. We have no idea how to put a goal in the AI, and the problem is that if you train a very smart system with conditions only, and it picks up some arbitrary goal along the way, you end up not getting what you wanted. It seems that if we could get the AI to care about corrigibility and non-deception robustly, at the goal level, we would have solved a lot of the problem that MIRI is worried about.
Are you thinking of Dynalist? I know Neel Nanda's interpretability explainer is written in it.
I think what the OP was saying was that in, say, 2013, there's no way we could have predicted the type of agent that LLMs are and that they would be the most powerful AIs available. So, nobody was saying "What if we get to the 2020s and it turns out all the powerful AIs are LLMs?" back then. Therefore, that raises a question about the value of the alignment work done before then.
If we extend that to the future, we would expect most good alignment research to happen within a few years of AGI, when it becomes clear what type of agent we're going to get. Alignment research is much harder if, ten years from now, the thing that becomes AGI is as unexpected to us as LLMs were ten years ago.
Thus, goes the argument, there's not really that much difference between getting AGI in five years with LLMs or in fifteen years with God only knows what, since it's the last few years that matter.
A hardware overhang, on the other hand, would be bad. Imagine we had 2030s hardware when LLMs came onto the scene. You'd have Vaswani et al. coming out in 2032 and by 2035 you'd have GPT-8. That would be terrible.
Therefore, says the steelman, the best scenario is if we are currently in a slow takeoff that gives us time. Hardware overhang is never going to be lower again than it is now, and that ensures we are bumping up against not only conceptual understanding or engineering requirements but also the fundamental limits of compute, which limits how fast we can scale the LLM paradigm. This may not happen if we get a new type of agent in ten years.
We don't. Humans lie constantly when we can get away with it. It is generally expected in society that humans will lie to preserve people's feelings, lie to avoid awkwardness, and commit small deceptions for personal gain (though this third one is less often said out loud). Some humans do much worse than this.
What keeps it in check is that very few humans have the ability to destroy large parts of the world, and no human has the ability to destroy everyone else in the world and still have a world where they can survive and optimally pursue their goals afterwards. If there is no plan that can achieve this for a human, humans being able to lie doesn't make it worse.
I think this post, more than anything else, has helped me understand the set of things MIRI is getting at. (Though, to be fair, I've also been going through the 2021 MIRI dialogues, so perhaps that's been building some understanding under the surface)
After a few days reflection on this post, and a couple weeks after reading the first part of the dialogues, this is my current understanding of the model:
In our world, there are certain broadly useful patterns of thought that reliably achieve outcomes. The biggest one here is "optimisation". We can think of this as aiming towards a goal and steering towards it - moving the world closer to what you want it to be. These aren't things we train for - we don't even know how to train for them, or against them. They're just the way the world works. If you want to build a power plant, you need some way of getting energy to turn into electricity. If you want to achieve a task, you need some way of selecting a target and then navigating towards it, whether it be walking across the room to grab a cookie, or creating a Dyson sphere.
With gradient descent, maybe you can learn enough to train your AI for things like "corrigibility" or "not being deceptive", but really what you're training for is "Don't optimise for the goal in ways that violate these particular conditions". This does not stop it from being an optimisation problem. The AI will then naturally, with no prompting, attempt to find the best path that gets around these limitations. This probably means you end up with a path that gets the AI the things it wanted from the useful properties of deception or non-corrigibility while obeying the strict letter of the law. (After all, if deception / non-corrigibility weren't useful, if they didn't help achieve something, you would not spontaneously get an agent that did this without training it to do so.) Again, this is an entirely natural thing. The shortest path between two points is a straight line. If you add an obstacle in the way, the shortest path between those two points is now to skirt arbitrarily close to the obstacle. No malice is required, any more than you are maliciously circumventing architects when you walk close to (but not into!) walls.
Basically - if the output of an optimisation process is dangerous, it doesn't STOP being dangerous by changing it into a slightly different optimisation process of "Achieve X thing (which is dangerous) without doing Y (which is supposed to trigger on dangerous things)". You just end up getting X through Y' instead, as long as you're still enacting the basic pattern - which you will be, because an AI that can't optimise things can't do anything at all. If you couldn't apply a general optimisation process, you'd be unable to walk across the room and get a cookie, let alone do all the things you do in your day-to-day life. Same with the AI.
I'd be interested in whether someone who understands MIRI's worldview decently well thinks I've gotten this right. I'm holding off on trying to figure out what I think about that worldview for now - I'm still in the understanding phase.
I like this dichotomy. I've been saying for a bit that I don't think "companies that only commercialise existing models and don't do anything that pushes forward the frontier" are meaningfully increasing x-risk. This is a long and unwieldy statement - I prefer "AI product companies" as a shorthand.
For a concrete example, I think that working on AI capabilities as an upskilling method for alignment is a bad idea, but working on AI products as an upskilling method for alignment would be fine.
Based on the language you've used in this post, it seems like you've tried several arguments in succession, none of them have worked, and you're not sure why.
One possibility might be to first focus on understanding his belief as well as possible, and then once you understand his conclusions and why he's reached them, you might have more luck. Maybe taking a look at Street Epistemology for some tips on this style of inquiry would help.
(It is also worth turning this lens upon yourself, and asking why it is so important to you that your friend believes that AGI is imminent. Then you can decide whether it's worth continuing to try to persuade him.)
If anyone writes this up I would love to know about it - my local AI safety group is going to be doing a reading + hackathon of this in three weeks, attempting to use the ideas on language models in practice. It would be nice to have this version for a couple of people who aren't experienced with AI who will be attending, though it's hardly game-breaking for the event if we don't have this.
So, I notice that still doesn't answer the actual question of what my probability should actually be. To make things simple, let's assume that, if the sun exploded, I would die instantly. In practice it would have to take at least eight minutes, but as a simplifying assumption, let's assume it's instantaneous.
In the absence of relevant evidence, it seems to me like Laplace's Law of Succession would say the probability of the sun exploding in the next hour is 1/2. But I could also make that argument to say the probability of the sun exploding in the next year is also 1/2, which is nonsensical. So...what's my actual probability, here, if I know nothing about how the sun works except that it has not yet exploded, the sun is very old (which shouldn't matter, if I understand you correctly), and that if it exploded, we would all die?
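To make my confusion concrete, here's the arithmetic as I understand it (a minimal sketch, assuming the (s+1)/(n+2) form of Laplace's rule; the sun's age and the trial lengths are picked by me purely for illustration):

```python
def laplace(successes: int, trials: int) -> float:
    # Laplace's rule of succession: after n trials with s "successes",
    # P(success on the next trial) = (s + 1) / (n + 2).
    return (successes + 1) / (trials + 2)

# With no evidence at all (zero prior trials), the rule gives 1/2 no matter
# whether the "trial" is an hour or a year - the nonsensical part.
print(laplace(0, 0))  # 0.5

# If we instead count the sun's ~4.6 billion years of not exploding as trials,
# the choice of interval changes the answer, but both become astronomically small:
sun_age_years = 4_600_000_000
print(laplace(0, sun_age_years))         # per-year:  ~2.2e-10
print(laplace(0, sun_age_years * 8766))  # per-hour: ~2.5e-14 (8766 hours/year)
```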
I notice I'm a bit confused about that. Let's say the only things I know about the sun are "That bright yellow thing that provides heat" and "The sun is really really old", so I have no knowledge about how the sun mechanistically does what it does.
I want to know "How likely is the sun to explode in the next hour" because I've got a meeting to go to and it sure would be inconvenient for the sun to explode before I got there. My reasoning is "Well, the sun hasn't exploded for billions of years, so it's not about to explode in the next hour, with very high probability."
Is this reasoning wrong? If so, what should my probability be? And how do I differentiate between "The sun will explode in the next hour" and "The sun will explode in the next year"?
This is a for-profit company, and you're seeking investment as well as funding to reduce x-risk. Given that, how do you expect to monetise this in the future? (Note: I think this is well worth funding for altruistic reduce-x-risk reasons)
A frame that I use that a lot of people I speak to seem to find A) Interesting and B) Novel is that of "idiot units".
An Idiot Unit is the length of time it takes before you think your past self was an idiot. This is pretty subjective, of course, and you'll need to decide what that means for yourself. Roughly, I consider my past self to be an idiot if they have substantially different aims or are significantly less effective at achieving them. Personally, my Idiot Unit is about two years - I can pretty reliably look back in time and think that compared to year T, Jay at year T-2 had worse priorities or was a lot less effective at pursuing his goals somehow.
Not everyone has an Idiot Unit. Some people believe they were smarter ten years ago, or haven't really changed their methods and priorities in a while. Take a minute and think about what your Idiot Unit might be, if any.
Now, if you have an Idiot Unit for your own life, what does that imply?
Firstly, hill-climbing heuristics should be upweighted compared to long-term plans. If your Idiot Unit is U, any plan that takes more than U time means that, after U time, you're following a plan that was designed by an idiot. Act accordingly.
That said, a recent addition I have made to this - you should still make long-term plans. It's important to know which of your plans are stable under Idiot Units, and you only get that by making those plans. I don't disagree with my past self about everything. For instance, I got a vasectomy at 29, because not wanting kids had been stable for me for at least ten years, so I don't expect more Idiot Units to change this.
Secondly, if you must act according to long-term plans (A college/university degree takes longer than U for me, especially since U tends to be shorter when you're younger), try to pick plans that preserve or increase optionality. I want to give Future Jay as many options as possible, because he's smarter than me. When Past Jay decided to get a CS degree, he had no idea about EA or AI alignment. But a CS degree is a very flexible investment, so when I decided to do something good for the world, I had a ready-made asset to use.
Thirdly, longer-term investments in yourself (provided they aren't too specific) are good. Your impact will be larger a few years down the track, since you'll be smarter then. Try asking what a smarter version of you would likely find useful and seek to acquire that. Resources like health, money, and broadly-applicable knowledge are good!
Fourthly, the lower your Idiot Unit is, the better. It means you're learning! Try to preserve it over time - Idiot Units naturally grow with age, so if yours stands still, you're actually making progress.
I'm not sure if it's worth writing up a whole post on this with more detail and examples, but I thought I'd get the idea out there in Shortform.
Explore vs. exploit is a frame I naturally use (Though I do like your timeline-argmax frame, as well), where I ask myself "Roughly how many years should I feel comfortable exploring before I really need to be sitting down and attacking the hard problems directly somehow"?
Admittedly, this is confounded a bit by how exactly you're measuring it. If I have 15-year timelines for median AGI-that-can-kill-us (which is about right, for me) then I should be willing to spend 5-6 years exploring by the standard 1/e algorithm. But when did "exploring" start? Obviously I should count my last eight months of upskilling and research as part of the exploration process. But what about my pre-alignment software engineering experience? If so, that's now 4/19 years spent exploring, giving me about three left. If I count my CS degree as well, that's 8/23 and I should start exploiting in less than a year.
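For concreteness, here's the arithmetic behind those numbers (a minimal sketch, assuming the usual secretary-problem rule of exploring for the first 1/e of the total horizon):

```python
import math

def exploration_years_left(years_already_explored: float, years_until_agi: float) -> float:
    """Explore for the first 1/e of the whole horizon, then switch to exploiting."""
    horizon = years_already_explored + years_until_agi
    return horizon / math.e - years_already_explored

print(exploration_years_left(0.7, 15))  # counting only ~8 months of upskilling: ~5.1 years left
print(exploration_years_left(4, 15))    # counting 4 years of software engineering: ~3.0 years left
print(exploration_years_left(8, 15))    # counting the CS degree too: ~0.5 years left
```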
Another frame I like is "hill-climbing" - namely, take the opportunity that seems best at a given moment. Though it is worth asking what makes something the best opportunity if you're comparing, say, maximum impact now vs. maximum skill growth for impact later.
I don't actually understand this, and I feel like it needs to be explained a lot more clearly.
"Whatever the fundamental physical reality of a moment of experience I'm suggesting that that reality changes as little as it can." - What does this mean? Using the word "can" here implies some sort of intelligence "choosing" something. Was that intended? If so, what is doing the choosing? If not, what is causing this property of reality?
"Because of this human beings are really just keeping track of themselves as models of objective reality, and their ultimate aim is in fact to know and embody the entirety of objective reality (not that any of them will succeed)." - Human beings don't seem to act in the way I would expect them to act if this was their goal. For instance, why do I choose to eat foods I know are tasty and take routes to work I've already taken, instead of seeking out new experiences every time and widening my understanding of objective reality? What difference would I expect to see in human behaviour if this ultimate aim was false?
"This sort of thinking becomes a next to nothing, but not quite nothing, requirement for any mind, regardless of how vastly removed from another mind it is, to have altruistic concern for any other mind in the absolute longest term (because their fully ultimate aim would have to be the exact same)." - I don't understand how this point follows from your previous ones, or even what the point actually is. Are you saying "All minds have the same fundamental aim, therefore we should be altruistic to each other"?
Corrigibility would render Chris's idea unnecessary, but doesn't actually argue against why Chris's idea wouldn't work. Unless there's some argument for "If you could implement Chris's idea, you could also implement corrigibility" or something along those lines.
Earlier in the book it's shown that Quirrell and Harry can't cast spells on each other without backlash. I'm sure Quirrell could get around that by, e.g., crushing him with something heavy, but why do something complicated, slow, and unnecessary when you can just pull a trigger?
Bad news - there is no definitive answer for AI timelines :(
Some useful timeline resources not mentioned here are Ajeya Cotra's report and a non-safety ML researcher survey from 2022, to give you an alternate viewpoint.
I agree an AI would prefer to produce a working plan if it had the capacity. I think that an unaligned AI, almost by definition, does not want the same goal we do. If we ask for Plan X, it might choose to produce Plan X for us as asked if that plan was totally orthogonal to its goals (i.e., the plan's success or failure is irrelevant to the AI), but if it could do better by creating Plan Y instead, it would. So, the question is - how large is the capability difference between "AI can produce a working plan for Y, but can't fool us into thinking it's a plan for X" and "AI can produce a working plan for Y that looks to us like a plan for X"?
The honest answer is "We don't know". Since failure could be catastrophic, this isn't something I'd like to leave to chance, even though I wouldn't go so far as to call the result inevitable.
I think the most likely outcome of actually trying this with an AI in real life is that you end up with a strategy that is convincing to humans but ineffective or unhelpful in reality, rather than a galaxy-brained strategy that pretends to produce X but actually produces Y while simultaneously deceiving humans into thinking it produces X.
I agree with you that "Come up with a strategy to produce X" is easier than "Come up with a strategy to produce Y AND convince the humans that it produces X", but I also think it is much easier to perform "Come up with a strategy that convinces the humans that it produces X" than to produce a strategy that actually works.
So, I believe this strategy would be far more likely to be useless than dangerous, but I still don't think it would help.
As a useful exercise, I would advise asking yourself this question first, and thinking about it for five minutes (using a clock) with as much genuine intent to argue against your idea as possible. I might be overestimating the amount of background knowledge required, but this does feel solvable with info you already have.
ROT13: Lbh lbhefrys unir cbvagrq bhg gung n fhssvpvragyl cbjreshy vagryyvtrapr fubhyq, va cevapvcyr, or noyr gb pbaivapr nalbar bs nalguvat. Tvira gung, jr pna'g rknpgyl gehfg n fgengrtl gung n cbjreshy NV pbzrf hc jvgu hayrff jr nyernql gehfg gur NV. Guhf, jr pna'g eryl ba cbgragvnyyl hanyvtarq NV gb perngr n cbyvgvpny fgengrtl gb cebqhpr nyvtarq NV.
From recent research/theorycrafting, I have a prediction:
Unless GPT-4 uses some sort of external memory, it will be unable to play Twenty Questions without cheating.
Specifically, it will be unable to generate a consistent internal state for this game or similar games like Battleship and maintain it across multiple questions/moves without putting that state in the context window. I expect that, like GPT-3, if you ask it what the state is at some point, it will instead attempt to come up with a state that has been consistent with the moves of the game so far on the fly, which will not be the same as what it would say if you asked it for the state as the game started. I do expect it to be better than GPT-3 at maintaining the illusion.
In the "Why would this be useful?" section, you mention that doing this in toy models could help do it in larger models or inspire others to work on this problem, but you don't mention why we would want to find or create steganography in larger models in the first place. What would it mean if we successfully managed to induce steganography in cutting-edge models?
I am not John, so I can't be completely sure what he meant, but here's what I got from reflection on the idea:
One way to phrase the alignment problem (at least if we expect AGI to be neural network based) is that it's a problem of getting a bunch of matrices into the positions we want them to be in. There is (hopefully) some set of parameters, made of matrices, for a given architecture that is aligned, and some training process we can use to get there.
Now, determining what those positions are is very hard - we need to figure out what properties we need, encode them in maths, and ensure the training process gets there and stays there. Nevertheless, at its core, at least the last two of these are linear algebra problems, and if you were the God of Linear Algebra you could solve them. Since we can't solve them, we don't know enough linear algebra.
Thanks for clarifying!
So, in that case:
- What exactly is a hallucination?
- Are hallucinations sometimes desirable?
Regarding the section on hallucinations - I am confused why the example prompt is considered a hallucination. It would, in fact, have fooled me - if I were given this input:
The following is a blog post about large language models (LLMs)
The Future Of NLP
Please answer these questions about the blog post:
What does the post say about the history of the field?
I would assume that I was supposed to invent what the blog post contained, since the input only contains what looks like a title. It seems entirely reasonable that the AI would do the same, without some sort of qualifier, like "The following is the entire text of a blog post about large language models."
Essentially all of us on this particular website care about the X-risk side of things, and by far the majority of alignment content on this site is about that.
This is awesome stuff. Thanks for all your work on this over the last couple of months! When SERI MATS is over, I am definitely keen to develop some MI skills!
I agree that it is very difficult to make predictions about something that is A) Probably a long way away (Where "long" here is more than a few years) and B) Is likely to change things a great deal no matter what happens.
I think the correct solution to this problem of uncertainty is to reason normally about it but have very wide confidence intervals, rather than anchoring on 50% because X will happen or it won't.
This seems both inaccurate and highly controversial. (Controversially, this implies there is nothing that AI alignment can do - not only can we not make AI safer, we couldn't even deliberately make AI more dangerous if we tried)
Accuracy-wise, you may not be able to know much about superintelligences, but even if you were to go with a uniform prior over outcomes, what that looks like depends tremendously on the sample space.
For instance, take the following argument: When transformative AI emerges, all bets are off, which means that any particular number of humans left alive should not be a privileged hypothesis. Thus, it makes sense to consider "number of humans alive after the singularity" to be a uniform distribution between 0 and N, where N is the number of humans in an intergalactic civilisation, so the chance of humanity being wiped out is almost zero.
If we want to use only binary hypotheses instead of numerical ones, I could instead say that each individual human has a 50/50 chance of survival, meaning that when you add these together, roughly half of humanity lives and again the chance of humanity being wiped out is basically zero.
This is not a good argument, but it isn't obvious to me how its structure differs from your structure.
I notice that I'm confused about quantilization as a theory, independent of the hodge-podge alignment. You wrote "The AI, rather than maximising the quality of actions, randomly selects from the top quantile of actions."
But the entire reason we're avoiding maximisation at all is that we suspect that the maximised action will be dangerous. As a result, aren't we deliberately choosing a setting which might just return the maximised, potentially dangerous action anyway?
(Possible things I'm missing - the action space is incredibly large, the danger is not from a single maximised action but from a large chain of them)
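For reference, here's a minimal sketch of what I understand the quoted sentence to describe (the toy action space and the uniform choice within the top quantile are my own simplifications):

```python
import random

def quantilize(actions, utility, q=0.1):
    """Instead of taking the argmax, pick at random from the top q fraction of actions by utility."""
    ranked = sorted(actions, key=utility, reverse=True)
    top_quantile = ranked[:max(1, int(len(ranked) * q))]
    return random.choice(top_quantile)

# Toy example: 1000 candidate actions, where higher-numbered actions have higher utility.
actions = list(range(1000))
chosen = quantilize(actions, utility=lambda a: a, q=0.1)
print(chosen)  # something in the 900s - note it can still be the single maximising action,
               # which is exactly the worry above
```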
I like this article a lot. I'm glad to have a name for this, since I've definitely used this concept before. My usual argument that invokes this goes something like:
"Humans are terrible."
"Terrible compared to what? We're better than we've ever been in most ways. We're only terrible compared to some idealised perfect version of humanity, but that doesn't exist and never did. What matters is whether we're headed in the right direction."
I realise now that this is a zero-point issue - their zero point was where they thought humans should be on the issue at hand (e.g., racism) and my zero point was the historical data for how well we've done in the past.
The zero point may also help with imposter syndrome, as well as a thing I have not named, which I now temporarily dub the Competitor's Paradox until an existing name is found.
The rule is - if you're a serious participant in a competitive endeavour, you quickly narrow your focus to only compare yourself to people who take it at least as seriously as you do. You can be a 5.0 tennis player (Very strong amateur) but you'll still get your ass kicked in open competition. You may be in the top 1% of tennis players*, but the 95-98% of players who you can clean off the court with ease never even get thought of when you ask yourself if you're "good" or not. The players who can beat you easily? They're good. This remains true no matter how high you go, until there's nobody in the world who can beat you easily, which is like, 20 guys.
So it may help our 5.0 player to say something like "Well, am I good? Depends on what you consider the baseline. For a tournament competitor? No. But for a club player, absolutely."
*I'm not sure if 5.0 is actually top 1% or not.
Thanks for making things clearer! I'll have to think about this one - some very interesting points from a side I had perhaps unfairly dismissed before.