I think 3 is close to an analogy where the LLM is using 'sacrifice' in a different way than we'd endorse on reflection, but why should it waste time rationalizing to itself in the CoT? All LLMs involved should just be fine-tuned to not worry about it - they're collaborators not adversaries - and so collapse to the continuum between cases 1 and 2, where noncentral sorts of sacrifices get ignored first, gradually until at the limit of RL all sacrifices are ignored.
Another thing to think about analogizing is how an AI doing difficult things in the real world is going to need to operate in domains that are automatically noncentral to human concepts. It's like if what we really cared about was whether the AI would sacrifice in 20-move-long sequences that we can't agree on a criterion to evaluate. You could try a sandwiching experiment where you sandwich both the pretraining corpus and the RL environment.
Perhaps I should have said that it's silly to ask whether "being like A" or "being like B" is the goal of the game.
I have a different concern than most people.
An AI that follows the English text rather than the reward signal in the chess example cannot on that basis be trusted to do good things when given English text plus human approval reward in the real world. This is because following the text in simple well-defined environments is solving a different problem than following values-laden text in the real world.
The problem that you need to solve for alignment in the real world is how to interpret the world in terms of preferences when even we humans disagree and are internally inconsistent, and how to generalize those preferences to new situations in a way that does a good job satisfying those same human preferences (which include metapreferences). Absent a case for why this is happening, a proposed AI design doesn't strike me as dealing with alignment.
When interpreting human values goes wrong, the AI's internal monologue does not have to sound malevolent or deceptive. It's not thinking "How can I deceive the humans into giving me more power so I can make more paperclips?" It might be thinking "How can I explain to the humans that my proposal is what will help them the most?" Perhaps, if you can find its internal monologue describing its opinion of human preferences in detail, they might sound wrong to you (Wait a minute, I don't want to be a wirehead god sitting on a lotus throne!), or maybe it's doing this generalization implicitly, and just understands the output of the paraphraser slightly differently than you would.
Yeah, this makes sense.
You could also imagine more toy-model games with mixed ecological equilibria.
E.g. suppose there's some game where you can reproduce by getting resources, and you get resources by playing certain strategies, and it turns out there's an equilibrium where there's 90% strategy A in the ecosystem (by some arbitrary accounting) and 10% strategy B. It's kind of silly to ask whether it's A or B that's winning based on this.
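(For concreteness, here's a made-up instantiation - a hawk-dove-style game, which isn't anything from the discussion above, just the simplest game I know of with this kind of mixed ecological equilibrium:)

```python
# Made-up hawk-dove-style game, chosen so the mixed ecological equilibrium is
# 90% strategy A ("hawk") and 10% strategy B ("dove").
V, C = 9.0, 10.0  # resource value and cost of fighting; equilibrium share of A is V/C = 0.9

def replicator_step(p, dt=0.01):
    """p = fraction of the population playing A; standard replicator dynamics."""
    fitness_A = p * (V - C) / 2 + (1 - p) * V  # A vs A: fight and split; A vs B: take it all
    fitness_B = (1 - p) * V / 2                # B vs A: back down; B vs B: share
    mean_fitness = p * fitness_A + (1 - p) * fitness_B
    return p + dt * p * (fitness_A - mean_fitness)

p = 0.5
for _ in range(20000):
    p = replicator_step(p)
print(round(p, 3))  # -> 0.9; at equilibrium it's silly to say either A or B is "winning"
```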
Although now that I've put it like that, it does seem fair to say that A is 'winning' if we're not at equilibrium, and A's total resources (by some accounting...) are increasing over time.
Now to complicate things again, what if A is increasing in resource usage but simultaneously mutating to be played by fewer actual individuals (the trees versus Pelagibacter, perhaps)? Well, in the toy model setting it's pretty tempting to say the question is wrong, because if the strategy is changing it's not A anymore at all, and A has been totally wiped out by the new strategy A'.
Actually I guess I endorse this response in the real world too, where if a species is materially changing to exploit a new niche, it seems wrong to say "oh, that old species that's totally dead now sure was a winner." If the old species had particular genes with a satisfying story for making it more adaptable than its competitors, perhaps better to take a gene's-eye view and say those genes won. If not, just call it all a wash.
Anyhow, on humans: I think we're 'winners' just in the sense that the human strategy seems better than our population 200ky ago would have reflected, leading to a population and resource use boom. As you say, we don't need to be comparing ourselves to phytoplankton, the game is nonzero-sum.
The bet is indeed on. See you back here in 2029 :)
Sadly for your friend, the hottest objects in the known universe are still astronomical rather than manmade. The LHC runs on the scale of 10 TeV (10^13 eV). The Auger observatory studies particles that start at 10^18 eV and go up from there.
Ok, I'm agreeing in principle to make the same bet as with RatsWrongAboutUAP.
("I commit to paying up if I agree there's a >0.4 probability something non-mundane happened in a UFO/UAP case, or if there's overwhelming consensus to that effect and my probability is >0.1.")
I think you can do some steelmanning of the anti-flippers with something like Lara Buchak's arguments on risk and rationality. Then you'd be replacing the vague "the utility maximizing policy seems bad" argument with a more concrete "I want to do population ethics over the multiverse" argument.
I did a podcast discussion with Undark a month or two ago - a conversation with Arvind Narayanan from AI Snake Oil. https://undark.org/2024/11/11/podcast-will-artificial-intelligence-kill-us-all/
Well, that went quite well. Um, I think two main differences I'd like to see are, first, a shift in attention from 'AGI when' to more specific benchmarks/capabilities. Like, ability to replace 90% of the work of an AI researcher (can you say SWE-bench saturated? Maybe in conversation with Arvind only) when?
And then the second is to try to explicitly connect those benchmarks/capabilities directly to danger - like, make the ol' King Midas analogy maybe? Or maybe just that high capabilities -> instability and risk inherently?
I agree with many of these criticisms about hype, but I think this rhetorical question should be non-rhetorically answered.
No, that’s not how RL works. RL - in settings like REINFORCE for simplicity - provides a per-datapoint learning rate modifier. How does a per-datapoint learning rate multiplier inherently “incentivize” the trained artifact to try to maximize the per-datapoint learning rate multiplier? By rephrasing the question, we arrive at different conclusions, indicating that leading terminology like “reward” and “incentivized” led us astray.
How does a per-datapoint learning rate modifier inherently incentivize the trained artifact to try to maximize the per-datapoint learning rate multiplier?
For readers familiar with Markov chain Monte Carlo, you can probably fill in the blanks now that I've primed you.
For those who want to read on: if you have an energy landscape and you want to find a global minimum, a great way to do it is to start at some initial guess and then wander around, going uphill sometimes and downhill sometimes, but with some kind of bias towards going downhill. See the AlphaPhoenix video for a nice example. This works even better than going straight downhill because you don't want to get stuck in local minima.
The typical algorithm for this is you sample a step and then always take it if it's going downhill, but only take it with some probability if it leads uphill (with smaller probability the more uphill it is). But another algorithm that's very similar is to just take smaller steps when going uphill than when going downhill.
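Here's a minimal sketch of both variants on a made-up 1-D energy landscape (purely illustrative - the energy function and step rules are invented, nothing here is a real MCMC library):

```python
import math, random

def energy(x):
    # Made-up 1-D landscape with several local minima, purely for illustration
    return x**2 + 3 * math.sin(3 * x)

def metropolis_step(x, T=1.0, width=0.5):
    """Classic rule: always accept downhill moves; accept uphill moves with prob exp(-dE/T)."""
    proposal = x + random.uniform(-width, width)
    dE = energy(proposal) - energy(x)
    if dE <= 0 or random.random() < math.exp(-dE / T):
        return proposal
    return x

def shrunken_step(x, width=0.5):
    """The 'very similar' variant: take every sampled step, but shrink it when it goes uphill."""
    direction = random.uniform(-width, width)
    dE = energy(x + direction) - energy(x)
    scale = 1.0 if dE <= 0 else 1.0 / (1.0 + dE)  # smaller step the more uphill it is
    return x + scale * direction

x = 5.0
for _ in range(10000):
    x = metropolis_step(x)  # or shrunken_step(x); both mostly hang around low-energy regions
print(round(x, 2), round(energy(x), 2))
```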
If you were never told about the energy landscape, but you are told about a pattern of larger and smaller steps you're supposed to take based on stochastically sampled directions, then an interesting question is: when can you infer an energy function that's implicitly getting optimized for?
Obviously, if the sampling is uniform and the step size when going uphill looks like it could be generated by taking the reciprocal of the derivative of an energy function, you should start getting suspicious. But what if the sampling is nonuniform? What if there's no cap on step size? What if the step size rule has cycles or other bad behavior? Can you still model what's going on as a Markov chain Monte Carlo procedure plus some extra stuff?
I don't know, these seem like interesting questions in learning theory. If you search for questions like "under what conditions does the REINFORCE algorithm find a global optimum," you find papers like this one that don't talk about MCMC, so maybe I've lost the plot.
But anyhow, this seems like the shape of the answer. If you pick random steps to take but take bigger steps according to some rule, then that rule might be telling you about an underlying energy landscape you're doing a Markov chain Monte Carlo walk around.
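And to tie it back to the quoted question, here's roughly what "reward as a per-datapoint learning rate multiplier" looks like in vanilla REINFORCE on a toy two-armed bandit (illustrative numbers and reward scheme only, not anyone's actual training loop):

```python
import math, random

theta = 0.0  # single logit: probability of picking arm 1 is sigmoid(theta)

def sample(theta):
    p1 = 1.0 / (1.0 + math.exp(-theta))
    return (1 if random.random() < p1 else 0), p1

for _ in range(5000):
    action, p1 = sample(theta)
    reward = 1.0 if action == 1 else 0.2              # made-up reward scheme
    grad_log_prob = (1 - p1) if action == 1 else -p1  # d/d(theta) of log pi(action)
    lr = 0.05
    # The whole "incentive": the update is lr * reward * grad_log_prob, i.e. the reward
    # just rescales how big a step we take toward making the sampled behavior more likely.
    theta += lr * reward * grad_log_prob

print(round(theta, 2))  # drifts positive, i.e. toward the higher-reward arm
```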
Is separate evaluation + context-insensitive aggregation actually how helpful + harmless is implemented in a reward model for any major LLM? I think Anthropic uses finetuning on a mixture of specialized training sets (plus other supervision that's more holistic) which is sort of like this but allows the model to generalize in a way that compresses the data, not just a way that keeps the same helpful/harmless tradeoff.
Anyhow, of course we'd like to use the "beneficial for humanity" goal, but sadly we don't have access to it at the moment :D Working on it.
Well, it could be like that. Seems additionally unlikely the field as a whole would be bottlenecked just because the decoder-only transformer architecture is, though :/
So a proportional vote interstate compact? :)
I like it - I think one could specify an automatic method for striking a fair bargain between states (and only include states that use that method in the bargain). Then you could have states join the compact asynchronously.
E.g. if the goal is to have the pre-campaign expected electors be the same, and Texas went 18/40 Biden in 2020 while California went 20/54 Trump in 2020, maybe in 2024 Texas assigns all its electors proportionally, while California assigns 49 electors proportionally and the remaining 5 by majority. That would cause the numbers to work out the same (plus or minus a rounding error).
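(Spelling out that arithmetic with the same hypothetical numbers:)

```python
# Texas went 18/40 Biden and California 20/54 Trump in this hypothetical.
expected_biden_from_texas = (18 / 40) * 40        # Texas: all 40 electors proportional -> 18.0
expected_trump_from_california = (20 / 54) * 49   # California: 49 proportional -> ~18.1
print(expected_biden_from_texas, round(expected_trump_from_california, 1))
```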
Suppose Connecticut also wants to join the compact, but it's also a blue state. I think the obvious thing to do is to distribute the expected minority electors proportional to total elector count - if Connecticut has 7 electors, it's responsible for balancing 7/61 of the 18 minority electors that are being traded, or just about exactly 2 of them.
But the rounding is sometimes awkward - if we lived in a universe where Connecticut had 9 electors instead, it would be responsible for just about exactly 2.5 minority electors, which is super awkward especially if a lot of small states join and start accumulating rounding errors.
What you could do instead is specify a loss function: you take the variance of the proportion of electors assigned proportionally among the states that are on the 'majority' side of the deal, multiply that by a constant (probably something small like 0.05, but obviously you do some simulations and pick something more informed), add the squared rounding error of expected minority electors, and that's your measure for how imperfect the assignment of proportional electors to states is. Then you just pick the assignment that's least imperfect.
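A sketch of what that loss could look like in code (the function name is a placeholder and the 0.05 constant is just the stand-in from above; the real constant would come from simulations):

```python
import statistics

def assignment_imperfection(prop_fractions, expected_minority, target_minority, k=0.05):
    """Sketch of the loss described above.

    prop_fractions: fraction of electors assigned proportionally for each majority-side state
    expected_minority: expected minority-party electors the assignment actually trades
    target_minority: the number of minority electors it was supposed to trade
    k: the small constant weighting the variance term (placeholder value)
    """
    variance_term = statistics.pvariance(prop_fractions) if len(prop_fractions) > 1 else 0.0
    rounding_term = (expected_minority - target_minority) ** 2
    return k * variance_term + rounding_term

# Then pick whichever candidate assignment scores lowest, e.g.:
# best = min(candidate_assignments, key=lambda a: assignment_imperfection(*summarize(a)))
```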
Add in some automated escape hatches in case of change of major parties, change of voting system, or being superseded by a more ambitious interstate compact, and bada bing.
Probably? But I'll feel bad if I don't try to talk you out of this first.
- It's true that alien sightings, videos of UFOs, etc. are slowly accumulating evidence for alien visitors, even if each item has reasonable mundane excuses (e.g. 'the mysterious shape on the infrared footage was probably just a distant plane or missile' or 'the eyewitness probably lost track of time but has doubled-down on being confident they didn't'). However, all the time that passes without yet-stronger evidence for aliens is evidence against alien visitors.
- You could imagine aliens landing in the middle of the Super Bowl, or sending us each a messenger drone, or the US government sending alien biological material to 50 different labs composed of hundreds of individual researchers, who hold seminars on what they're doing that you can watch on YouTube. Every year that nothing like this happens puts additional restrictions on alien-visitor hypotheses, which I think outweigh the slow trickle of evidence from very-hard-to-verify encounters. Relative to our informational state in the year 2000, alien visitors actually seem less likely to me.
- Imagine someone making a 5-year bet on whether we'd have publicly-replicable evidence of alien visitors, every 5 years since the 1947 Roswell news story. Like, really imagine losing this bet 15 times in a row. Even conditional on the aliens being out there, clearly there's a pretty good process keeping the truth from getting out, and you should be getting more and more confident that this process won't break down in the next 5 years.
- Even if we're in a simulation, there is no particular reason for the simulators to be behind UAP.
- Like, suppose you're running an ancestor simulation of Earth. Maybe you're a historian interested in researching our response to perturbations to the timeline that you think really could have happened, or maybe you're trying to recreate a specific person to pull out of the simulation, or you're self-inserting to have lots of great sex, or self-inserting along with several of your friends to play some sort of decades-long game. Probably you have much better things to do with this simulation than inserting some hard-to-verify floating orbs into the atmosphere.
- There is a 'UFO entertainment industry' that is creating an adverse information environment for us.
- E.g. Skinwalker Ranch is a place where some people have seen spooky stuff. But it's also a way to sell books and TV shows and get multi-million-dollar government grants, each of which involves quite a few people whose livelihood now depends not only on the spookiness of this one place, but more generally on how spooky claims are treated by the public and the US government.
- There's an analogy here to the NASA Artemis program, which involves a big web of contractors whose livelihoods previously depended on the space shuttle program. These contractors, and politicians working with them, and government managers who like being in charge of larger programs, all benefit from what we might call an "adverse information environment" regarding how valuable certain space programs are, how well they'll work, and how much they'll cost.
Thanks for the reply :)
So I agree it would be an advance, but you could solve inner alignment in the sense of avoiding mesaoptimizers, yet fail to solve inner alignment in the senses of predictable generalization or stability of generalization across in-lifetime learning.
This seems worryingly like "advance capabilities now, figure out the alignment later." The reverse would be to put the upfront work into alignment desiderata, and decide on a world model later.
Yes, I know every big AI project is also doing the former thing, and doing the latter thing is a much murkier and more philosophy-ridden problem.
As someone who took the bet last time, I feel like I have a little standing to ask you: Why?
Like, a year ago there were various things in the news that raised both the saliency of UFOs and the probability of weird explanations for them (conditional on certain models of the world). There was even a former intelligence officer testifying to congress that they had heard from a guy that the USG had alien bodies stashed away.
Since then we have heard exactly bupkis, as well predicted by the "He probably exaggerated, eyewitness testimony lol" hypothesis, and less well predicted by the "Aliens are visiting our planet a lot" hypothesis. More importantly, UFOs/UAP aren't being talked about nearly as much. So I'm surprised you're offering a bet now both from an evidential and statistical perspective.
Sure, there's as always a steady trickle of videos showing something mysterious moving around, but with a similarly steady trickle of mundane explanations. Is there something in particular that's changed your mind recently?
Well, it makes things better. But it doesn't assure humanity's success by any means. Basically I agree but will just redirect you back to my analogy about why the paper "How to solve nuclear reactor design" is strange.
I don't think Joe is proposing we find an AI design that is impossible to abuse even by malicious humans. The point so far seems to be making sure your own AI is not going to do some specific bad stuff.
If you solve the latter, you have not solved the former at all; if you solve the former, someone will solve the latter.
Insofar as this is true in your extended analogy, I think that's a reflection of "completely proliferation-proof reactor" being a bad thing to just assume you can solve.
Here's an admittedly-uncharitable analogy that might still illuminate what I think is lacking from this picture:
Suppose someone in 1947 wrote a paper called "How we could solve nuclear reactor design," and it focused entirely on remotely detecting nuclear bombs, modeling piles of uranium to predict when they might be nuclear bombs, making modifications to piles of uranium to make them less like nuclear bombs, ultrasound scanning piles of uranium to see if they have bomb-like structures hidden inside, etc.
Then in a subsection, they touch on how to elicit electricity from uranium, and they say that many different designs could work, just boil water and drive a turbine or something, people will work out the details later if they don't blow themselves up with a nuclear bomb. And that the nuclear reactor design solution they mean isn't about building some maximum efficiency product, it's about getting any electricity at all without blowing yourself up.
This paper seems strange because all the stuff people have to figure out later is most of the nuclear reactor design. What's the fuel composition and geometry? What's the moderator material and design? How is radiation managed? Pressure contained? Etc.
Or to leave the analogy, it seems like building safe, useful, and super-clever AI has positive content; if you start from the inverse of an AI that does the bad takeover behavior, you still have most of the work left to go. How to initialize the AI's world model and procedure for picking actions? What to get human feedback on, and how to get that feedback? How to update or modify the AI in response to human feedback? Maybe even how to maintain transparency to other human stakeholders, among other social problems? These decisions are the meat and potatoes of building aligned AI, and they have important consequences even if you've ruled out bad takeover behavior.
If you make a nuclear reactor that's not a bomb, you can still get radiation poisoning, or contaminate the groundwater, or build a big pile of uranium that someone else can use to make a bomb, or just have little impact and get completely eclipsed by the next reactor project.
Similarly, an AI that doesn't take over can still take actions optimized to achieve some goal humans wouldn't like, or have a flawed notion of what it is humans want from it (e.g. generalized sycophancy), or have negative social impacts, or have components that someone will use for more-dangerous AI, or just have low impact.
The more standard way to do this is with objects called distributions which you can loosely think of as "things you can integrate to get a function."
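(A standard example, assuming these are distributions in the generalized-function sense: the Dirac delta isn't an ordinary function, but integrating it gives you one, the Heaviside step: $\int_{-\infty}^{x} \delta(t)\,dt = H(x)$.)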
Thanks for the detailed post!
I personally would have liked to see some mention of the classic 'outer' alignment questions that are subproblems of robustness and ELK. E.g. What counts as 'generalizing correctly'? -> How do you learn how humans want the AI to generalize? -> How do you model humans as systems that have preferences about how to model them?
Nice, especially the second half.
To me, this post makes it seem like the natural place to push forward is to try and restrict incoherent policies over lotteries until they're just barely well-behaved - either on the behavior end (continuity etc) or the algorithmic end (compute budgets etc).
The Way Things Work was formative for me, and they just came out with a new edition. It's just simple, enjoyably-illustrated explanations of mechanics, simple machines, complex machines, electronics, etc.
Pointing UVC LEDs at your ceiling seems sketchy. White paint will likely scatter ~5% of UVC, and shiny metal surfaces will scatter more. If you try to go below 250nm for reduced reflection (and reduced penetration into human skin), (more) unwanted chemistry will start happening to the air.
I guess an important question is whether UVC is more harmful than UVB. If it's not any more harmful, then as long as nobody's getting sunburned from being in that room all day, it's probably fine - that 5% scattering is just another name for SPF 20. But if it is more harmful, then sunburn might not be an adequate signal for when it's bad for you.
I'm unsure what you're either expecting or looking for here.
There does seem to be a clear answer, though - just look at Bing chat and extrapolate. Absent "RL on ethics," present-day AI would be more chaotic, generate more bad experiences for users, increase user productivity less, get used far less, and be far less profitable for the developers.
Bad user experiences are a very straightforwardly bad outcome. Lower productivity is a slightly less local bad outcome. Less profit for the developers is an even-less local good outcome, though it's hard to tell how big a deal this will have been.
I'm still confused about the amnesia. My memory seems pretty good at recalling an episodic memory given very partial information - e.g. some childhood memory based on verbal cues, or sound, or smell. I can recall the sight of my childhood treehouse despite currently also being exposed to visual stimuli nothing like that treehouse.
On this intuition, it seems like DID amnesia should require more 'protections', changes to the memory recall process that impede recall more than normal.
Yes, Bob is right. Because the probability is not a property of the coin. It's 'about' the coin in a sense, but it also depends on Bob's knowledge, including knowledge about location in time (Dave) or possible worlds (Carol).
The question "What is the probability of Heads?" is about the coin, not about your location in time or possible worlds.
This is, I think, the key thing that those smart people disagree with you about.
Suppose Alice and Bob are sitting in different rooms. Alice flips a coin and looks at it - it's Heads. What is the probability that the coin is Tails? Obviously, it's 0% right? That's just a fact about the coin. So I go to Bob in the other room and ask Bob what's the probability the coin is Tails, and Bob tells me it's 50%, and I say "Wrong, you've failed to know a basic fact about the coin. Since it was already flipped the probability was already either 0% or 100%, and maybe if you didn't know which it was you should just say you can't assign a probability or something."
Now, suppose there are two universes that differ only by the polarization of a photon coming from a distant star, due to hit Earth in a few hours. And I go into the universe where that polarization is left-handed (rather than right-handed), and in that universe the probability that the photon is right-handed is 0% - it's just a fact about the photon. So I go to the copy of Carol that lives in this universe and ask Carol what's the probability the photon has right-handed polarization, and Carol tells me it's 50%, and I say "Wrong, you've failed to know a basic fact about the photon. Since it's already on its way the probability was already either 0% or 100%, and maybe if you don't know which it was you should just say you can't assign a probability or something."
Now, suppose there are two universes that differ outside of the room that Dave is currently in, but are the same within Dave's room. Say, in one universe all the stuff outside the room is arranged as it is today in our universe, while in the other universe all the stuff outside the room is arranged as it was ten years ago. And I go into the universe where all the stuff outside the room is arranged as it was ten years ago, which I will shorthand as it being 2014 (just a fact about calendars, memories, the positions of galaxies, etc.), and ask Dave what's the probability that the year outside is 2024, and Dave tells me it's 50%...
Do you have an example for domains where ground truth is unavailable, but humans can still make judgements about what processes are good to use?
Two very different ones are ethics and predicting the future.
Have you ever heard of a competition called the Ethics Bowl? They're a good source of questions with no truth values you could record in a dataset. E.g. "How should we adjudicate the competing needs of the various scientific communities in the case of the Roman lead?" Competing teams have to answer these questions and motivate their answers, but it's not like there's one right answer the judges are looking for and getting closer to that answer means a higher score.
Predicting the future seems like something you can just train based on past events (which have an accessible ground truth), but what if I want to predict the future of human society 10,000 years from now? Here there is indeed more of "a model about the world that implies a certain truth-finding method will be successful," which we humans will use to judge an AI trying to predict the far future - there's some comparison being made, but we're not comparing the future-predicting AI's process to the universal ontology because we can't access the universal ontology.
Positing a universal ontology doesn't seem to do much for us, since we can't access it. Somehow us humans still think we can judge things as true (or on a more meta level, we think we can judge that some processes are better at finding truth than others), despite being cut off from the universal ontology.
This seems hopeful. Rather than, as you say, trying to generalize straight from 'thought experiments where ground truth is available' to the full difficulties of the real world, we should try to understand intermediate steps where neither the human nor the AI know any ground truth for sure, but humans still can make judgments about what processes are good.
I'll give a somewhat aspirational answer:
A (local-definition) rationalist is someone who takes an interest in how the mind works and how it errs, and then takes seriously that this applies to themselves.
Amen. Untied weights are a weird hack. Problem is, they're a weird hack that, if you take it away, leaves you with a lot less sparsity in your SAEs on real problems.
Now, to some extent you might want to say "well then you should accept that your view of model representations was wrong rather than trying to squeeze them onto a Procrustean bed," but also, the features found using untied weights are mostly interpretable and useful.
So another option might be to say "Both tied weights and untied weights are actually the wrong inference procedure for sparse features, and we need to go back to Bayesian methods or something."
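For concreteness, a minimal sketch of the tied vs. untied setups (illustrative shapes and names only, not from any particular SAE codebase):

```python
# Minimal sparse-autoencoder forward pass, tied vs. untied decoder.
import numpy as np

d_model, d_feat = 512, 2048
rng = np.random.default_rng(0)

W_enc = rng.normal(scale=0.02, size=(d_model, d_feat))
b_enc = np.zeros(d_feat)
W_dec_untied = rng.normal(scale=0.02, size=(d_feat, d_model))  # extra free parameters: the "weird hack"
b_dec = np.zeros(d_model)

def encode(x):
    return np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU feature activations

def decode(f, tied):
    W_dec = W_enc.T if tied else W_dec_untied  # tied: decoder is literally the encoder transposed
    return f @ W_dec + b_dec

x = rng.normal(size=(4, d_model))
f = encode(x)
recon_tied, recon_untied = decode(f, True), decode(f, False)
# Training adds a reconstruction loss plus an L1 penalty on f; the empirical claim in this
# thread is that the untied version ends up with sparser (and still interpretable) features.
```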
Is this a fair summary (from a sort of reverse direction)?
We start with questions like "Can GR and QM be unified?" where we sort of think we know what we mean based on a half-baked, human understanding both of the world and of logic. If we were logically omniscient we could expound a variety of models that would cash out this human-concept-space question more precisely, and within those models we could do precise reasoning - but it's ambiguous how our real world half-baked understanding actually corresponds to any given precise model.
Control also makes AI more profitable, and more attractive to human tyrants, in worlds where control is useful. People want to know they can extract useful work from the AIs they build, and if problems with deceptiveness (or whatever control-focused people think the main problem is) are predictable, it will be more profitable, and lead to more powerful AI getting used, if there are control measures ready to hand.
This isn't a knock-down argument against anything, it's just pointing out that inherent dual use of safety research is pretty broad - I suspect it's less obvious for AI control simply because AI control hasn't been useful for safety yet.
If you don't just want the short answer of "probably LTFF" and want a deeper dive on options, Larks' review is good if (at this point) dated.
Suppose you're an AI company and you build a super-clever AI. What practical intellectual work are you uniquely suited to asking the AI to do, as the very first, most obvious thing? It's your work, of course.
Recursive self-improvement is going to happen not despite human control, but because of it. Humans are going to provide the needed resources and direction at every step. Expecting everyone to pause just when rushing ahead is easiest and most profitable, so we can get some safety work done, is not going to pan out barring nukes-on-datacenters level of intervention.
Faster can still be safer if compute overhang is so bad that we're willing to spend years of safety research to reduce it. But it's not.
I am confused about the purpose of the "awareness and memory" section, and maybe disagree with the intuition said to be obvious in the second subsection. Is there some deeper reason you want to bring up how we self-model memory / something you wanted to talk about there that I missed?
On the other hand, I can’t have two songs playing in my head simultaneously,
Tangent: I play at Irish sessions, and one of the things you have to do there is swap tunes. If you lead a transition you have to be imagining the next tune you're going to play at the same time as you're playing the current tune. In fact, often you have to decide on the next tune on the fly. This is a skill that takes some time to grok. You're probably conceptualizing the current and future tunes differently, but there's still a lot of overlap - you have to keep playing in sync with other people the entire time, while at the same time recalling and anticipating the future tune.
This is related to the question "Are human values in humans, or are they in models of humans?"
Suppose you're building an AI to learn human values and apply them to a novel situation.
The "human values are in humans", ethos is that the way humans compute values is the thing AI should learn, and maybe it can abstract away many kinds of noise, but it shouldn't be making any big algorithmic overhauls. It should just find the value-computation inside the human (probably with human feedback) and then apply it to the novel situation.
The "human values are in models of humans" take is that the AI can throw away a lot of information about the actual human brain, and instead should find good models (probably with human feedback) that have "values" as a component of a coarse-graining of human psychology, and then apply those "good" models to the novel situation.
Here are some different things you might have clustered as "alignment tax."
Thing 1: The difference between the difficulty of building the technologically-closest friendly transformative AI and the technologically-closest dangerous transformative AI.
Thing 2: The expected difference between the difficulty of building likely transformative AI conditional on it being friendly and the difficulty of building likely transformative AI whether friendly or dangerous.
Thing 3: The average amount that effort spent on alignment detracts from the broader capability or usefulness of AI.
Turns out it's possible to have negative Thing 3 but positive Things 1 and 2. This post seems to call such a state of affairs "optimistic," which is way too hasty.
An opposite-vibed way of framing the oblique angle between alignment and capabilities is as "unavoidable dual use research." See this long post about the subject.
Yeah, I'm not actually sure about the equilibrium either. I just noticed that not privileging any voters (i.e. the pure strategy of 1/3,1/3,1/3) got beaten by pandering, and by symmetry there's going to be at least a three-part mixed Nash equilibrium - if you play 1/6A 5/6B, I can beat that with 1/6B 5/6C, which you can then respond to with 1/6C 5/6A, etc.
Yeah, "3 parties with cyclic preferences" is like the aqua regia of voting systems. Unfortunately I think it means you have to replace the easy question of "is it strategy-proof" with a hard question like "on some reasonable distribution of preferences, how much strategy does it encourage?"
Epistemic status: I don't actually understand what strategic voting means, caveat lector
Suppose we have three voters, one who prefers A>B>C, another B>C>A, the third C>A>B. And suppose our preferences are that the middle one is 0.8 (on a scale where the top one is 1 and bottom 0).
Fred and George will be playing a mixed Nash equilibrium where they randomize between catering to A, B, or C preferrers - treating them as a black box, the result will be a 1/3 chance of each, and all voters get expected utility 0.6.
But suppose I'm the person with A>B>C, and I can predict how the other people will vote. Should I change my vote to get a better result? What happens if I vote B>C>A, putting my own favorite candidate at the bottom of the list? Well, now the Nash equilibrium for Fred and George is 100% B, because the C preferrer is outvoted 2 to 1, and I'll get utility 0.8, so I should vote strategically.
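(A little sketch just double-checking those numbers, from the A>B>C voter's perspective - the 1 / 0.8 / 0 utilities are the made-up scale from above:)

```python
my_utility = {"A": 1.0, "B": 0.8, "C": 0.0}

def expected_utility(winner_probs):
    return sum(p * my_utility[c] for c, p in winner_probs.items())

truthful = {"A": 1/3, "B": 1/3, "C": 1/3}  # Fred and George randomize over the three blocs
strategic = {"B": 1.0}                     # I report B>C>A, so both candidates cater to B

print(round(expected_utility(truthful), 2))  # 0.6
print(expected_utility(strategic))           # 0.8 -> voting strategically pays off here
```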
Just riffing on this rather than starting a different comment chain:
If alignment_1 is "get AI to follow instructions" (as typically construed in a "good enough" sort of way) and alignment_2 is "get AI to do good things and not bad things" (also in a "good enough" sort of way, but with more assumed philosophical sophistication), I basically don't care about anyone's safety plan to get alignment_1 except insofar as it's part of a plan to get alignment_2.
Philosophical errors/bottlenecks can mean you don't know how to go from 1 to 2. Human safety problems are what stop you from going from 1 to 2 even if you know how, or stop you from trying to find out how.
The checklist has a space for "nebulous future safety case for alignment_1," which is totally fine. I just also want a space for "nebulous future safety case for alignment_2" at the least (some earlier items explicitly about progressing towards that safety case can be extra credit). Different people might have different ideas about what form a plan for alignment_2 takes (will it focus on the structure of the institution using an aligned AI, or will it focus on the AI and its training procedure directly?), and where having it should come in the timeline, but I think it should be somewhere.
Part of what makes power corrupting insidious is it seems obvious to humans that we can make everything work out best so long as we have power - that we don't even need to plan for how to get from having control to actually getting good things control was supposed to be an instrumental goal for.
I am confused by the apparent expectation by many people both pro- and anti- that $1000 per month was going to cause poor people age 20-40 to have way better overall health and way higher healthcare spending.
Maybe there's some mental model going around that a key thing poor people really need to spend their next marginal dollar on is their health, and if they do it will be impressively effective, but if they don't then the poor people have caused UBI to fail by not spending their money rationally? This is surprising to me on all counts, but perhaps I haven't seen the same evidence about healthcare spending that segment of the predictors have.
I'll admit, my mental image for "our universe + hypercomputation" is a sort of webnovel premise, where we're living in a normal computable universe until one day by fiat an app poofs into existence on your phone that lets you enter a binary string or file and instantaneously get the next bit with minimum description length in binary lambda calculus. Aside from the initial poofing and every usage of the app, the universe continues by its normal rules.
But there's probably simpler universes (by some handwavy standard) out there that allow enough hypercomputation that they can have agents querying minimum description length oracles, but not so much that agents querying MDL oracles can no longer be assigned short codes.
I agree and yet I think it's not actually that hard to make progress.
There is no canonical way to pick out human values,[1] and yet using an AI to make clever long-term plans implicitly makes some choice. You can't dodge choosing how to interpret humans, if you think you're dodging it you're just doing it in an unexamined way.
Yes, humans are bad at philosophy and are capable of making things worse rather than better by examining them. I don't have much to say other than get good. Just kludging together how the AI interprets humans seems likely to lead to problems to me, especially in a possible multipolar future where there's more incentive for people to start using AI to make clever plans to steer the world.
This absolutely means disposing of appealing notions like a unique CEV, or even an objectively best choice of AI to build, even as we make progress on developing standards for good AI to build.
[1] See the Reducing Goodhart sequence for more on this, which starts sketching some ways to deal with humans not being agents.