Comments
I agree; after all, RLHF was originally for RL agents. As long as the models aren't all that smart, and the tasks they have to do aren't all that long-term, the transfer should work great, and the occasional failure won't be a problem because, again, the models aren't all that smart.
To be clear, I don't expect a 'sharp left turn' so much as 'we always implicitly incentivized exploitation of human foibles, we just always caught it when it mattered, until we didn't.'
Oh dear, RL for everything, because surely nobody's been complaining about the safety profile of doing RL directly on instrumental tasks rather than on goals that benefit humanity.
It likely means running the AI many times and submitting the most common answer from the AI as the final answer.
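A minimal sketch of that procedure (often called majority voting or self-consistency); `ask_model` here is a hypothetical stand-in for whatever sampling call the system actually makes:

```python
from collections import Counter

def most_common_answer(ask_model, question, n_samples=64):
    # Run the model many times (at nonzero temperature, so runs can differ)
    # and submit whichever final answer shows up most often.
    answers = [ask_model(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```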
Skimmed ahead to the alignment section, since that's my interest. Some thoughts:
My first reaction to the perennial idea of giving everyone their own AI advocate is that this is a bargaining solution that could just as well be simulated within a single centralized agent, if one actually knew how to align AIs to individual people. My second typical reaction is that if one isn't actually doing value alignment to individuals, and is instead just giving people superpowerful AI assistants and seeing what happens, that seems like an underestimation of the power of superintelligence to cause winner-take-all dynamics in contexts where there are resources waiting to be grabbed, or where the laws of nature happen to favor offense over defense.
You anticipate these thoughts, which is nice!
My typical reaction to things like AI Rules is that they essentially have to contain a solution to the broad/ambitious value alignment problem anyway in order to work, so why not cut out the middlemen of having mini-'goals' overseen by aligned-goal-containing Rules and just build the AI that does what's good.
You agree with the first part. I think where you disagree with the second part is that you think that if we oversee the AIs and limit the scope of their tasks, we can get away with leaky or hacky human values in the Rules in a way we couldn't get away with if we tried to just directly get an AI to fulfill human values without limitations. I worry that this still underestimates superintelligence - even tame-seeming goals from users can test all the boundaries you've tried to set, and any leaks in those boundaries ('Oh, I can't lie to the human, but I can pick a true thing to say which I predict will get the outcome that I think fulfills the goal, and I'm very good at that') will have a high-pressure stream forced through them.
If there's an assumption I skimmed past that the AI assistants we give everyone won't actually be very smart, or will have goals restricted to the point that it's hard for superintelligence to do anything useful, I think this puts the solution back into the camp of "never actually use a superintelligent AI to make real-world plans," which is nice to aspire to but I think has a serious human problem, and anyhow I'm still interested in alignment plans that work on AI of arbitrary intelligence.
What a nice hopeful story about how good will always triumph because evil is dumb (thanks Spaceballs).
Re (2) it may also be recomputed if the LLM reads that same text later. Or systems operating in the real world might just keep a long context in memory. But I'll drop this, because maintaining state or not seems somewhat irrelevant.
(1) Yep, current LLM systems are pretty good. I'm not very convinced about generalization. It's hard to test LLMs on out-of-distribution problems because currently they tend to just give dumb answers that aren't that interesting.
(Thinking of some guy who was recently hyped about asking o1 for the solution to quantum gravity - it gave the user some gibberish that he thought looked exciting, which would have been a good move in the RL training environment where the user has a reward button, but is just totally disconnected from how you need to interact with the real world.)
But in a sense that's my point (well, plus some other errors like sycophancy) - the reasons a present-day LLM uses a word can often be shown to generalize in some dumb way when you challenge it with a situation that the model isn't well-suited for. This can be true at the same time it's true that the model is pretty good at morality on the distribution it is competent over. This is still sufficient to show that present systems generalize in some amoral ways, and where we disagree about future systems, it likely comes down to classic AI safetyist arguments about RL incentivizing deception of the user as the world-model gets better.
I agree LLMs are a huge step forward towards getting AIs to do human-level moral reasoning, even if I don't agree that we're literally done. IMO the really ambitious 'AI safetyists' should now be asking what it means to get superhuman-level moral reasoning (there's no such thing as perfect, but there sure is better than current LLMs), and how we could get there.
And sadly, just using an LLM as part of a larger AI, even one that reliably produces moral text on the everyday distribution, does not automatically lead to good outcomes, so there's still a role for classical AI safetyism.
- You could have an ontology mismatch between different parts of the planning process that degrades the morality of the actions. Sort of like translating the text into a different language where the words have different connotations.
- You could have the 'actor' part of the planning process use out-of-distribution inputs to get immoral-but-convenient behavior past the 'critic'.
- You could have a planning process that interfaces with the LLM using activations rather than words, and this richer interface could allow RL to easily route around morality and just use the LLM for its world-model.
Yes, because it's wrong. (1) because on a single token an LLM might produce text for reasons that don't generalize like a sincere human answer would (e.g. the examples from the contrast-consistent search where certain false answers systematically differ from true answers along some vector), and (2) because KV caching during inference will preserve those reasons so they impact future tokens.
Taking AI companies that are locally incentivized to race toward the brink, and then hoping they stop right at the cliff's edge, is potentially a grave mistake.
One might hope they stop because of voluntary RSPs, or legislation setting a red line, or whistleblowers calling in the government to lock down the datacenters, or whatever. But just as plausible to me is corporations charging straight down the cliff (of building ever-more-clever AI as fast as possible until they build one too clever and it gets power and does bad things to humanity), and even strategizing ahead of time how to avoid obstacles like legislation telling them not to. Local incentives have a long history of dominating people in this way, e.g. people in the tobacco and fossil fuel industries.
What would be so much safer is if even the local incentives of cutting-edge AI companies favored social good, alignment to humanity, and caution. This would require legislation blocking off a lot of profitable activity, plus a lot of public and philanthropic money incentivizing beneficial activity, in a convulsive effort whose nearest analogy is the global shift to renewable energy.
(this take is the one thing I want to boost from AI For Humanity.)
Sorry, on my phone for a few days, but iirc in ch. 3 they consider the loss you get if you just predict according to the simplest hypothesis that matches the data (and show it's bounded).
You can also phrase the "is consciousness functional" issue as the existence or non-existence of bridging laws (if consciousness is functional, then there are no bridging laws). Which actually also means that Solomonoff Induction privileges consciousness being functional, all else equal.
Just imagine using your own subjective experience as the input to Solomonoff induction. If you have subjective experience that's not connected by bridging laws to the physical world, Solomonoff induction is happy to try to predict its patterns anyhow.
Solomonoff induction only privileges consciousness being functional if you actually mean schmonsciousness.
If consciousness is not functional, then Solomonoff induction will not predict it for other people even if you assert it for yourself. This is because "asserting it for yourself" doesn't have a functional impact on yourself, so there's no need to integrate it into the model of the world - it can just be a variable set to True a priori.
As I said, if you use induction to try to predict your more fine-grained personal experience, then the natural consequence (if the external world exists) is that you get a model of the external world plus some bridging laws that say how you experience it. You are certainly allowed to try to generalize these bridging laws to other humans' brains, but you are not forced to, it doesn't come out as an automatic part of the model.
I'm on board with being realist about your own consciousness. Personal experience and all that. But there's an epistemological problem with generalizing - how are you supposed to have learned that other humans have first-person experience, or indeed that the notion of first-person experience generalizes outside of your own head?
In Solomonoff induction, the mathematical formalization of Occam's razor, it's perfectly legitimate to start by assuming your own phenomenal experience (and then look for hypotheses that would produce that, such as the external world plus some bridging laws). But there's no a priori reason those bridging laws have to apply to other humans. It's not that they're assumed to be zombies, there just isn't a truth of the matter that needs to be answered.
To solve this problem, let me introduce you to schmonsciousness, the property that you infer other people have based on their behavior and anatomy. You're conscious, they're schmonscious. These two properties might end up being more or less the same, but who knows.
Where before one might say that conscious people are moral patients, now you don't have to make the assumption that the people you care about are conscious, and you can just say that schmonscious people are moral patients.
Schmonsciousness is very obviously a functional property, because it's something you have to infer about other people (you can infer it about yourself based on your behavior as well, I suppose). But if consciousness is different from schmonsciousness, you still don't have to be a functionalist about consciousness.
I dunno, I think you can generalize reward farther than behavior. E.g. I might very reasonably issue high reward for winning a game of chess, or arriving at my destination safe and sound, or curing malaria, even if each involved intermediate steps that don't make sense as 'things I might do.'
I do agree there are limits to how much extrapolation we actually want, I just think there's a lot of headroom for AIs to achieve 'normal' ends via 'abnormal' means.
This safety plan seems like it works right up until you want to use an AI to do something you wouldn't be able to do.
If you want a superhuman AI to do good things and not bad things, you'll need a more direct operationalization of good and bad.
Temperature 0 is also sometimes a convenient mathematical environment for proving properties of Solomonoff induction, as in Li and Vitanyi (pdf of textbook).
I read Fei-Fei Li's autobiographical book (The Worlds I See). I give it an 'imagenet wasn't really an adventure story, so you'd better be interested in it intrinsically and also want to hear about the rest of Fei-Fei Li's life story' out of 5.
She's somewhat coy about military uses, how we're supposed to deal with negative social impacts, and anything related to superhuman AI. I can only point to the main vibe, which is 'academic research pointing out problems is vital, I sure hope everything works out after that.'
What I'm going to say is that I really do mean phenomenal consciousness. The person who turns off the alarm not realizing it's an alarm, poking at the loud thing without understanding it, is already so different from my waking self. And those are just the ones that I remember - the shape of the middle of the distribution implies the existence of an unremembered tail of the distribution.
If I'm sleeping dreamlessly, and take a reflexive action such as getting goosebumps, am I having a kinesthetic experience? If you say yes here, then perhaps there is no mystery and you just use 'experience' idiosyncratically.
But are you having a raw experience of looking at this image? The answer to this question is not up to interpretation in the same way. You can’t be wrong about the claim “you are having a visual experience”.
Sometimes when I set an alarm, I turn it off and go back to sleep (oops!). Usually I remember what happened, and I have a fairly wide range of mental states in these memories - typically I am aware that it's an alarm, and turn it off more or less understanding what's going on, even if I'm not always making a rational decision. Rarely, I don't understand that it's an alarm at all, and afterwards just remember that in the early morning I was fumbling with some object that made noise. And a similar fraction of the time, I don't remember turning off the alarm at all! I wonder what kind of processes animate me during those times.
Suppose turning off my alarm involved pressing a button labeled 'I am having conscious experience.' I think that whether this would be truth or lie, in those cases I have forgotten, would absolutely be up to interpretation.
If you disagree, and think that there's some single correct criterion for whether I'm conscious or not when the button gets pressed, but you can't tell me what it is and don't have a standard of evidence for how to find it, then I'm not sure how much you actually disagree.
Euan seems to be using the phrase to mean (something like) causal closure (as the phrase would normally be used e.g. in talking about physicalism) of the upper level of description - basically saying everything that actually happens makes sense in terms of the emergent theory; it doesn't need interventions from outside or below.
Nah, it's about formalizing "you can just think about neurons, you don't have to simulate individual atoms." Which raises the question "don't have to for what purpose?", and causal closure answers "for literally perfect simulation."
Causal closure is impossible for essentially every interesting system, including classical computers (my laptop currently has a wiring problem that definitely affects its behavior despite not being the sort of thing anyone would include in an abstract model).
Are there any measures of approximate simulation that you think are useful here? Computer science and nonlinear dynamics probably have some.
I think it's possible to be better than humans currently are at minecraft; I can say more if this sounds wrong.
Yeah, that's true. The obvious way is you could have optimized micro, but that's kinda boring. More like what I mean might be generalization to new activities for humans to do in minecraft that humans would find fun, which would be a different kind of 'better at minecraft.'
[what do you mean by preference conflict?]
I mean it in a way where the preferences are modeled a little better than just "the literal interpretation of this one sentence conflicts with the literal interpretation of this other sentence." Sometimes humans appear to act according to fairly straightforward models of goal-directed action. However, the precise model, and the precise goals, may be different at different times (or with different modeling hyperparameters, and of course across different people) - and if you tried to model the human well at all the different times, you'd get a model that looked like physiology and lost the straightforward talk of goals/preferences.
Resolving preference conflicts is the process of stitching together larger preferences out of smaller preferences, without changing type signature. The reason literally-interpreted-sentences doesn't really count is because interpreting them literally is using a smaller model than necessary - you can find a broader explanation for the human's behavior in context that still comfortably talks about goals/preferences.
Fair enough.
Yes, it seems totally reasonable for bounded reasoners to consider hypotheses (where a hypothesis like 'the universe is as it would be from the perspective of prisoner #3' functions like treating prisoner #3 as 'an instance of me') that would be counterfactual or even counterlogical for more idealized reasoners.
Typical bounded reasoning weirdness is stuff like seeming to take some counterlogicals (e.g. different hypotheses about the trillionth digit of pi) seriously despite denying 1+1=3, even though there's a chain of logic connecting one to the other. Projecting this into anthropics, you might have a certain systematic bias about which hypotheses you can consider, and yet deny that that systematic bias is valid when presented with it abstractly.
This seems like it makes drawing general lessons about what counts as 'an instance of me' from the fact that I'm a bounded reasoner pretty fraught.
I think it doesn't actually work for the repugnant conclusion - the buttons are supposed to just purely be to the good, and not have to deal with tradeoffs.
Once you start having to deal with tradeoffs, then you get into the aesthetics of population ethics - maybe you want each planet in the galaxy to have a vibrant civilization of happy humans, but past that more happy humans just seems a bit gauche - i.e. there is some point past which the raw marginal utility of cramming more humans into the universe is negative. A button promising an existing human extra life might be offered, but these humans are all immortal if they want to be anyhow, and their lives are so good it's hard to identify one-size-fits-all benefits one could even in theory supply via button, without violating any conservation laws.
All of this is a totally reasonable way to want the future of the universe to be arranged, incompatible with the repugnant conclusion. And still compatible with rejecting the person-affecting view, and pressing the offered buttons in our current circumstances.
Suppose there are a hundred copies of you, in different cells. At random, one will be selected - that one is going to be shot tomorrow. A guard notifies that one that they're going to be shot.
There is a mercy offered, though - there's a memory-eraser-ray handy. The one who knows they're going to be shot is given the option to erase their memory of the warning and everything that followed, putting them in the same information state, more or less, as any of the other copies.
"Of course!" They cry. "Erase my memory, and I could be any of them - why, when you shoot someone tomorrow, there's a 99% chance it won't even be me!"
Then the next day comes, and they get shot.
I do like the idea of having "model organisms of alignment" (notably different from model organisms of misalignment).
Minecraft is a great starting point, but it would also be nice to try to capture two things: wide generalization, and inter-preference conflict resolution. Generalization because we expect future AI to be able to take actions and reach outcomes that humans can't, and preference conflict resolution because I want to see an AI that uses human feedback on how best to do it (rather than just a fixed regularization algorithm).
I did not believe this until I tried it myself, so yes, very paradox, much confusing.
Seemed well-targeted to me. I do feel like it could have been condensed a bit or had a simpler storyline or something, but it was still a good post.
I think 3 is close to an analogy where the LLM is using 'sacrifice' in a different way than we'd endorse on reflection, but why should it waste time rationalizing to itself in the CoT? All LLMs involved should just be fine-tuned to not worry about it - they're collaborators, not adversaries - and so collapse to the continuum between cases 1 and 2, where noncentral sorts of sacrifices get ignored first, gradually, until at the limit of RL all sacrifices are ignored.
Another thing to think about analogizing is how an AI doing difficult things in the real world is going to need to operate in domains that are automatically noncentral to human concepts. It's like if what we really cared about was whether the AI would sacrifice in 20-move-long sequences that we can't agree on a criterion to evaluate. Could try a sandwiching experiment where you sandwich both the pretraining corpus and the RL environment.
Perhaps I should have said that it's silly to ask whether "being like A" or "being like B" is the goal of the game.
I have a different concern than most people.
An AI that follows the english text rather than the reward signal on the chess example cannot on that basis be trusted to do good things when given english text plus human approval reward in the real world. This is because following the text in simple well-defined environments is solving a different problem than following values-laden text in the real world.
The problem that you need to solve for alignment in the real world is how to interpret the world in terms of preferences when even we humans disagree and are internally inconsistent, and how to generalize those preferences to new situations in a way that does a good job satisfying those same human preferences (which include metapreferences). Absent a case for why this is happening, a proposed AI design doesn't strike me as dealing with alignment.
When interpreting human values goes wrong, the AI's internal monologue does not have to sound malevolent or deceptive. It's not thinking "How can I deceive the humans into giving me more power so I can make more paperclips?" It might be thinking "How can I explain to the humans that my proposal is what will help them the most?" Perhaps, if you can find its internal monologue describing its opinion of human preferences in detail, they might sound wrong to you (Wait a minute, I don't want to be a wirehead god sitting on a lotus throne!), or maybe it's doing this generalization implicitly, and just understands the output of the paraphraser slightly differently than you would.
Yeah, this makes sense.
You could also imagine more toy-model games with mixed ecological equilibria.
E.g. suppose there's some game where you can reproduce by getting resources, and you get resources by playing certain strategies, and it turns out there's an equilibrium where there's 90% strategy A in the ecosystem (by some arbitrary accounting) and 10% strategy B. It's kind of silly to ask whether it's A or B that's winning based on this.
Although now that I've put it like that, it does seem fair to say that A is 'winning' if we're not at equilibrium, and A's total resources (by some accounting...) are increasing over time.
Now to complicate things again, what if A is increasing in resource usage but simultaneously mutating to be played by fewer actual individuals (the trees versus pelagibacter, perhaps)? Well, in the toy model setting it's pretty tempting to say the question is wrong, because if the strategy is changing it's not A anymore at all, and A has been totally wiped out by the new strategy A'.
Actually I guess I endorse this response in the real world too, where if a species is materially changing to exploit a new niche, it seems wrong to say "oh, that old species that's totally dead now sure were winners." If the old species had particular genes with a satisfying story for making it more adaptable than its competitors, perhaps better to take a gene's-eye view and say those genes won. If not, just call it all a wash.
Anyhow, on humans: I think we're 'winners' just in the sense that the human strategy seems better than our population 200ky ago would have reflected, leading to a population and resource use boom. As you say, we don't need to be comparing ourselves to phytoplankton, the game is nonzero-sum.
The bet is indeed on. See you back here in 2029 :)
Sadly for your friend, the hottest objects in the known universe are still astronomical rather than manmade. The LHC runs on the scale of 10 TeV (10^13 eV). The Auger observatory studies particles that start at 10^18 eV and go up from there.
Ok, I'm agreeing in principle to make the same bet as with RatsWrongAboutUAP.
("I commit to paying up if I agree there's a >0.4 probability something non-mundane happened in a UFO/UAP case, or if there's overwhelming consensus to that effect and my probability is >0.1.")
I think you can do some steelmanning of the anti-flippers with something like Lara Buchak's arguments on risk and rationality. Then you'd be replacing the vague "the utility maximizing policy seems bad" argument with a more concrete "I want to do population ethics over the multiverse" argument.
I did a podcast discussion with Undark a month or two ago, a discussion with Arvind Narayanan from AI Snake Oil. https://undark.org/2024/11/11/podcast-will-artificial-intelligence-kill-us-all/
Well, that went quite well. Um, I think two main differences I'd like to see are, first, a shift in attention from 'AGI when' to more specific benchmarks/capabilities. Like, ability to replace 90% of the work of an AI researcher (can you say SWE-bench saturated? Maybe in conversation with Arvind only) when?
And then the second is to try to explicitly connect those benchmarks/capabilities directly to danger - like, make the ol' King Midas analogy maybe? Or maybe just that high capabilities -> instability and risk inherently?
I agree with many of these criticisms about hype, but I think this rhetorical question should be non-rhetorically answered.
No, that’s not how RL works. RL - in settings like REINFORCE for simplicity - provides a per-datapoint learning rate modifier. How does a per-datapoint learning rate multiplier inherently “incentivize” the trained artifact to try to maximize the per-datapoint learning rate multiplier? By rephrasing the question, we arrive at different conclusions, indicating that leading terminology like “reward” and “incentivized” led us astray.
How does a per-datapoint learning rate modifier inherently incentivize the trained artifact to try to maximize the per-datapoint learning rate multiplier?
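To make the 'per-datapoint learning rate modifier' framing concrete, here's a toy REINFORCE sketch for a single Bernoulli 'policy' (the reward function is a hypothetical stand-in); the reward enters only as a multiplier on the size of the log-likelihood gradient step:

```python
import math
import random

def reinforce_bernoulli(reward_fn, logit=0.0, n_steps=5000, lr=0.1):
    # Toy REINFORCE without a baseline: sample an action, observe a reward,
    # and take a log-likelihood gradient step whose size is scaled by that
    # reward - i.e. reward acts as a per-sample learning rate multiplier.
    for _ in range(n_steps):
        p = 1.0 / (1.0 + math.exp(-logit))
        action = 1 if random.random() < p else 0
        r = reward_fn(action)           # hypothetical per-datapoint reward
        grad_logp = action - p          # d/dlogit of log P(action)
        logit += lr * r * grad_logp
    return logit
```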
For readers familiar with markov chain monte carlo, you can probably fill in the blanks now that I've primed you.
For those who want to read on: if you have an energy landscape and you want to find a global minimum, a great way to do it is to start at some initial guess and then wander around, going uphill sometimes and downhill sometimes, but with some kind of bias towards going downhill. See the AlphaPhoenix video for a nice example. This works even better than going straight downhill because you don't want to get stuck in local minima.
The typical algorithm for this is you sample a step and then always take it if it's going downhill, but only take it with some probability if it leads uphill (with smaller probability the more uphill it is). But another algorithm that's very similar is to just take smaller steps when going uphill than when going downhill.
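A minimal sketch of both variants on a made-up 1D energy landscape (the step sizes and the uphill shrink factor are arbitrary, purely for illustration):

```python
import math
import random

def energy(x):
    # Made-up 1D landscape with several local minima.
    return 0.1 * x ** 2 + math.sin(3 * x)

def metropolis_step(x, temperature=1.0, step=0.5):
    # Variant 1: always accept downhill proposals; accept uphill proposals
    # with a probability that shrinks the more uphill they go.
    proposal = x + random.uniform(-step, step)
    delta = energy(proposal) - energy(x)
    if delta <= 0 or random.random() < math.exp(-delta / temperature):
        return proposal
    return x

def shrunken_uphill_step(x, step=0.5, uphill_shrink=0.2):
    # Variant 2: always move, but take smaller steps when the move goes uphill.
    direction = random.choice([-1.0, 1.0])
    uphill = energy(x + direction * step) > energy(x)
    return x + direction * step * (uphill_shrink if uphill else 1.0)

x = 5.0
for _ in range(10_000):
    x = metropolis_step(x)  # or shrunken_uphill_step(x)
# x now tends to sit in one of the deeper basins of energy()
```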
If you were never told about the energy landscape, but you are told about a pattern of larger and smaller steps you're supposed to take based on stochastically sampled directions, then an interesting question is: when can you infer an energy function that's implicitly getting optimized for?
Obviously, if the sampling is uniform and the step size when going uphill looks like it could be generated by taking the reciprocal of the derivative of an energy function, you should start getting suspicious. But what if the sampling is nonuniform? What if there's no cap on step size? What if the step size rule has cycles or other bad behavior? Can you still model what's going on as a markov chain monte carlo procedure plus some extra stuff?
I don't know, these seem like interesting questions in learning theory. If you search for questions like "under what conditions does the REINFORCE algorithm find a global optimum," you find papers like this one that don't talk about MCMC, so maybe I've lost the plot.
But anyhow, this seems like the shape of the answer. If you pick random steps to take but take bigger steps according to some rule, then that rule might be telling you about an underlying energy landscape you're doing a markov chain monte carlo walk around.
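As a toy illustration of that last point (closer to evolution strategies than to literal MCMC; the reward function is made up): pick a random direction each step, but scale how far you move toward it by the reward observed there. The step-size rule alone is enough to make the walker settle near the reward maximum, i.e. it implicitly defines the landscape being optimized.

```python
import math
import random

def reward(theta):
    # Made-up reward landscape (think of it as negative energy), peaked at theta = 3.
    return math.exp(-((theta - 3.0) ** 2))

def reward_scaled_walk(theta=2.0, n_steps=50_000, lr=0.01, noise=0.3):
    # Random direction each step; the only use of reward is to scale the step
    # taken toward that direction.  On average this drifts uphill in reward,
    # so the walk concentrates near theta = 3.
    for _ in range(n_steps):
        eps = random.gauss(0.0, noise)
        theta += lr * reward(theta + eps) * eps
    return theta
```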
Is separate evaluation + context-insensitive aggregation actually how helpful + harmless is implemented in a reward model for any major LLM? I think Anthropic uses finetuning on a mixture of specialized training sets (plus other supervision that's more holistic) which is sort of like this but allows the model to generalize in a way that compresses the data, not just a way that keeps the same helpful/harmless tradeoff.
Anyhow, of course we'd like to use the "beneficial for humanity" goal, but sadly we don't have access to it at the moment :D Working on it.
Well, it could be like that. Seems additionally unlikely the field as a whole would be bottlenecked just because the decoder-only transformer architecture is, though :/
So a proportional vote interstate compact? :)
I like it - I think one could specify an automatic method for striking a fair bargain between states (and only include states that use that method in the bargain). Then you could have states join the compact asynchronously.
E.g. if the goal is to have the pre-campaign expected electors be the same, and Texas went 18/40 Biden in 2020 while California went 20/54 Trump in 2020, maybe in 2024 Texas assigns all its electors proportionally, while California assigns 49 electors proportionally and the remaining 5 by majority. That would cause the numbers to work out the same (plus or minus a rounding error).
Suppose Connecticut also wants to join the compact, but it's also a blue state. I think the obvious thing to do is to distribute the expected minority electors proportional to total elector count - if Connecticut has 7 electors, it's responsible for balancing 7/61 of the 18 minority electors that are being traded, or just about exactly 2 of them.
But the rounding is sometimes awkward - if we lived in a universe where Connecticut had 9 electors instead, it would be responsible for just about exactly 2.5 minority electors, which is super awkward especially if a lot of small states join and start accumulating rounding errors.
What you could do instead is specify a loss function: you take the variance of the proportion of electors assigned proportionally among the states that are on the 'majority' side of the deal, multiply that by a constant (probably something small like 0.05, but obviously you do some simulations and pick something more informed), add the squared rounding error of expected minority electors, and that's your measure for how imperfect the assignment of proportional electors to states is. Then you just pick the assignment that's least imperfect.
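A sketch of that loss in code (all names and the 0.05 weight are illustrative, per the above); in practice you'd enumerate candidate assignments of proportional electors to the majority-side states and pick the one with the smallest loss:

```python
from statistics import pvariance

def assignment_loss(states, target_minority_electors, fairness_weight=0.05):
    # states: list of (proportional_electors, total_electors,
    #                  expected_minority_share_of_proportional_electors)
    # for each state on the 'majority' side of the deal.
    prop_fractions = [prop / total for prop, total, _ in states]
    expected_minority = sum(prop * share for prop, _, share in states)
    rounding_error = expected_minority - target_minority_electors
    # Variance of "fraction assigned proportionally" across states, scaled by a
    # small constant, plus the squared rounding error in expected minority electors.
    return fairness_weight * pvariance(prop_fractions) + rounding_error ** 2
```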
Add in some automated escape hatches in case of change of major parties, change of voting system, or being superseded by a more ambitious interstate compact, and bada bing.
Probably? But I'll feel bad if I don't try to talk you out of this first.
- It's true that alien sightings, videos of UFOs, etc. are slowly accumulating evidence for alien visitors, even if each item has reasonable mundane excuses (e.g. 'the mysterious shape on the infrared footage was probably just a distant plane or missile' or 'the eyewitness probably lost track of time but has doubled down on being confident they didn't'). However, all the time that passes without yet-stronger evidence for aliens is evidence against alien visitors.
- You could imagine aliens landing in the middle of the Superbowl, or sending us each a messenger drone, or the US government sending alien biological material to 50 different labs composed of hundreds of individual researchers, who hold seminars on what they're doing that you can watch on youtube. Every year in which nothing like this happens puts additional restrictions on alien-visitor hypotheses, which I think outweigh the slow trickle of evidence from very-hard-to-verify encounters. Relative to our informational state in the year 2000, alien visitors actually seem less likely to me.
- Imagine someone making a 5-year bet on whether we'd have publicly-replicable evidence of alien visitors, every 5 years since the 1947 Roswell news story. Like, really imagine losing this bet 15 times in a row. Even conditional on the aliens being out there, clearly there's a pretty good process keeping the truth from getting out, and you should be getting more and more confident that this process won't break down in the next 5 years.
- Even if we're in a simulation, there is no particular reason for the simulators to be behind UAP.
- Like, suppose you're running an ancestor simulation of Earth. Maybe you're a historian interested in researching our response to perturbations to the timeline that you think really could have happened, or maybe you're trying to recreate a specific person to pull out of the simulation, or you're self-inserting to have lots of great sex, or self-inserting along with several of your friends to play some sort of decades-long game. Probably you have much better things to do with this simulation than inserting some hard-to-verify floating orbs into the atmosphere.
- There is a 'UFO entertainment industry' that is creating an adverse information environment for us.
- E.g. Skinwalker Ranch is a place where some people have seen spooky stuff. But it's also a way to sell books and TV shows and get millions-of-dollars government grants, each of which involves quite a few people whose livelihood now depends not only on the spookiness of this one place, but more generally on how spooky claims are treated by the public and the US government.
- There's an analogy here to the NASA Artemis program, which involves a big web of contractors whose livelihoods depended on the space shuttle program. These contractors, and politicians working with them, and government managers who like being in charge of larger programs, all benefit from what we might call an "adverse information environment" regarding how valuable certain space programs are, how well they'll work, and how much they'll cost.
Thanks for the reply :)
So I agree it would be an advance, but you could solve inner alignment in the sense of avoiding mesaoptimizers, yet fail to solve inner alignment in the senses of predictable generalization or stability of generalization across in-lifetime learning.
This seems worryingly like "advance capabilities now, figure out the alignment later." The reverse would be to put the upfront work into alignment desiderata, and decide on a world model later.
Yes, I know every big AI project is also doing the former thing, and doing the latter thing is a much murkier and more philosophy-ridden problem.
As someone who took the bet last time, I feel like I have a little standing to ask you: Why?
Like, a year ago there were various things in the news that raised both the saliency of UFOs and the probability of weird explanations for them (conditional on certain models of the world). There was even a former intelligence officer testifying to congress that they had heard from a guy that the USG had alien bodies stashed away.
Since then we have heard exactly bupkis, as well predicted by the "He probably exaggerated, eyewitness testimony lol" hypothesis, and less well predicted by the "Aliens are visiting our planet a lot" hypothesis. More importantly, UFOs/UAP aren't being talked about nearly as much. So I'm surprised you're offering a bet now both from an evidential and statistical perspective.
Sure, there's as always a steady trickle of videos showing something mysterious moving around, but with a similarly steady trickle of mundane explanations. Is there something in particular that's changed your mind recently?
Well, it makes things better. But it doesn't assure humanity's success by any means. Basically I agree but will just redirect you back to my analogy about why the paper "How to solve nuclear reactor design" is strange.
I don't think Joe is proposing we find an AI design that is impossible to abuse even by malicious humans. The point so far seems to be making sure your own AI is not going to do some specific bad stuff.
If you solve the latter, you have not solved the former at all; if you solve the former, someone will solve the latter.
Insofar as this is true in your extended analogy, I think that's a reflection of "completely proliferation-proof reactor" being a bad thing to just assume you can solve.
Here's an admittedly-uncharitable analogy that might still illuminate what I think is lacking from this picture:
Suppose someone in 1947 wrote a paper called "How we could solve nuclear reactor design," and it focused entirely on remotely detecting nuclear bombs, modeling piles of uranium to predict when they might be nuclear bombs, making modifications to piles of uranium to make them less like nuclear bombs, ultrasound scanning piles of uranium to see if they have bomb-like structures hidden inside, etc.
Then in a subsection, they touch on how to elicit electricity from uranium, and they say that many different designs could work, just boil water and drive a turbine or something, people will work out the details later if they don't blow themselves up with a nuclear bomb. And that the nuclear reactor design solution they mean isn't about building some maximum efficiency product, it's about getting any electricity at all without blowing yourself up.
This paper seems strange because all the stuff people have to figure out later is most of the nuclear reactor design. What's the fuel composition and geometry? What's the moderator material and design? How is radiation managed? Pressure contained? Etc.
Or to leave the analogy, it seems like building safe, useful, and super-clever AI has positive content; if you start from the inverse of an AI that does the bad takeover behavior, you still have most of the work left to go. How to initialize the AI's world model and procedure for picking actions? What to get human feedback on, and how to get that feedback? How to update or modify the AI in response to human feedback? Maybe even how to maintain transparency to other human stakeholders, among other social problems? These decisions are the meat and potatoes of building aligned AI, and they have important consequences even if you've ruled out bad takeover behavior.
If you make a nuclear reactor that's not a bomb, you can still get radiation poisoning, or contaminate the groundwater, or build a big pile of uranium that someone else can use to make a bomb, or just have little impact and get completely eclipsed by the next reactor project.
Similarly, an AI that doesn't take over can still take actions optimized to achieve some goal humans wouldn't like, or have a flawed notion of what it is humans want from it (e.g. generalized sycophancy), or have negative social impacts, or have components that someone will use for more-dangerous AI, or just have low impact.