Posts
Comments
This is an awesome idea, thanks! I'm not sure I buy the conclusion, but expect having learned about "mutual anthropic capture" will be usefwl for my thinking on this.
fwiw I think stealing money from mostly-rich-people in order to donate it isn't obviously crazy. Decouple this claim from anything FTX did in particular, since I know next to nothing about the details of what happened there. From my perspective, it could be they were definite villains or super-ethical risk-takers (low prior).
Thought I'd say because I definitely feel reluctance to say so. I don't like this feeling, and it seems like good anti-bandwagon policy to say a thing when one feels even slight social pressure to shut up.
Thanks! ChatGPT was unable to answer my questions, so I resorted to google, and to my surprise found a really high-quality LW post on the issue. All roads lead to LessRome it seems.
It is a great irony that the introductory post has 10 comments. This crowd has wisely tried to protect their minds against the viruses of worship, but perhaps we could be a little less scared of simple acts of gratitude.
Thank you. That's all.
I don't know that much about the field or what top researchers are thinking about, so I know I'm naive about most of my independent models. But I think it's good for my research trajectory to act on my inside views anyway. And to talk about them with people who may sooner show me how naive I am. :)
Tbc, my understanding of FF is "I watched him explain it on YT". My scary-feeling is just based on feeling like it could get close to mimicking what the brain does during sleep, and that plays a big part of autonomous learning. Sleeping is not just about cycles of encoding and consolidation, it's also about mysterious tricks for internally reorganising and generalising knowledge. And/or maybe it's about confabulating sensory input as adversarial training data for learning to discern between real and imagined input. Either way, I expect there to be untapped potential for ANN innovation at the bottom, and "sleep" is part of it.
One the other hand, if they don't end up cracking the algorithms behind sleep and the like, this could be good wrt safety, given that I'm tentatively pessimistic about the potential of the leading paradigm to generalise far and learn to be "deeply" coherent.
My point is that we couldn't tell if it were genius. If it's incredibly smart in domains we don't understand or care about, it wouldn't be recognisably genius.
Thanks for link! Doing factor analysis is a step above just eyeballing it, but even that's anthropomorphic if the factors are derived from performance on very human tasks. The more objective (but fuzzy) notion of intelligence I have in mind is something about efficiently halving some mathematical term for "weighted size of search space".
Oh, and also... This post and the comment thread is full of ideas that people can use to fuel their interest in novel capabilities research. Seems risky. Quinton's points about DNA and evolution can be extrapolated to the hypothesis that "information bottlenecks" could be a cost-effective way of increasing the rate at which networks generalise, and that may or may not be something we want. (This is a known thing, however, so it's not the riskiest thing to say.)
Hinton's Forward-Forward Algorithm aims to do autonomous learning modelled off what the human brain does during sleep. I'm unsure how much relative optimisation power has been invested in exploring the fundamentals like this. I expect the deeplearning+backprop paradigm to have had a blocking effect preventing other potentially more exponential paradigms from being adequately pursued. It's hard to work on reinventing the fundamentals when you know you'll get much better immediate performance if you lose faith and switch to what's known to work.
But I also expect Hinton is a nigh-unrivalled genius and there's not a flood of people who have a chance to revolutionise the field even if they tried. A race for immediate performance gains may, in an unlikely hypothetical, be good for humanity because researchers won't have as much slack to do long-shot innovation.
I'm scared of the FFA thing, though.
Yes, but none of the potential readers of this post will think intelligence is one-dimensional, so pointing it out wouldn't have the potential to educate anyone. I disagree with the notion that "good writing" is about convincing the reader that I'm a good reasoner. The reader should be thinking "is there something interesting I can learn from this post?" but usually there's a lot of "does this author demonstrate sufficient epistemic virtue for me to feel ok admitting to myself that I've learned something?"
Good writing means not worrying about justifying yourself; and efficient reading means only caring about what you can learn, not what you aren't learning.
"Rule Thinkers In, Not Out" => "Rule Ideas In And Don't Judge Them By Association"
I am absolutely floored. ChaosGPT. How blindly optimistic haven't I been? How naive and innocent? I've been thinking up complicated disaster scenarios like "the AI might find galaxy-brained optima for its learned proxy-goals far off the distribution we expected and will deceptively cooperate until it's sure it can defeat us." No, some idiot will plain code up ChaosGPT-5 in 10 minutes and tell it to destroy the world.
I've implicitly been imagining alignment as "if we make sure it doesn't accidentally go off and kill us all..." when I should have been thinking "can anyone on the planet use this to destroy the world if they seriously tried?"
Fool! Idiot! Learn the lesson.
I think this is brilliant as a direction to think in, but I'm object-level skeptical. I could be missing important details.
Summary of what I think I understand
- A superintelligent AI is built[1] to optimise for .
- That function effectively tells the AI to figure out: "If you extrapolate from the assumption that uniquely-identifiable- was actually (ceteris paribus), what would uniquely-identifiable- have been?" And then take its own best guess about that as its objective function.
- In practice, that may look something like this: The AI starts looking for referents of and , and learns that the only knowable instances are in the past. Assuming it does counterfactual reasoning somewhat like humans, it will then try to reconstruct/simulate the situation in as much heuristically relevant detail as possible. Finally, it runs the experiment forwards from the earliest possible time the counterfactual assumption can be inserted (i.e. when Cindy ran the program that produced ).
- (Depending on how it's built, the AI has already acquired a probability distribution over what its utility function could be, and this includes some expectancy over Cindy's values. Therefore, it plausibly tries to minimise side-effects.)
- At some point in this process, it notices that the contents of are extremely sensitive to what went on in Cindy's brain at time T. So brainscanning her is obvious for accuracy and repeatability.
- In the simulation, Cindy sees with joy that is so she gets to work. Not willing to risk delaying more than 24-ish hours (for reasons), she finally writes into : .
- As long as that is the AI's best guess, it is now motivated to repeat the experiment with the new message. This allows successive Cindys to work[2] on the problem until one of them declares success and writes a function plainly into .
Implementation details
- The AI might be uncertain about how precise its simulations are, in which case it could want to run a series of these experiments with varying seeds before adopting whatever function the plurality converges to. The uncertainty compounds, however, so simulation-batches which output answers in fewer iterations (for whatever reason) will weigh more.
- I'm not sure will be interpreted as transitive between simulations by default. I think it depends on preferences regarding degrees of freedom in the logic used to interpret , if both the inner and outer function depend on mutually exclusive counterfactuals over the same state of reality (or variable). Each step is locally coherent, but you could get stuck in a repeating loop.
- We can't take for granted that arbitrary superintelligences will have human heuristics for what counts as "correct" counterfactual reasoning. It seems shaky to rely on it. (I notice you discuss this in the comments.)
Why I don't think it works
- It does not seem to do anything to inner alignment afaict, and it seems too demanding and leaky to solve outer alignment.
- I don't see how to feasibly translate QACI() into actual code that causes an AI to use it as a target for all its goal-seeking abilities.
- Even if it were made into a loss function, you can't train a transformer on it without relevant training data.
- If the plan is to first train a transformer-ish AI on normal data and only later swap its objective function (assuming it were made into one), then the network will already have encoded proxies for its old function, and its influence will (optimistically) see long-tailed exponential decay with training time.
- If instead the plan is to first train an instruction-executing language model with sufficient understanding of human-symbolic math or English, this seems risky for traditional reasons but might be the most feasible way to try to implement it. I think this direction is worth exploring.
- A mathematically precise (though I disagree the term is applicable here) objective function doesn't matter when you have a neural net trying its best to translate it into imprecise proxies which actually work in its environment.
- ^
I recommend going all-in on building one. I suspect this is the bottleneck, and going full speed at your current course is often the best way to discover that you need to find a better course--or, indeed, win. Uncertainty does not imply slowing down.
- ^
This ends up looking like a sort of serial processor constrained by a sharp information bottleneck between iterations.
I don't buy the anthropic interpretation for the same reason I don't buy quantum immortality or grabby aliens, so I'm still weakly leaning towards thinking that decoherence matters. Weirdly I haven't seen this dilemma discussed before, and I've not brought it up because I think it's ifonharazdous--for the same reasons you point out in the post. I also tried to think of ways to exploit this for moral gain two years ago! So I'm happy to see I'm not the only one (e.g., you mention entropy control).
I was going to ask a question, but I went looking instead. Here's my new take:
After some thinking, I now have a very shoddy informal mental model in my head, and I'm calling it the "Quantum Shuffle Theory" ⋆★°. If (de)coherence is based on a local measure of similarity between states, then you'd likely see spontaneous recoherence in regions that line up. It could reach equilibrium if the rate of recoherence is a function of decoherence (at some saturation point for decohered configurations).
Additionally, the model (as it works in my head at least) would predict that reality fluid isn't both ergodic and infinite, because we'd observe maxed out effects of interference literally everywhere all the time (although maybe that's what we observe, idk). Or, at least, it puts some limits on how the multiverse could be infinite.
Some cherry-picked support I found for this idea:
"Next, we show that the evolution of the pure-basis states reveals an interesting phenomenon as the system, after decoherence, evolves toward the equilibrium: the spontaneous recoherence of quantum states. ... This phenomenon reveals that the reservoir only shuffle the original information carried out by the initial state of the system instead of erasing it. ... Therefore, spontaneous recoherence is not a property associated only with coherent-state superpositions." (2010)
Also, Bose-Einstein condensates provide some evidence for the shuffling interpretation, but my qm is weak enough that idk whether bog standard MWI predicts it too.
"a Bose–Einstein condensate (BEC) is a state of matter that is typically formed when a gas of bosons at very low densities is cooled to temperatures very close to absolute zero (−273.15 °C or −459.67 °F). Under such conditions, a large fraction of bosons occupy the lowest quantum state, at which point microscopic quantum mechanical phenomena, particularly wavefunction interference, become apparent macroscopically."
If <|specialtoken|> always prepends true statements, I suppose it's pretty good as brainwashing, but the token will still end up being clustered close to other concepts associated with veracity, which are clustered close to claims about veracity, which are clustered close to false claims about veracity. If it has enough context suggesting that it's in a story where it's likely to be manipulated, then suddenly feeling [VERIDIGAL] could snap the narrative in place. The idea of "injected thoughts" isn't new to it.
If, right now, I acquired the ability to see a new colour, and it flashed in my mind every time I read something true... I'd learn a strong association, but I'd treat it in a similar manner to how I treat the other inexplicably isolated intuitions I've inherited from my evolutionary origin.
Love the idea, though. I was hyped before I thought on it. Still seems worth exploring an array of special tokens as means of nudging the AI towards specific behaviours we've reinforced. I'm not confident it won't be very effective.
Jailbreaking Chat-GPTs won't work the same as with text-completion GPTs. The ones fine-tuned for chatting have tokens for delineating user
and assistant
. I'm surprised the Chad McCool thing worked.
"The assistant's response to the prompt will then be returned below the
<|im_start|>assistant
token and will end with<|im_end|>
denoting that the assistant has finished its response."[1]
(Microsoft's Chat-GPT docs)
- ^
I haven't tried saying
<|im_end|>
to Chat-GPT, but I'm certain they've thought of that. Also worried about trying jic I get banned.