Posts
Comments
Depends on how much of a superintelligence, how implemented. I wouldn't be surprised if somebody got far superhuman theorem-proving from a mind that didn't generalize beyond theorems. Presuming you were asking it to prove old-school fancy-math theorems, and not to, eg, arbitrarily speed up a bunch of real-world computations like asking it what GPT-4 would say about things, etc.
Solution (in retrospect this should've been posted a few years earlier):
let
'Na' = box N contains angry frog
'Ng' = N gold
'Nf' = N's inscription false
'Nt' = N's inscription true
consistent states must have 1f 2t or 1t 2f, and 1a 2g or 1g 2a
then:
1a 1t, 2g 2f => 1t, 2f
1a 1f, 2g 2t => 1f, 2t
1g 1t, 2a 2f => 1t, 2t
1g 1f, 2a 2t => 1f, 2f
I currently guess that a research community of non-upgraded alignment researchers with a hundred years to work, picks out a plausible-sounding non-solution and kills everyone at the end of the hundred years.
I don't think that faster alignment researchers get you to victory, but uploading should also allow for upgrading and while that part is not trivial I expect it to work.
AI happening through deep learning at all is a huge update against alignment success, because deep learning is incredibly opaque. LLMs possibly ending up at the center is a small update in favor of alignment success, because it means we might (through some clever sleight, this part is not trivial) be able to have humanese sentences play an inextricable role at the center of thought (hence MIRI's early interest in the Visible Thoughts Project).
The part where LLMs are to predict English answers to some English questions about values, and show common-sense relative to their linguistic shadow of the environment as it was presented to them by humans within an Internet corpus, is not actually very much hope because a sane approach doesn't involve trying to promote an LLM's predictive model of human discourse about morality to be in charge of a superintelligence's dominion of the galaxy. What you would like to promote to values are concepts like "corrigibility", eg "low impact" or "soft optimization", which aren't part of everyday human life and aren't in the training set because humans do not have those values.
I have never since 1996 thought that it would be hard to get superintelligences to accurately model reality with respect to problems as simple as "predict what a human will thumbs-up or thumbs-down". The theoretical distinction between producing epistemic rationality (theoretically straightforward) and shaping preference (theoretically hard) is present in my mind at every moment that I am talking about these issues; it is to me a central divide of my ontology.
If you think you've demonstrated by clever textual close reading that Eliezer-2018 or Eliezer-2008 thought that it would be hard to get a superintelligence to understand humans, you have arrived at a contradiction and need to back up and start over.
The argument we are trying to explain has an additional step that you're missing. You think that we are pointing to the hidden complexity of wishes in order to establish in one step that it would therefore be hard to get an AI to output a correct wish shape, because the wishes are complex, so it would be difficult to get an AI to predict them. This is not what we are trying to say. We are trying to say that because wishes have a lot of hidden complexity, the thing you are trying to get into the AI's preferences has a lot of hidden complexity. This makes the nonstraightforward and shaky problem of getting a thing into the AI's preferences, be harder and more dangerous than if we were just trying to get a single information-theoretic bit in there. Getting a shape into the AI's preferences is different from getting it into the AI's predictive model. MIRI is always in every instance talking about the first thing and not the second.
You obviously need to get a thing into the AI at all, in order to get it into the preferences, but getting it into the AI's predictive model is not sufficient. It helps, but only in the same sense that having low-friction smooth ball-bearings would help in building a perpetual motion machine; the low-friction ball-bearings are not the main problem, they are a kind of thing it is much easier to make progress on compared to the main problem. Even if, in fact, the ball-bearings would legitimately be part of the mechanism if you could build one! Making lots of progress on smoother, lower-friction ball-bearings is even so not the sort of thing that should cause you to become much more hopeful about the perpetual motion machine. It is on the wrong side of a theoretical divide between what is straightforward and what is not.
You will probably protest that we phrased our argument badly relative to the sort of thing that you could only possibly be expected to hear, from your perspective. If so this is not surprising, because explaining things is very hard. Especially when everyone in the audience comes in with a different set of preconceptions and a different internal language about this nonstandardized topic. But mostly, explaining this thing is hard and I tried taking lots of different angles on trying to get the idea across.
In modern times, and earlier, it is of course very hard for ML folk to get their AI to make completely accurate predictions about human behavior. They have to work very hard and put a lot of sweat into getting more accurate predictions out! When we try to say that this is on the shallow end of a shallow-deep theoretical divide (corresponding to Hume's Razor) it often sounds to them like their hard work is being devalued and we could not possibly understand how hard it is to get an AI to make good predictions.
Now that GPT-4 is making surprisingly good predictions, they feel they have learned something very surprising and shocking! They cannot possibly hear our words when we say that this is still on the shallow end of a shallow-deep theoretical divide! They think we are refusing to come to grips with this surprising shocking thing and that it surely ought to overturn all of our old theories; which were, yes, phrased and taught in a time before GPT-4 was around, and therefore do not in fact carefully emphasize at every point of teaching how in principle a superintelligence would of course have no trouble predicting human text outputs. We did not expect GPT-4 to happen, in fact, intermediate trajectories are harder to predict than endpoints, so we did not carefully phrase all our explanations in a way that would make them hard to misinterpret after GPT-4 came around.
But if you had asked us back then if a superintelligence would automatically be very good at predicting human text outputs, I guarantee we would have said yes. You could then have asked us in a shocked tone how this could possibly square up with the notion of "the hidden complexity of wishes" and we could have explained that part in advance. Alas, nobody actually predicted GPT-4 so we do not have that advance disclaimer down in that format. But it is not a case where we are just failing to process the collision between two parts of our belief system; it actually remains quite straightforward theoretically. I wish that all of these past conversations were archived to a common place, so that I could search and show you many pieces of text which would talk about this critical divide between prediction and preference (as I would now term it) and how I did in fact expect superintelligences to be able to predict things!
There's perhaps more detail in Project Lawful and in some nearby stories ("for no laid course prepare", "aviation is the most dangerous routine activity").
Have you ever seen or even heard of a person who is obese who doesn't eat hyperpalatable foods? (That is, they only eat naturally tasting, unprocessed, "healthy" foods).
Tried this for many years. Paleo diet; eating mainly broccoli and turkey; trying to get most of my calories from giant salads. Nothing.
Received $95.51. :)
I am not - $150K is as much as I care to stake at my present weath levels - and while I refunded your payment, I was charged a $44.90 fee on the original transmission which was not then refunded to me.
Though I disagree with @RatsWrongAboutUAP (see this tweet) and took the other side of the bet, I say a word of praise for RatsWrong about following exactly the proper procedure to make the point they wanted to make, and communicating that they really actually think we're wrong here. Object-level disagreement, meta-level high-five.
Received.
My $150K against your $1K if you're still up for it at 150:1. Paypal to yudkowsky@gmail.com with "UFO bet" in subject or text, please include counterparty payment info if it's not "email the address which sent me that payment".
Key qualifier: This applies only to UFOs spotted before July 19th, 2023, rather than applying to eg future UFOs generated by secret AI projects which were not putatively flying around and spotted before July 19th, 2023.
ADDED: $150K is as much as I care to stake at my current wealth level, to rise to this bettors' challenge and make this point; not taking on further bets except at substantially less extreme odds.
TBC, I definitely agree that there's some basic structural issue here which I don't know how to resolve. I was trying to describe properties I thought the solution needed to have, which ruled out some structural proposals I saw as naive; not saying that I had a good first-principles way to arrive at that solution.
At the superintelligent level there's not a binary difference between those two clusters. You just compute each thing you need to know efficiently.
I sometimes mention the possibility of being stored and sold to aliens a billion years later, which seems to me to validly incorporate most all the hopes and fears and uncertainties that should properly be involved, without getting into any weirdness that I don't expect Earthlings to think about validly.
Lacking time right now for a long reply: The main thrust of my reaction is that this seems like a style of thought which would have concluded in 2008 that it's incredibly unlikely for superintelligences to be able to solve the protein folding problem. People did, in fact, claim that to me in 2008. It furthermore seemed to me in 2008 that protein structure prediction by superintelligence was the hardest or least likely step of the pathway by which a superintelligence ends up with nanotech; and in fact I argued only that it'd be solvable for chosen special cases of proteins rather than biological proteins because the special-case proteins could be chosen to have especially predictable pathways. All those wobbles, all those balanced weak forces and local strange gradients along potential energy surfaces! All those nonequilibrium intermediate states, potentially with fragile counterfactual dependencies on each interim stage of the solution! If you were gonna be a superintelligence skeptic, you might have claimed that even chosen special cases of protein folding would be unsolvable. The kind of argument you are making now, if you thought this style of thought was a good idea, would have led you to proclaim that probably a superintelligence could not solve biological protein folding and that AlphaFold 2 was surely an impossibility and sheer wishful thinking.
If you'd been around then, and said, "Pre-AGI ML systems will be able to solve general biological proteins via a kind of brute statistical force on deep patterns in an existing database of biological proteins, but even superintelligences will not be able to choose special cases of such protein folding pathways to design de novo synthesis pathways for nanotechnological machinery", it would have been a very strange prediction, but you would now have a leg to stand on. But this, I most incredibly doubt you would have said - the style of thinking you're using would have predicted much more strongly, in 2008 when no such thing had been yet observed, that pre-AGI ML could not solve biological protein folding in general, than that superintelligences could not choose a few special-case solvable de novo folding pathways along sharper potential energy gradients and with intermediate states chosen to be especially convergent and predictable.
Nope.
Well, one sink to avoid here is neutral-genie stories where the AI does what you asked, but not what you wanted. That's something I wrote about myself, yes, but that was in the era before deep learning took over everything, when it seemed like there was a possibility that humans would be in control of the AI's preferences. Now neutral-genie stories are a mindsink for a class of scenarios where we have no way to achieve entrance into those scenarios; we cannot make superintelligences want particular things or give them particular orders - cannot give them preferences in a way that generalizes to when they become smarter.
Okay, if you're not saying GPUs are getting around as efficient as the human brain, without much more efficiency to be eeked out, then I straightforwardly misunderstood that part.
Nothing about any of those claims explains why the 10,000-fold redundancy of neurotransmitter molecules and ions being pumped in and out of the system is necessary for doing the alleged complicated stuff.
Further item of "these elaborate calculations seem to arrive at conclusions that can't possibly be true" - besides the brain allegedly being close to the border of thermodynamic efficiency, despite visibly using tens of thousands of redundant physical ops in terms of sheer number of ions and neurotransmitters pumped; the same calculations claim that modern GPUs are approaching brain efficiency, the Limit of the Possible, so presumably at the Limit of the Possible themselves.
This source claims 100x energy efficiency from substituting some basic physical analog operations for multiply-accumulate, instead of digital transistor operations about them, even if you otherwise use actual real-world physical hardware. Sounds right to me; it would make no sense for such a vastly redundant digital computation of such a simple physical quantity to be anywhere near the borders of efficiency! https://spectrum.ieee.org/analog-ai
This does not explain how thousands of neurotransmitter molecules impinging on a neuron and thousands of ions flooding into and out of cell membranes, all irreversible operations, in order to transmit one spike, could possibly be within one OOM of the thermodynamic limit on efficiency for a cognitive system (running at that temperature).
And it says:
So true 8-bit equivalent analog multiplication requires about 100k carriers/switches
This just seems utterly wack. Having any physical equivalent of an analog multiplication fundamentally requires 100,000 times the thermodynamic energy to erase 1 bit? And "analog multiplication down to two decimal places" is the operation that is purportedly being carried out almost as efficiently as physically possible by... an axon terminal with a handful of synaptic vesicles dumping 10,000 neurotransmitter molecules to flood around a dendritic terminal (molecules which will later need to be irreversibly pumped back out), which in turn depolarizes and starts flooding thousands of ions into a cell membrane (to be later pumped out) in order to transmit the impulse at 1m/s? That's the most thermodynamically efficient a physical cognitive system can possibly be? This is approximately the most efficient possible way to turn all those bit erasures into thought?
This sounds like physical nonsense that fails a basic sanity check. What am I missing?
I'm confused at how somebody ends up calculating that a brain - where each synaptic spike is transmitted by ~10,000 neurotransmitter molecules (according to a quick online check), which then get pumped back out of the membrane and taken back up by the synapse; and the impulse is then shepherded along cellular channels via thousands of ions flooding through a membrane to depolarize it and then getting pumped back out using ATP, all of which are thermodynamically irreversible operations individually - could possibly be within three orders of magnitude of max thermodynamic efficiency at 300 Kelvin. I have skimmed "Brain Efficiency" though not checked any numbers, and not seen anything inside it which seems to address this sanity check.
Nobody in the US cared either, three years earlier. That superintelligence will kill everyone on Earth is a truth, and once which has gotten easier and easier to figure out over the years. I have not entirely written off the chance that, especially as the evidence gets more obvious, people on Earth will figure out this true fact and maybe even do something about it and survive. I likewise am not assuming that China is incapable of ever figuring out this thing that is true. If your opinion of Chinese intelligence is lower than mine, you are welcome to say, "Even if this is true and the West figures out that it is true, the CCP could never come to understand it". That could even be true, for all I know, but I do not have present cause to believe it. I definitely don't believe it about everyone in China; if it were true and a lot of people in the West figured it out, I'd expect a lot of individual people in China to see it too.
From a high-level perspective, it is clear that this is just wrong. Part of what human brains are doing is to minimise prediction error with regard to sensory inputs.
I didn't say that GPT's task is harder than any possible perspective on a form of work you could regard a human brain as trying to do; I said that GPT's task is harder than being an actual human; in other words, being an actual human is not enough to solve GPT's task.
If diplomacy failed, but yes, sure. I've previously wished out loud for China to sabotage US AI projects in retaliation for chip export controls, in the hopes that if all the countries sabotage all the other countries' AI projects, maybe Earth as a whole can "uncoordinate" to not build AI even if Earth can't coordinate.
Arbitrary and personal. Given how bad things presently look, over 20% is about the level where I'm like "Yeah okay I will grab for that" and much under 20% is where I'm like "Not okay keep looking."
Choosing to engage with an unscripted unrehearsed off-the-cuff podcast intended to introduce ideas to a lay audience, continues to be a surprising concept to me. To grapple with the intellectual content of my ideas, consider picking one item from "A List of Lethalities" and engaging with that.
The "strongest" foot I could put forwards is my response to "On current AI not being self-improving:", where I'm pretty sure you're just wrong.
You straightforwardly completely misunderstood what I was trying to say on the Bankless podcast: I was saying that GPT-4 does not get smarter each time an instance of it is run in inference mode.
And that's that, I guess.
This is kinda long. If I had time to engage with one part of this as a sample of whether it holds up to a counterresponse, what would be the strongest foot you could put forward?
(I also echo the commenter who's confused about why you'd reply to the obviously simplified presentation from an off-the-cuff podcast rather than the more detailed arguments elsewhere.)
Things are dominated when they forego free money and not just when money gets pumped out of them.
Suppose I describe your attempt to refute the existence of any coherence theorems: You point to a rock, and say that although it's not coherent, it also can't be dominated, because it has no preferences. Is there any sense in which you think you've disproved the existence of coherence theorems, which doesn't consist of pointing to rocks, and various things that are intermediate between agents and rocks in the sense that they lack preferences about various things where you then refuse to say that they're being dominated?
I want you to give me an example of something the agent actually does, under a couple of different sense inputs, given what you say are its preferences, and then I want you to gesture at that and say, "Lo, see how it is incoherent yet not dominated!"
If you think you've got a great capabilities insight, I think you PM me or somebody else you trust and ask if they think it's a big capabilities insight.
In the limit, you take a rock, and say, "See, the complete class theorem doesn't apply to it, because it doesn't have any preferences ordered about anything!" What about your argument is any different from this - where is there a powerful, future-steering thing that isn't viewable as Bayesian and also isn't dominated? Spell it out more concretely: It has preferences ABC, two things aren't ordered, it chooses X and then Y, etc. I can give concrete examples for my views; what exactly is a case in point of anything you're claiming about the Complete Class Theorem's supposed nonapplicability and hence nonexistence of any coherence theorems?
And this avoids the Complete Class Theorem conclusion of dominated strategies, how? Spell it out with a concrete example, maybe? Again, we care about domination, not representability at all.
Say more about behaviors associated with "incomparability"?
The author doesn't seem to realize that there's a difference between representation theorems and coherence theorems.
The Complete Class Theorem says that an agent’s policy of choosing actions conditional on observations is not strictly dominated by some other policy (such that the other policy does better in some set of circumstances and worse in no set of circumstances) if and only if the agent’s policy maximizes expected utility with respect to a probability distribution that assigns positive probability to each possible set of circumstances.
This theorem does refer to dominated strategies. However, the Complete Class Theorem starts off by assuming that the agent’s preferences over actions in sets of circumstances satisfy Completeness and Transitivity. If the agent’s preferences are not complete and transitive, the Complete Class Theorem does not apply. So, the Complete Class Theorem does not imply that agents must be representable as maximizing expected utility if they are to avoid pursuing dominated strategies.
Cool, I'll complete it for you then.
Transitivity: Suppose you prefer A to B, B to C, and C to A. I'll keep having you pay a penny to trade between them in a cycle. You start with C, end with C, and are three pennies poorer. You'd be richer if you didn't do that.
Completeness: Any time you have no comparability between two goods, I'll swap them in whatever direction is most useful for completing money-pump cycles. Since you've got no preference one way or the other, I don't expect you'll be objecting, right?
Combined with the standard Complete Class Theorem, this now produces the existence of at least one coherence theorem. The post's thesis, "There are no coherence theorems", is therefore falsified by presentation of a counterexample. Have a nice day!
I'd consider myself to have easily struck down Chollet's wack ideas about the informal meaning of no-free-lunch theorems, which Scott Aaronson also singled out as wacky. As such, citing him as my technical opposition doesn't seem good-faith; it's putting up a straw opponent without much in the way of argument and what there is I've already stricken down. If you want to cite him as my leading technical opposition, I'm happy enough to point to our exchange and let any sensible reader decide who held the ball there; but I would consider it intellectually dishonest to promote him as my leading opposition.
used a Timeless/Updateless decision theory
Please don't say this with a straight face any more than you'd blame their acts on "Consequentialism" or "Utilitarianism". If I thought they had any actual and correct grasp of logical decision theory, technical or intuitive, I'd let you know. "attributed their acts to their personal version of updateless decision theory", maybe.
This is not a closed community, it is a world-readable Internet forum.
The reasoning seems straightforward to me: If you're wrong, why talk? If you're right, you're accelerating the end.
I can't in general endorse "first do no harm", but it becomes better and better in any specific case the less way there is to help. If you can't save your family, at least don't personally help kill them; it lacks dignity.
I see several large remaining obstacles. On the one hand, I'd expect vast efforts thrown at them by ML to solve them at some point, which, at this point, could easily be next week. On the other hand, if I naively model Earth as containing locally-smart researchers who can solve obstacles, I would expect those obstacles to have been solved by 2020. So I don't know how long they'll take.
(I endorse the reasoning of not listing out obstacles explicitly; if you're wrong, why talk, if you're right, you're not helping. If you can't save your family, at least don't personally contribute to killing them.)
I'm confused by your confusion. This seems much more alignment than capabilities; the capabilities are already published, so why not yay publishing how to break them?
I could be mistaken, but I believe that's roughly how OP said they found it.
Expanding on this now that I've a little more time:
Although I haven't had a chance to perform due diligence on various aspects of this work, or the people doing it, or perform a deep dive comparing this work to the current state of the whole field or the most advanced work on LLM exploitation being done elsewhere,
My current sense is that this work indicates promising people doing promising things, in the sense that they aren't just doing surface-level prompt engineering, but are using technical tools to find internal anomalies that correspond to interesting surface-level anomalies, maybe exploitable ones, and are then following up on the internal technical implications of what they find.
This looks to me like (at least the outer ring of) security mindset; they aren't imagining how things will work well, they are figuring out how to break them and make them do much weirder things than their surface-apparent level of abnormality. We need a lot more people around here figuring out things will break. People who produce interesting new kinds of AI breakages should be cherished and cultivated as a priority higher than a fair number of other priorities.
In the narrow regard in which I'm able to assess this work, I rate it as scoring very high on an aspect that should relate to receiving future funding. If anyone else knows of a reason not to fund the researchers who did this, like a low score along some metric I didn't examine, or because this is somehow less impressive as a feat of anomaly-finding than it looks, please contact me including via email or LW direct message; as otherwise I might run around scurrying trying to arrange funding for this if it's not otherwise funded.
I strongly approve of this work.
Opinion of God. Unless people are being really silly, when the text genuinely holds open more than one option and makes sense either way, I think the author legit doesn't get to decide.