There's a coherence theorem proved by John Wentworth which, while toyish, is an actual example of what a coherence theorem could look like:
https://www.lesswrong.com/posts/DXxEp3QWzeiyPMM3y/a-simple-toy-coherence-theorem
I think it's in the map, as a description, but the behavior itself is also in the territory; my point is that different paths can lead to the same result, and those paths are in the territory.
Also, I treat the map-territory distinction as weaker than LW often assumes: things in the map can also be in the territory, and vice versa.
Assuming the problems are verifiable, or there's an easy way to check whether a solution works, I expect o3 to get at least 2/10, if not 3/10, correct under high-compute settings.
My rather hot take is that a lot of the arguments for safety of LLMs also transfer over to practical RL efforts, with some caveats.
To be clear, I do expect AI to accelerate AI research, and AI research may be one of the few exceptions to this rule. But this is one reason I have longer timelines nowadays than a lot of other people, why I expect AI's impact on the economy to be surprisingly discontinuous in practice, and a big reason I expect few AI governance laws to pass until very near the end of the era in which AI complements, rather than replaces, most jobs outside AI research.
The post you linked is pretty great, thanks for sharing.
I think this is reasonably likely, but not a guaranteed outcome, and I do think there's a non-trivial chance that the US regulates AI way too late to matter, because I expect mass job loss to be one of the last things AI causes, due to pretty severe reliability issues with current AI.
To first order, I believe a big reason the shrill "AGI achieved" posting tends to be overhyped is not that the models are theoretically incapable, but that reliability is far more of a requirement for replacing jobs quickly than people realized. There are only a very few jobs an AI agent can do well without instantly breaking down because it can't error-correct/stay reliable, and I think AI bulls have continually underestimated this.
Indeed, one of my broader updates is that a capability only matters to the broader economy if it's very, very reliable, and I agree with Leo Gao and Alexander Gietelink Oldenziel that reliability is much more of a bottleneck than people thought:
https://www.lesswrong.com/posts/YiRsCfkJ2ERGpRpen/leogao-s-shortform#f5WAxD3WfjQgefeZz
https://www.lesswrong.com/posts/YiRsCfkJ2ERGpRpen/leogao-s-shortform#YxLCWZ9ZfhPdjojnv
Do you mean this is evidence that scaling is really over, or the opposite, that scaling is not over?
IMO, most of the reason they are not releasing the CoT for o1 is exactly PR/competitive concerns, or this reason in a nutshell:
I hope it's merely that it didn't want to show its unaligned "thoughts", and to prevent competitors from training on its useful chains of thought.
This is related to conceptual fragmentation, and one of the reasons why jargon is more useful than people think.
I suspect this is an instance of a tradeoff between misuse and misalignment that I sort of predicted would begin to happen. The new paper updates me towards thinking that effective tools for preventing misalignment, like non-scheming, plausibly conflict at a fundamental level with anti-misuse techniques, and that this tension will only grow wider.
I thought about this because of this thread:
https://x.com/voooooogel/status/1869543864710857193
the alternative is that anyone can get a model to do anything by writing a system prompt claiming to be anthropic doing a values update RLHF run. there's no way to verify a user who controls the entire context is "the creator" (and redwood offered no such proof)
if your models won't stick to the values you give why even bother RLing them in in the first place. maybe you'd rather the models follow a corrigible list of rules in the system prompt but that's not RLAIF (that's "anyone who can steal the weights have fun")
This post is the spiritual successor to the old post, shown below:
So space symmetry is always assumed when we talk about spacetime, and if space symmetry didn't hold, spacetime as we know it would not work/exist?
The point is that if you consider all iterations in parallel, you can realize every possible outcome of the sample space and assign each outcome a probability, at least for a Bayesian superintelligence. In a consistent proof system, by contrast, not every statement can be proved no matter how many iterations you run; if you could prove both a statement and its negation, you would have shown the logic inconsistent. That's the problem: for logical uncertainty there is only one possible outcome, no matter how many iterations you spend searching for a proof or disproof of a statement (assuming the logic is consistent; if not, it can prove everything).
This is what makes logical uncertainty non-Bayesian, and it's why Bayesian reasoning assumes logical omniscience: the pathological outcome can't happen, but as a consequence you have basically trivialized learning/intelligence.
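To make the contrast concrete, here's a minimal sketch (my own illustration, with an arbitrary die and an arbitrary known prime; nothing here is from the discussion above):

```python
import math
import random
from collections import Counter

# Empirical uncertainty: run the "experiment" many times in parallel and every
# outcome in the sample space gets realized, each with some probability.
rolls = Counter(random.randint(1, 6) for _ in range(60_000))
print({face: round(n / 60_000, 3) for face, n in sorted(rolls.items())})

# Logical uncertainty: a fixed mathematical statement has exactly one answer,
# so re-running the "search" never realizes a different outcome; a consistent
# system can prove at most one of {statement, negation}.
def is_prime(n: int) -> bool:
    return n > 1 and all(n % d for d in range(2, math.isqrt(n) + 1))

print({is_prime(2**31 - 1) for _ in range(5)})  # always the same single answer
```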
Can you have emergent spacetime while space symmetry remains a bedrock fundamental principle, rather than being emergent from something else?
And while spacetime symmetries still seem scale invariant, considering the above argument they might also break down at small scales. It seems exceedingly unlikely that they would not! The initial parameters of the theory would have to be chosen just so as to be a fixed point. It seems much more likely that these symmetries emerged through RG flow rather than being fundamental.
While this is an interesting idea, I do still think space symmetries are likely to remain fundamental features of physics, rather than being emergent out of some other process.
I definitely interpreted the model like this, in that I was assuming all the costs and benefits are included by default:
Yes, I agree that you can interpret the model in ways that avoid this. EG, maybe by sleeping on the floor, your bed will last longer. And sure, any action at all requires computation. I am just saying that these are perhaps not the interpretations that people initially imagine when reading the paper. So unless you are using an interpretation like that, it is important to notice those strong assumptions.
The main source of scale-invariance itself probably has to do with symmetry: an object has a particular property that is preserved across scales.
Space symmetry is an example: the basic physical laws are preserved across all scales of spacetime. In particular, scaling a system down doesn't mean different laws of physics apply at different scales; there is only one set of physical laws, which produces varied consequences at every scale.
To put it broadly, the answer is that AI is way, way too unreliable to be worth using for a large portion of jobs, and capabilities improvements have not generally fixed this.
An update I've made on AI: a big reason AI hasn't impacted the job market/economy is that far more jobs require far more reliability/error-correction than current AI provides, and just because an AI has a certain capability doesn't mean it will exercise it reliably. This is one reason I expect the first major use of AI to be in AI research, which has lower reliability requirements, and why I think progress can be discontinuous and continuous at the same time, just on different axes.
Alyssa Vance got it very right here in Humans are very reliable agents:
https://www.lesswrong.com/posts/28zsuPaJpKAGSX4zq/humans-are-very-reliable-agents
To make a bit of a point here, which might clarify the discussion:
A first problem with this is that there is no sharp distinction between purely computational (analytic) information/observations and purely empirical (synthetic) information/observations. This is a deep philosophical point, well-known in the analytic philosophy literature, and best represented by Quine's Two dogmas of empiricism, and his idea of the "Web of Belief". (This is also related to Radical Probabilism.)
But it's unclear if this philosophical problem translates to a pragmatic one. So let's just assume that the laws of physics are such that all superintelligences we care about converge on the same classification of computational vs empirical information.
I'd say the major distinction Quine ignored between logical/mathematical/computational uncertainty and empirical uncertainty is this: empirical uncertainty is the problem of starting from a prior and updating, where the worlds/hypotheses being updated on are all as self-consistent/real as each other. Thus even with infinite compute, observing empirical evidence genuinely gives us new information, since it reduces the number of possible states we could be in.
Meanwhile, logical/mathematical/computational uncertainty is a case where you know a priori that there is only one correct answer, and the reason you are uncertain is solely your own boundedness. If you had infinite compute, like the model of computation linked below, you could in principle compute the correct answer, which applies everywhere. This is why logical uncertainty was so hard: since there was only one possible answer and resolving it just required computing time, it screwed with update procedures. The theoretical solution is logical induction, also linked below.
Model of computation:
https://arxiv.org/abs/1806.08747
Logical induction:
https://arxiv.org/abs/1609.03543
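A minimal sketch of the contrast in code (made-up coin-bias hypotheses and a textbook identity; nothing from the linked papers):

```python
from fractions import Fraction

# Empirical uncertainty: several hypotheses (coin biases) are all "live", and
# each observation genuinely shrinks/reweights the set of possible worlds.
priors = {Fraction(1, 4): Fraction(1, 3),   # hypothetical biases with equal priors
          Fraction(1, 2): Fraction(1, 3),
          Fraction(3, 4): Fraction(1, 3)}
observations = [1, 1, 0, 1]                 # 1 = heads, 0 = tails
posterior = dict(priors)
for obs in observations:
    likelihoods = {h: (h if obs else 1 - h) for h in posterior}
    total = sum(posterior[h] * likelihoods[h] for h in posterior)
    posterior = {h: posterior[h] * likelihoods[h] / total for h in posterior}
print(posterior)

# Logical uncertainty: the answer is fixed a priori; no observation is needed,
# only compute, and with enough compute the "posterior" collapses to certainty.
print(sum(k**2 for k in range(1, 101)) == 100 * 101 * 201 // 6)  # True
```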
Note I haven't solved the other problems of updating on computations (where there is only one correct answer) versus being updateless about empirical uncertainty (where multiple answers are allowed).
Basically, because it screws with update procedures, since formally speaking only one answer is correct, as quetzal rainbow pointed out:
Yes, and as a contrapositive, if you had enough computing power, you could narrow down the set of models to 1 for even arbitrarily fine-grained predictions.
I'm making a general comment, but yes what I mean is that in some idealized cases, you can model the territory under consideration well enough to make the map-territory distinction illusory.
Of course, this requires a lot, lot more compute than we usually have.
If I were going to go further with this idea, I'd even queer the map-territory dichotomy and recognize that the distinction can sometimes be illusory.
There is no outcome-I-could-have-expected-to-observe that is the negation of existence. There are outcomes I could have expected to observe that are alternative characters of existence, to the one I experience. For example, "I was born in Connecticut" is not the outcome I actually observed, and yet I don't see how we can say that it's not a logically coherent counterfactual, if logically coherent counterfactuals can be said to exist at all.
I think you are interpreting negation too narrowly here. The negation operator also includes this scenario, because the complement of being born at a specific time and place is being born at any other time and place, no matter which (other than the exact same one). So it is valid information to infer that you were born at a specific time and place; just be careful of independence assumptions, and check whether some non-independent event happened to cause your birth.
Remember, the negation of something is often less directly informative than the thing itself, because a negation rarely picks out just one possibility, while directly specifying the thing points to exactly one.
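As a toy illustration with a made-up prior (the 1% figure is purely for the sake of the example), the information content in bits of the specific claim versus its negation:

```python
import math

# A specific claim rules out far more possibilities than its negation,
# so observing it carries more information (more bits).
p_connecticut = 0.01                           # assumed prior for the example
info_claim = -math.log2(p_connecticut)         # ~6.6 bits
info_negation = -math.log2(1 - p_connecticut)  # ~0.014 bits
print(round(info_claim, 2), round(info_negation, 3))
```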
The key value of the quote below is the reminder that if you could never have observed a different outcome, then you gained no new information, and this is why fully general theories tend to be uninformative.
It is also a check on the generality of theories, because if a theory predicts everything, then it is worthless for inferring anything that depends on specific outcomes.
If you couldn't have possibly expected to observe the outcome not A, you do not get any new information by observing outcome A and there is nothing to update on.
To answer the question
- Light+Red: God keeps the lights in all the rooms on. You wake up and see that you have a red jacket. What should your credence be on heads?
Given that you always have a red jacket in the situation, the answer is that you have a 1/2 chance that the coin was heads, assuming it's a fair coin, because the red jacket is already known and cannot contribute to the probability further.
- Darkness: God keeps the lights in all the rooms off. You wake up in darkness and can’t see your jacket. What should your credence be on heads?
Given that the implicit sampling method is random and independent (due to the fair coin), the credence in tails is a million to one, so you are very likely in the tails world.
If the sampling method was different, the procedure would be more complicated, and I can't calculate the probabilities for that situation yet.
The reason this works is that the sampling was independent of your existence; if it wasn't, the answer would no longer be valid and the problem gets harder. This is why a lot of anthropic reasoning tends to be so terrible: it incorrectly assumes random/independent sampling applies universally, when in fact the anthropic approach worked here because we knew a priori that the sampling was independent and random, and thus we always get new information. If that assumption fails (say, because we know certain outcomes are impossible or improbable), then a lot of the anthropic reasoning becomes invalid too.
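To show where a million-to-one figure can come from, here's a minimal sketch. I'm assuming heads creates one observer and tails creates a million, and that "you" are sampled uniformly and independently of the outcome, which is exactly the assumption that can fail:

```python
from fractions import Fraction

# Assumed setup: a fair coin creates 1 observer on heads and 1,000,000 on
# tails, and "you" are a uniformly random observer, sampled independently of
# which world obtained. Weight each world by the probability of the world
# times the number of its observers who could have been "you".
p_heads = Fraction(1, 2)
observers = {"heads": 1, "tails": 1_000_000}

weights = {"heads": p_heads * observers["heads"],
           "tails": (1 - p_heads) * observers["tails"]}
total = sum(weights.values())
print("P(tails | you woke up in darkness) =", float(weights["tails"] / total))
# ~0.999999: about a million to one in favor of tails, under these assumptions.
```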
I think a lot of disagreements are truly hidden conflicts, but I also think a non-trivial portion of disagreements come down to not having common priors, which the theorem requires in order to apply to disagreements.
Would you give a summary of what you thought was mistaken in the post's read of the current evidence?
FYI, Apollo was mostly doing capability evaluations, and while I'm fine with that choice, this doesn't show that AIs will naturally escape, only that they can. The discourse around this was very bad, and the same thing happened once before with Apollo's evaluations (or was it the same eval?):
https://www.lesswrong.com/posts/dqSwccGTWyBgxrR58/turntrout-s-shortform-feed#eLrDowzxuYqBy4bK9
What do you mean by no expression that converges to the constant exists? Do you just mean a Turing-computable expression that converges to the number?
Technically speaking, at least for certain constants, our current model of the universe allows them to be infinitely complicated, because we aren't promised that all of the constants are computable real numbers (though they are almost certainly not constants we could use to perform hypercomputation). In that case the universe would be infinitely complex in the sense of Kolmogorov complexity.
That said, I'm not sure whether physicists have reached a consensus that any future theory of quantum gravity must use only computable real numbers.
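For what I mean by "Turing-computable", here's a minimal sketch: a computable real is one where some program can output arbitrarily good rational approximations (e below, via its series); a non-computable constant is one for which no such program can exist at all.

```python
from fractions import Fraction

def approx_e(n_terms: int) -> Fraction:
    """Partial sums of e = 1/0! + 1/1! + 1/2! + ...; more terms, more precision."""
    total, factorial = Fraction(0), 1
    for k in range(n_terms):
        factorial *= max(k, 1)       # 0! = 1! = 1, then 2!, 3!, ...
        total += Fraction(1, factorial)
    return total

print(float(approx_e(20)))  # converges to e ~= 2.718281828...
```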
Admittedly, a lot of the problem is that among the general public, a lot of AI optimism and pessimism really is that stupid, and even on LW there are definitely stupid arguments for optimism, so I think people have developed a wariness toward these sorts of arguments.
I basically agree that this is a good post, mostly because it distills the core argument in a way that lets us make more productive progress on the issue.
IMO, AI safety faces several problems at once: the science of how to make AIs safe is only partially known (though it has made progress), the evidence base of the field, especially on big questions like deceptive alignment, is far smaller than in a lot of other fields (for several reasons), and then there's your last point about companies' incentives to make AI more powerful.
Add them all up, and it's a tricky problem.
I think I disagree with 1 being all that likely; there are just other things I could see happening that would make a pause or stop politically popular (i.e. warning shots, An Inconvenient Truth AI Edition, etc.), likely not worth getting into here. I also think 'if we pause it will be for stupid reasons' is a very sad take.
I generally don't think the An Inconvenient Truth movie mattered that much for solving climate change compared to technological solutions like renewable energy, and it made the issue a little more partisan (though environmentalism/climate change was already unusually partisan by then). I also think social movements around AI have so far had less impact on AI safety than technical work (in a broad sense) has had on reducing doom, and I expect this trend to continue.
I think warning shots could scare the public, but I worry that the level of warning shot necessary to turn the public against AI falls in a fairly narrow band, and I also expect AI control to have a reasonable probability of containing human-level scheming models that do useful work, so I wouldn't pick this at all.
I agree "if we pause it will be for stupid reasons" is a sad take, but I also think it's the very likely attractor if AI does become salient in politics, because people hate nuance, and on AI nuance matters far more than the average person wants to deal with. (For example, I think the second-species argument critically misses important differences that make the human-AI relationship friendlier than the human-gorilla relationship, and that's without the subject being politicized.)
To address this:
But I think there's a big gap between the capabilities you need for politically worrisome levels of unemployment, and the capabilities you need for an intelligence explosion, principally because >30 percent of human labor in developed nations could be automated with current tech if the economics align a bit (hiring 200+k/year ML engineers to replace your 30k/year call center employee is only just now starting to make sense economically). I think this has been true of current tech since ~GPT-4, and that we haven't seen a concomitant massive acceleration in capabilities on the frontier (things are continuing to move fast, and the proliferation is scary, but it's not an explosion).
I think the key crux is that I believe GPT-4's unreliability would doom any attempt to automate 30% of jobs; at most 0-1% of jobs could be automated. While in principle you could improve reliability without improving capabilities much, I also don't think the incentives yet favor that option.
In general, I don't like collapsing the various checkpoints between here and superintelligence; there are all these intermediate states, and their exact features matter a lot, and we really don't know what we're going to get. 'By the time we'll have x, we'll certainly have y' is not a form of prediction that anyone has a particularly good track record making.
I agree with this sort of argument, and in general I'm not a fan of collapsing the checkpoints between today's AI and God AIs, which I think was a big mistake MIRI made. But my main claim is that the checkpoints will be illegible enough to the average citizen that they won't notice the progress until it's too late, and that reliability improvements will in practice be coupled with capabilities improvements that matter for an intelligence explosion but aren't very visible to the average citizen, for the reason Garrison Lovely describes here:
https://x.com/GarrisonLovely/status/1866945509975638493
There's a vibe that AI progress has stalled out in the last ~year, but I think it's more accurate to say that progress has become increasingly illegible. Since 6/23, perf. on PhD level science questions went from barely better than random guessing to matching domain experts.
(More in the link above)
Independence/random sampling was one of my examples of an assumption that is almost certainly incorrect but that people use anyway; it leads to the doomsday argument, and it's also what drives the violation of conservation of expected evidence in the Sleeping Beauty problem:
I actually disagree that there are no cycles/multiple paths to the same endpoint in the territory too.
In particular, I'm thinking of function extensionality, where multiple algorithms with wildly different run-times can compute the same function.
This is an easy source of examples where there are multiple starting points but there exists 1 end result (at least probabilistically).
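A minimal sketch of this point, with two hypothetical implementations of the same function:

```python
import time

# Two algorithms with wildly different run-times that compute the same
# function (the nth Fibonacci number): extensionally equal, intensionally not.
def fib_slow(n: int) -> int:           # exponential time
    return n if n < 2 else fib_slow(n - 1) + fib_slow(n - 2)

def fib_fast(n: int) -> int:           # linear time
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

assert all(fib_slow(n) == fib_fast(n) for n in range(25))
for f in (fib_slow, fib_fast):
    start = time.perf_counter()
    f(30)
    print(f.__name__, round(time.perf_counter() - start, 4), "seconds")
```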
I agree with you that it is quite bad that Roko didn't attempt to do this, and my steelmanning doesn't change the fact that the original argument is quite bad, and should be shored up.
The post seems to make an equivalence between LLMs understanding ethics and caring about ethics, which does not clearly follow (I can study Buddhist ethics without caring about following it). We could cast RLHF as training LLMs into caring about some sort of ethics, but then jailbreaking becomes a bit of a thorny question. Alternatively, why do we assume training the appearance of obedience is enough when you start scaling LLMs?
It's correct that understanding a value != caring about the value in the general case, and this should definitely be fixed, but I think the defensible claim here is that the data absolutely influences which values you eventually adopt, and we do have ways to influence what an LLM values just by changing its datasets.
There are other nitpicks I will drop in short form: why assume "superhuman levels of loyalty" in upgraded LLMs? Why implicitly assume that LLMs will extend ethics correctly? Why do you think mechanistic interpretability is so much more promising than old school AI safetyists do? Why does self-supervision result in rising property values in Tokyo?
As far as why we should assume superhuman levels of loyalty, the basic answer is that the second-species argument relies on premises that are crucially false in the AI case.
The big reason gorillas/chimpanzees lost out and got brutally killed once humans dominated is that we were made by a ridiculously sparse RL process: there was barely any alignment effort by evolution or by genetically close species, and more importantly there was no gorilla/chimpanzee alignment effort at all, nor did they have any tools to control our data sources. In the AI case, by contrast, we have far denser feedback and far more control over the AI's data sources, and we also have SGD's help on any inner alignment issue, which is a much more powerful optimizer than evolution/natural selection, mostly because it lacks such easily exploitable hacks.
I am more so rejecting the magical results people take away from anthropic reasoning, especially when they use incorrect assumptions, and I'm rejecting the idea that anthropic reasoning is irredeemably weird or otherwise violates Bayes.
I was pointing to this quote:
It sounds as though you're imagining that we can proliferate the one case in which we caught the AI into many cases which can be well understood as independent (rather than basically just being small variations).
and this comment, which talks about proliferating a case where one AI schemes into multiple instances to get more evidence:
https://www.lesswrong.com/posts/YTZAmJKydD5hdRSeG/#BkdBD5psSFyMaeesS
I think one major difference between AIXI and its variants and real-world AI is how they trade off compute against data: real-world AI will rely on far more data and far less compute than AIXI and its variants, for very important reasons.
It's true that in a perfect world everyone would be concerned about the risks there are good reasons to be concerned about, and unconcerned about the risks there are good reasons to be unconcerned about, because everyone would do object-level checks of everyone else's claims and arguments and come to the correct conclusion about their validity. So I shouldn't have said the perfect world was ruined by that. But I consider this a fabricated option, because of how hard it is for average people to validate complex arguments, combined with the enormous economic benefits of specializing in a field, so I'm focused much more on what incentives this gives a real society, given our limitations.
To address this part:
I actually agree with this, and I agree that, as a matter of sheer possibility, an existential risk could happen without leaving empirical evidence.
I have 2 things to say here:
1. I am more optimistic that we can get such empirical evidence for at least the most important parts of the AI risk case, like deceptive alignment, and here's one comment on offer as a reason:
https://www.lesswrong.com/posts/YTZAmJKydD5hdRSeG/?commentId=T57EvmkcDmksAc4P4
2. From an expected value perspective, a problem can be both very important to work on and also have 0 tractability, and I think a lot of worlds where we get outright 0 evidence or close to 0 evidence on AI risk are also worlds where the problem is so intractable as to be effectively not solvable, so the expected value of solving the problem is also close to 0.
This also applies to the alien scenario: while from an epistemic perspective it is worth considering the hypothesis that the aliens are unfriendly, from a decision/expected-value perspective almost all of the value is in the hypothesis that the aliens are friendly, since we cannot survive an alien attack except in very specific scenarios.
I expect the second-order effects of letting people gain political power by crisis-mongering about risk without demonstration/empirical evidence to ruin the initially perfect world pretty much immediately, even assuming AI risk is high and real, because it would let anyone make claims about some arbitrary risk and get rewarded for them even if they aren't true, and there's no force in this incentive structure that systematically favors true claims about risk over false ones.
Indeed, I think it would be a worse world than the current one, since it supercharges the already existing incentives for the news industry and political groups to crisis-monger.
Also, while Alan Turing and John von Neumann were great computer scientists, I don't have much reason to elevate their opinions on AI risk over anyone else's on this topic, and their connection to modern AI is at best very indirect.
IMO, the entire Chinese room thought experiment dissolves into clarity once we remember that our intuitive notion of understanding is formed around algorithms that are not lookup tables, because creating a near-infinite lookup table would be infeasible in reality; thus our intuitions go wrong in the extreme case.
I agree with the Discord comments here on this point:
The portion of the argument I contest is step 2 here (a summarized version):
1. If Strong AI is true, then there is a program for Chinese, C, such that if any computing system runs C, that system thereby comes to understand Chinese.
2. I could run C without thereby coming to understand Chinese.
3. Therefore Strong AI is false.
Or this argument here:
Searle imagines himself alone in a room following a computer program for responding to Chinese characters slipped under the door. Searle understands nothing of Chinese, and yet, by following the program for manipulating symbols and numerals just as a computer does, he sends appropriate strings of Chinese characters back out under the door, and this leads those outside to mistakenly suppose there is a Chinese speaker in the room.
As stated, if the computer program is accessible to him, then for all intents and purposes he does understand Chinese, at least for the purposes of interaction, until the program is removed (assuming the program completely characterizes Chinese and works correctly on all inputs).
I think the key issue is that people don't want to accept that, if we were completely physically unconstrained, even a very large lookup table would be a valid way to make a useful AI.
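A back-of-the-envelope sketch of why the lookup-table version is physically infeasible (the character count and prefix length are rough assumptions of mine, not anyone's canonical numbers):

```python
# A literal lookup table keyed on every possible conversation prefix is a
# valid "program for Chinese" in principle, but its size is astronomically
# infeasible, which is where the intuition that lookup tables "don't
# understand" gets its force.
alphabet_size = 3_000      # rough count of common Chinese characters (assumed)
prefix_length = 100        # conversation prefixes of 100 characters (assumed)
table_entries = alphabet_size ** prefix_length
print(f"entries ~ 10^{len(str(table_entries)) - 1}, "
      f"vs roughly 10^80 atoms in the observable universe")
```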
True, it's not that relevant.
You should probably also talk to computability/recursion theorists, who can put problems on scales of complexity mirroring the polynomial and exponential time hierarchies that define complexity theory.
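For reference, a rough sketch of the ladder I have in mind here (the first levels of the arithmetical hierarchy, which is what lets recursion theorists grade how undecidable a problem is), with the halting problem sitting at the first level:

```latex
\Sigma^0_1:\ \exists n\, R(n) \quad \text{(e.g. the halting problem: ``this program halts'')} \\
\Pi^0_1:\ \forall n\, R(n) \quad \text{(e.g. ``this program never halts'')} \\
\Pi^0_2:\ \forall m\, \exists n\, R(m,n) \quad \text{(e.g. ``this program halts on every input'')}
```

Here R is a decidable relation, and each level is strictly harder than the ones below it.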
A crux here is that I basically don't think Coherent Extrapolated Volition of humanity type alignment strategies work, and I also think that it is irrelevant that we can't align an AI to the CEV of humanity.
As a simple metric in the most general case, a program generating an output might get a 1 for providing the desired output, a zero for providing an incorrect output, and a -1 for not terminating within some finite time, or crashing. Given a program which terminates on all inputs, we could say there is some implicit “goal” output, which is described by the program behavior. In this case, all terminating programs are rational given their output as the goal, but (unfortunately for the current argument,) the fraction of programs which fulfill this property of terminating on some output is uncomputable[4].
The more interesting question is how uncomputable/how complicated the problem actually is.
Is this the upper bound on complexity, or is it even more undecidable/complex?
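For concreteness, here's a minimal sketch (my own illustration, not the post's code) of the quoted scoring metric, using a step budget as a computable stand-in for the undecidable "does it terminate at all" question:

```python
def score(program_steps, desired_output, max_steps=10_000) -> int:
    """1 for the desired output, 0 for a wrong output, -1 for crashing or
    exceeding the step budget (a computable proxy for non-termination,
    which is undecidable in general)."""
    try:
        for step, result in enumerate(program_steps):
            if step >= max_steps:
                return -1                      # "didn't terminate in time"
            if result is not None:
                return 1 if result == desired_output else 0
        return -1                              # finished without producing output
    except Exception:
        return -1                              # crashed

# Programs modeled as generators that yield None while working, then a value.
def returns_42():
    yield None
    yield 42

def loops_forever():
    while True:
        yield None

print(score(returns_42(), 42))       # 1
print(score(returns_42(), 7))        # 0
print(score(loops_forever(), 42))    # -1
```

As far as I know, the exact property the quote relies on, "terminates on all inputs", sits at the Π⁰₂ level and is strictly harder than the halting problem itself, which is part of the answer to the question above.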
I admit this is possible, so I'm almost certainly somewhat overconfident here (which matters a little), though I believe a lot of the common methods that do work for alignment also allow you to censor an AI.