I don't understand. You shouldn't get any changes from changing encoding if it produces the same proteins - the difference for mirror life is that it would also mirror proteins, etc.
Plausibly, yes. But so does programming capability, which is actually a bigger deal. (And it's unclear that a traditionally envisioned intelligence explosion is possible with systems built on LLMs, though I'm certainly not convinced by that argument.)
But "creating safety guarantees into which you can plug in AI capabilities once they arise anyway" is the point, and it requires at least some non-trivial advances in AI capabilities.
You should probably read the current programme thesis.
It is speculative in the sense that any new technology being developed is speculative - but closely related approaches are already used for assurance in practice, so provable safety isn't actually just speculative, there are concrete benefits in the near term. And I would challenge you to name a different and less speculative framework that actually deals with any issues of ASI risks that isn't pure hopium.
Uncharitably, but I think not entirely inaccurately, these include: "maybe AI can't be that much smarter than humans anyways," "let's get everyone to stop forever," "we'll use AI to figure it out, even though we have no real ideas," "we just will trust that no-one makes it agentic," "the agents will be able to be supervised by other AI which will magically be easier to align," "maybe multiple AIs will compete in ways that aren't a disaster," "maybe we can just rely on prosaic approaches forever and nothing bad happens," "maybe it will be better than humans at having massive amounts of unchecked power by default." These all certainly seem to rely far more on speculative claims, with far fewer concrete ideas about how to validate or ensure them.
It is critical for guaranteed safe AI and many non-prosaic alignment agendas. I agree it has risks, since all AI capabilities and advances pose control risks, but it seems better than most types of general capabilities investments.
Do you have a more specific model of why it might be negative?
I don't think it was betrayal, I think it was skipping verbal steps, which left intent unclear.
If A had said "I promised to do X, is it OK now if I do Y instead?" there would presumably have been no confusion. Instead, they announced their plan before doing Y, leaving the permission request implicit. The point that "she needed A to acknowledge that he’d unilaterally changed an agreement" was critical to B, but I suspect A thought that stating the new plan did that implicitly.
Strongly agree that there needs to be an institutional home. My biggest problem is that there is still no such new home!
You should also read the relevant sequence about dissolving the problem of free will: https://www.lesswrong.com/s/p3TndjYbdYaiWwm9x
You believe that something inert cannot be doing computation. I agree. But you seem to think it's coherent that a system with no action - a post-hoc mapping of states - can be.
The place where comprehension of Chinese exists in the "Chinese room" is the creation of the mapping - the mapping itself is a static object, and the person in the room by assumption is doing no cognitive work, just looking up entries. "But wait!" we can object, "this means that the Chinese room doesn't understand Chinese!" And I think that's the point of confusion - repeating someone else telling you answers isn't the same as understanding. The fact that the "someone else" wrote down the answers changes nothing. The question is where and when the computation occurred.
In our scenarios, there are a couple different computations - but the creation of the mapping unfairly sneaks in the conclusion that the execution of the computation, which is required to build the mapping, isn't what creates consciousness!
Good point. The problem I have with that is that in every listed example, the mapping either requires the execution of the conscious mind and a readout of its output and process in order to build it, or it stipulates that it is well enough understood that it can be mapped to an arbitrary process, thereby implicitly also requiring that it was run elsewhere.
That seems like a reasonable idea. It seems not at all related to what any of the philosophers proposed.
For their proposals, it seems like the computational process is more like:
1. Extract a specific string of 1s and zeros from the sandstorm's initial position, and another from its final position, with the same length as the length of the full description of the mind.
2. Calculate the bitwise sum of the initial mind state and the initial sand position.
3. Calculate the bitwise sum of the final mind state and the final sand position.
4. Take the output of step 2 and replace it with the output of step 3.
5. Declare that the sandstorm is doing something isomorphic to what the mind did. Ignore the fact that the internal process is completely unrelated, and all of the computation was done inside of the mind, and you're just copying answers.
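The procedure above can be sketched in a few lines. This is a toy illustration, not anything the philosophers wrote down: it assumes "bitwise sum" means XOR, uses short byte strings as stand-ins for mind states, and every name in it (`build_mapping`, `interpret`, the example strings) is hypothetical. What it shows is that the masks cannot be built without already having the mind's final state, which is exactly the part that required running the mind.

```python
import secrets


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bitwise sum (XOR) of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("states must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def build_mapping(mind_initial: bytes, mind_final: bytes,
                  sand_initial: bytes, sand_final: bytes):
    """Steps 2 and 3: one mask per sand state.

    Note that mind_final is an input here. Someone had to run the mind
    to obtain it before the mapping could exist.
    """
    mask_initial = xor_bytes(mind_initial, sand_initial)
    mask_final = xor_bytes(mind_final, sand_final)
    return mask_initial, mask_final


def interpret(sand_state: bytes, mask: bytes) -> bytes:
    """Steps 4 and 5: 'read' a mind state off a sand state."""
    return xor_bytes(sand_state, mask)


# Hypothetical stand-ins for the mind's states (same length, as in step 1).
mind_initial = b"question"
mind_final = b"answer!!"

# The sand states are arbitrary; random bytes work as well as anything.
sand_initial = secrets.token_bytes(len(mind_initial))
sand_final = secrets.token_bytes(len(mind_final))

mask_initial, mask_final = build_mapping(
    mind_initial, mind_final, sand_initial, sand_final
)

# The "isomorphism" holds for any sand states at all, which is the problem.
print(interpret(sand_initial, mask_initial) == mind_initial)
print(interpret(sand_final, mask_final) == mind_final)
```

Because the check succeeds for any random sand states, the sand contributes nothing; all of the information about the answer sits in the masks.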
I agree that's a more interesting question, and computational complexity theorists have done work on it which I don't fully understand, but it also doesn't seem as relevant for AI safety questions.
Regarding Chess agents, Vanessa pointed out that while only perfect play is optimal, informally we would consider agents to have an objective that is better served by slightly better play, for example, an agent rated 2500 ELO is better than one rated 1800, which is better than one rated 1000, etc. That means that lots of "chess minds" which are non-optimal are still somewhat rational at their goal.
I think that it's very likely that even according to this looser definition, almost all chess moves, and therefore almost all "possible" chess bots, fail to do much to accomplish the goal.
We could check this informally by evaluating what proportion of the set of possible moves in normal games would be classified as blunders, using a method such as the one used here to evaluate what proportion of actual moves made by players are blunders. Figure 1 there implies that in positions with many legal moves, a larger proportion are blunders - but this is looking at the empirical blunder rate by those good enough to be playing ranked chess. Another method would be to look at a bot that actually implements "pick a random legal move" - namely Brutus RND. It has an ELO of 255 when ranked against other amateur chess bots, and wins only occasionally against some of the worst bots; it seems hard to figure out from that what proportion of moves are good, but it's evidently a fairly small proportion.
We earlier mentioned that it is required that the finite mapping be precomputed. If it is for arbitrary Turing machines, including those that don't halt, we need infinite time, so the claim that we can map to arbitrary Turing machines fails. If we restrict it to those which halt, we need to check that before providing the map, which requires solving the halting problem to provide the map.
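A minimal sketch of why the precomputation is the sticking point, under simplifying assumptions: the "machine" here is just a hypothetical step function on integers rather than a full Turing machine, and `precompute_trace`, `countdown`, and `loop_forever` are illustrative names. Building the state table means running the machine, and with a finite step budget we cannot distinguish "has not halted yet" from "never halts".

```python
from typing import Callable, List, Optional

# A toy machine: maps a state to the next state, or None once it has halted.
Step = Callable[[int], Optional[int]]


def precompute_trace(step: Step, start: int, max_steps: int) -> Optional[List[int]]:
    """Run the machine to build the table of states the mapping needs.

    Returns the full trace if the machine halts within max_steps,
    otherwise None: no finite table is available yet, and we cannot
    tell whether one ever will be.
    """
    trace = [start]
    state = start
    for _ in range(max_steps):
        nxt = step(state)
        if nxt is None:
            return trace
        trace.append(nxt)
        state = nxt
    return None


def countdown(n: int) -> Optional[int]:
    """Halts when it reaches zero."""
    return None if n == 0 else n - 1


def loop_forever(n: int) -> Optional[int]:
    """Never halts."""
    return n + 1


print(precompute_trace(countdown, 3, max_steps=100))     # [3, 2, 1, 0]
print(precompute_trace(loop_forever, 0, max_steps=100))  # None
```

Only once a trace exists can its entries be paired with consecutive integers (index `i` maps to `trace[i]`), so the table is downstream of the computation rather than a substitute for it.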
Edit to add: I'm confused why this is getting "disagree" votes - can someone explain why or how this is an incorrect logical step, or
OK, so this is helpful, but if I understood you correctly, I think it's assuming too much about the setup. For #1, in the examples we're discussing, the states of the object aren't predictably changing in complex ways - just that it will change "states" in ways that can be predicted to follow a specific path, which can be mapped to some set of states. The states are arbitrary, and per the argument don't vary in some way that does any work - and so as I argued, they can be mapped to some set of consecutive integers. But this means that the actions of the physical object are predetermined in the mapping.
And the difference between that situation and the CNS is that we know the neural circuitry is doing work - the exact features are complex and only partly understood, but the result is clearly capable of doing computation in the sense of Turing machines.
I think this was a valuable post, albeit ending up somewhat incorrect about whether LLMs would be agentic - not because they developed the capacity on their own, but because people intentionally built and are building structure around LLMs to enable agency. That said, the underlying point stands - it is very possible that LLMs could be a safe foundation for non-agentic AI, and many research groups are pursuing that today.
The blogpost this points to was an important contribution at the time, more clearly laying out extreme cases for the future. (The replies there were also particularly valuable.)
I think this post makes an important and still neglected claim that people should write their work more clearly and get it published in academia, instead of embracing the norms of the narrower community they interact with. There has been significant movement in this direction in the past 2 years, and I think this post marks a critical change in what the community suggests and values in terms of output.
"the actual thinking-action that the mapping interprets"
I don't think this is conceptually correct. Looking at the chess playing waterfall that Aaronson discusses, the mapping itself is doing all of the computation. The fact that the mapping ran in the past doesn't change the fact that it's the location of the computation, any more than the fact that it takes milliseconds for my nerve impulses to reach my fingers means that my fingers are doing the thinking in writing this essay. (Though given the typos you found, it would be convenient to blame them.)
they assume ad arguendo that you can instantiate the computations we're interested in (consciousness) in a headful of meat, and then try to show that if this is the case, many other finite collections of matter ought to be able to do the job just as well.
Yes, they assume that whatever runs the algorithm is experiencing running the algorithm from the inside. And yes, many specific finite systems can do so - namely, GPUs and CPUs, as well as the wetware in our head. But without the claim that arbitrary items can do these computations, it seems that the arguendo is saying nothing different than the conclusion - right?
Looks like I messed up cutting and pasting - thanks!
Thanks - fixed!
Yeah, perhaps refuting is too strong given that the central claim is that we can't know what is and is not doing computation - which I think is wrong, but requires a more nuanced discussion. However, the narrow claims they made inter-alia were strong enough to refute, specifically by showing that their claims are equivalent to saying the integers are doing arbitrary computation - when making the claim itself requires the computation to take place elsewhere!
Seems worth noting that the claim of most of the philosophers being cited here is (1) - that even rocks are doing the same computation as minds.
I agree that this wasn't intended as an introduction to the topic. For that, I will once again recommend Scott Aaronson's excellent mini-book explaining computational complexity to philosophers.
I agree that the post isn't a definition of what computation is - but I don't need to be able to define fire to be able to point out something that definitely isn't on fire! So I don't really understand your claim. I agree that it's objectively hard to interpret computation, but it's not at all hard to interpret the fact that the integers are less complex and doing less complex computation than, say, an exponential-time Turing machine - and given the specific arguments being made, neither is a wall or a bag of popcorn. Which, as I just responded to the linked comment, was how I understood the position being taken by Searle, Putnam, and Johnson. (And even this ignores that one implication of the difference in complexity is that the wall / bag of popcorn / whatever is not mappable to arbitrary computations, since the number of steps required for a computation may not be finite!)
I've written my point more clearly here: https://www.lesswrong.com/posts/zxLbepy29tPg8qMnw/refuting-searle-s-wall-putnam-s-rock-and-johnson-s-popcorn
I think 'we estimate... to be'
Your/Aaronson's claim is that only the fully connected, sensibly interacting calculation matters.
Not at all. I'm not making any claim about what matters or counts here, just pointing out a confusion in the claims that were made here and by many philosophers who discussed the topic.
You disagree with Aaronson that the location of the complexity is in the interpreter, or you disagree that it matters?
In the first case, I'll defer to him as the expert. But in the second, the complexity is an internal property of the system! (And it's a property in a sense stronger than almost anything we talk about in philosophy; it's not just a property of the world around us, because as Gödel and others showed, complexity is a necessary fact about the nature of mathematics!)
Yeah, something like that. See my response to Euan in the other reply to my post.
Yes, and no, it does not boil down to Chalmers' argument. (As Aaronson makes clear in the paragraph before the one you quote, where he cites the Chalmers argument!) The argument from complexity is about the nature and complexity of systems capable of playing chess - which is why I think you need to carefully read the entire piece and think about what it says.
But as a small rejoinder, if we're talking about playing a single game, the entire argument is ridiculous; I can write the entire "algorithm" in a kilobyte of specific instructions. So it's not that an algorithm must be capable of playing multiple counterfactual games to qualify, or that counterfactuals are required for moral weight - it's that the argument hinges on a misunderstanding of how complex different classes of system need to be to do the things they do.
PS. Apologies that the original response comes off as combative - I really think this discussion is important, and wanted to engage to correct an important point, but have very little time to do so at the moment!
As with OP, I strongly recommend Aaronson, who explains why waterfalls aren't doing computation in ways that refute the rock example you discuss: https://www.scottaaronson.com/papers/philos.pdf
You seem to fundamentally misunderstand computation, in ways similar to Searle. I can't engage deeply, but recommend Scott Aaronson's primer on computational complexity: https://www.scottaaronson.com/papers/philos.pdf
You seem deeply confused about computation, in ways similar to Searle et al. I cannot engage deeply on this at present, but recommend Aaronson's primer on the topic: https://www.scottaaronson.com/papers/philos.pdf
Norms can accomplish this as well - I wrote about this a couple weeks ago.
Are you familiar with Davidad's program working on compositional world modeling? (The linked notes are from before the program was launched; there is ongoing work on the topic.)
The reason I ask is because embedded agents and agents in multi-agent settings should need compositional world models that include models of themselves and other agents, which implies that hierarchical agency is included in what they would need to solve.
It also relates closely to work Vanessa is doing (as an "ARIA Creator") in learning-theoretic AI, related to what she has called "Frugal Compositional Languages" (see also this work by @alcatal) - though I understand neither yet addresses multi-agent world models, nor is either explicitly about modeling the agents themselves in a compositional / embedded agent way, though those are presumably desiderata.
That is an interesting question, but I unfortunately do not know enough to even figure out how to answer it.
Good points. Yes, storage definitely helps, and microgrids are generally able to have some storage, if only to smooth out variation in power generation for local use. But solar storms can last days, even if a large long-lasting event is very, very unlikely. And it's definitely true that if large facilities have storage, shutdowns will have reduced impact - but I understand that the transformers are used for power transmission, so having local storage at the large generators won't change the need to shut down the transformers used for sending that power to consumers.
Do I understand correctly that the blue-green graph has a y-axis that goes above 100% median reduction, with error bars in that range? (This would happen if they estimated a proportion as a standard variable - not great practice, but I want to check that it is what happened.)
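To make the worry concrete, here is a small sketch of how error bars on a proportion can exceed 100%. It assumes a simple binomial proportion with a normal-approximation (Wald) interval, which may not match how the graph in question was actually produced, and the counts used are hypothetical. The Wilson score interval is shown only as one standard alternative that stays within [0, 1].

```python
import math


def wald_interval(successes: int, n: int, z: float = 1.96):
    """Normal-approximation interval: treats the proportion as unbounded."""
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval: always stays inside [0, 1]."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Hypothetical data: 19 of 20 units show the effect (95%).
print(wald_interval(19, 20))    # upper bound is above 1.0
print(wilson_interval(19, 20))  # upper bound is below 1.0
```

If the plotted bars look like the first interval, that would explain a y-axis running past 100%.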
Question for a lawyer: how is non-reciprocity not an interstate trade issue that federal courts can strike down?
In addition to the point that current models are already strongly superhuman in most ways, I think that if you buy the idea that we'll be able to do automated alignment of ASI, you'll still need some reliable approach to "manual" alignment of current systems. We're already far past the point where we can verify LLMs' claims or reasoning in a robust fashion outside of narrow domains like programming and math.
But on point two, I strongly agree that agent foundations and Davidad's agendas are also worth pursuing. (And in a sane world, we should have tens or hundreds of millions of dollars in funding for each every year.) Instead, it looks like we have Davidad's ARIA funding, Jaan Tallinn and LTFF funding some agent foundations and SLT work, and that's basically it. And MIRI abandoned agent foundations, while Openphil isn't, it seems, putting money or effort into them.
I partly disagree; steganography is only useful when it's possible for the outside / receiving system to detect and interpret the hidden messages, so if the messages are of a type that outside systems would identify, they can and should be detectable by the gating system as well.
That said, I'd be very interested in looking at formal guarantees that the outputs are minimally complex in some computationally tractable sense, or something similar - it definitely seems like something that @davidad would want to consider.
I really like that idea, and the clarity it provides, and have renamed the post to reflect it! (Sorry this was so slow - I'm travelling.)
That seems fair!
I agree that in the most general possible framing, with no restrictions on output, you cannot guard against all possible side-channels. But that's not true for proposals like safeguarded AI, where a proof must accompany the output, and it's not obviously true if the LLM is gated by a system that rejects unintelligible or not-clearly-safe outputs.
On the absolute safety, I very much like the way you put it, and will likely use that framing in the future, so thanks!
On impossibility results, there are some, and I definitely think that this is a good question, but also agree this isn't quite the right place to ask. I'd suggest talking to some of the agent foundations people for suggestions.
I think these are all really great things that we could formalize and build guarantees around. I think some of them are already ruled out by the responsibility sensitive safety guarantees, but others certainly are not. On the other hand, I don't think that use of cars to do things that violate laws completely unrelated to vehicle behavior is in scope; similar to what I mentioned to Oliver, if what is needed in order for a system to be safe is that nothing bad can be done, you're heading in the direction of a claim that the only safe AI is a universal dictator that has sufficient power to control all outcomes.
But in cases where provable safety guarantees are in place, and the issues relate to car behavior - such as cars causing damage, blocking roads, or being redirected away from the intended destination - I think hardware guarantees on the system, combined with software guarantees, combined with verifying that only trusted code is being run, could be used to ignition-lock cars which have been subverted.
And I think that in the remainder of cases, where cars are being used for dangerous or illegal purposes, we need to trade off freedom and safety. I certainly don't want AI systems which can conspire to break the law - and in most cases, I expect that this is something LLMs can already detect - but I also don't want a car which will not run if it determines that a passenger is guilty of some unrelated crime like theft. But for things like "deliver explosives or disperse pathogens," I think vehicle safety is the wrong path to preventing dangerous behavior; it seems far more reasonable to have separate systems that detect terrorism, and separate types of guarantees to ensure LLMs don't enable that type of behavior.
Yes, after saying it was about what they need "to do not to cause accidents" and that "any accidents which could occur will be attributable to other cars actions," which I then added caveats to regarding pedestrians, I said "will only have accidents" when I should have said "will only cause accidents." I have fixed that with another edit. But I think you're confused about what I'm trying to show.
Principally, I think you are wrong about what needs to be shown here for safety in the sense I outlined, or are trying to say that the sense I outlined doesn't lead to something I don't claim. If what is needed in order for a system to be safe is that no damage will be caused in situations which involve the system, you're heading in the direction of a claim that the only safe AI is a universal dictator that has sufficient power to control all outcomes. My claim, on the other hand, is that in sociotechnological systems, the way that safety is achieved is by creating guarantees that each actor - human or AI - behaves according to rules that minimize foreseeable dangers. That would include safeguards for stupid, malicious, or dangerous human actions, much like human systems have laws about dangerous actions. However, in a domain like driving, in the same way that it's impossible for human drivers to both get where they are going, and never hit pedestrians who act erratically and jump out from behind obstacles into the road with an oncoming car, a safe autonomous vehicle wouldn't be expected to solve every possible case of human misbehavior - just to drive responsibly.
More specifically, you make the claim that "as far as I can tell it would totally be compatible with a car driving extremely recklessly in a pedestrian environment due to making assumptions about pedestrian behavior that are not accurate." The paper, on the other hand, says "For example, in a typical residential street, a pedestrian has the priority over the vehicles, and it follows that vehicles must yield and be cautious with respect to pedestrians," and formalizes this with statements like "a vehicle must be in a kinematic state such that if it will apply a proper response (acceleration for ρ seconds and then braking) it will remain outside of a ball of radius 50cm around the pedestrian."
I also think that it formalizes reasonable behavior for pedestrians, but I agree that it won't cover every case - pedestrians oblivious to cars that are driving in ways that are otherwise safe, who rapidly change their path to jump in front of cars, are sometimes able to be hit by those cars - but I think fault is pretty clear here. (And the paper is clear that even in those cases, the car would need to both drive safely in residential areas, and attempt to brake or avoid the pedestrian in order to avoid crashes even in cases with irresponsible and erratic humans!)
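As a rough illustration of what the quoted kinematic condition amounts to, here is a one-dimensional sketch. It is not the paper's full formalization: it assumes a stationary pedestrian directly ahead, a vehicle that accelerates at its maximum rate for the response time ρ and then brakes at a guaranteed minimum deceleration, and parameter values that are entirely hypothetical. The function names are mine.

```python
def response_travel_distance(v: float, rho: float,
                             a_max_accel: float, a_min_brake: float) -> float:
    """Distance (m) covered while accelerating for rho seconds, then braking to a stop.

    v            current speed (m/s)
    rho          response time (s)
    a_max_accel  worst-case acceleration during the response time (m/s^2)
    a_min_brake  braking deceleration the vehicle can guarantee (m/s^2)
    """
    v_after = v + rho * a_max_accel
    d_response = v * rho + 0.5 * a_max_accel * rho**2
    d_brake = v_after**2 / (2 * a_min_brake)
    return d_response + d_brake


def keeps_clear_of_pedestrian(gap: float, v: float, rho: float,
                              a_max_accel: float, a_min_brake: float,
                              radius: float = 0.5) -> bool:
    """True if the stopped vehicle is still outside the ball around the pedestrian."""
    travelled = response_travel_distance(v, rho, a_max_accel, a_min_brake)
    return gap - travelled >= radius


# Hypothetical numbers: 8 m/s (about 29 km/h), 0.5 s response time.
print(response_travel_distance(8.0, 0.5, 2.0, 6.0))         # 11.0
print(keeps_clear_of_pedestrian(12.0, 8.0, 0.5, 2.0, 6.0))  # True
print(keeps_clear_of_pedestrian(11.2, 8.0, 0.5, 2.0, 6.0))  # False
```

The point of a check like this is that it constrains the vehicle's own state (speed relative to the gap), rather than assuming anything about what the pedestrian will do.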
But again, as I said initially, this isn't solving the general case of AI safety, it's solving a much narrower problem. And if you wanted to make the case that this isn't enough for similar scenarios that we care about, I will strongly agree that for more capable systems, the set of situations it would need to avoid is correspondingly larger, and the set of necessary guarantees is far stronger. But as I said at the beginning, I'm not making that argument - just the much simpler one that provability can work in physical systems, and can be applied in sociotechnological systems in ways that make sense.
I agree that "safety in an open world cannot be proved," at least as a general claim, but disagree that this impinges on the narrow challenge of designing cars that do not cause accidents - a misunderstanding which I tried to be clear about, but which I evidently failed to make sufficiently clear, as Oliver's misunderstanding illustrates. That said, I strongly agree that better methods for representing grain of truth problems, and considering hypotheses outside those which are in the model is critical. It's a key reason I'm supporting work on infra-Bayesian approaches, which are designed explicitly to handle this class of problem. Again, it's not necessary for the very narrow challenge I think I addressed above, but I certainly agree that it's critical.
Second, I'm a huge proponent of complex system engineering approaches, and have discussed this in previous unrelated work. I certainly agree that these issues are critical, and should receive more attention - but I think it's counterproductive to try to embed difficult problems inside of addressable ones. To offer an analogy, creating provably safe code that isn't vulnerable to any known technical exploit still will not prevent social engineering attacks, but we can still accomplish the narrow goal.
If, instead of writing code that can't be fuzzed for vulnerabilities, doesn't contain buffer overflow or null-pointer vulnerabilities, and can't be exploited via transient execution CPU vulnerabilities, and isn't vulnerable to rowhammer attacks, you say that we need to address social engineering before trying to make the code provably safe, and should address social engineering with provable properties, you're sabotaging progress in a tractable area in order to apply a paradigm ill-suited to the new problem you're concerned with.
That's why, in this piece, I started by saying I wasn't proving anything general, and "I am making far narrower claims than the general ones which have been debated." I agree that the larger points are critical. But for now, I wanted to make a simpler point.
To start at the end, you claim I "straightforwardly made an inaccurate unqualified statement," but replaced my statement about "what a car needs to do not to cause accidents" with "no accidents will take place." And I certainly agree that there is an "extremely difficult and crucial step of translating a form toy world like RSS into real world outcomes," but the toy model that the paper is dealing with is therefore one of rule-following entities, both pedestrians and cars. That's why it's not going to require accounting for "what if pedestrians do something illegal and unexpected."
Of course, I agree that this drastically limits the proof, or as I said initially, "relying on assumptions about other car behavior is a limit to provable safety," but you seem to insist that because the proof doesn't do something I never claimed it did, it's glossing over something.
That said, I agree that I did not discuss pedestrians, but as you sort-of admit, the paper does - it treats stationary pedestrians not at crosswalks, and not on sidewalks, as largely unpredictable entities that may enter the road. For example, it notes that "even if pedestrians do not have priority, if they entered the road at a safe distance, cars must brake and let them pass." But again, you're glossing over the critical assumption for the entire section, which is responsibility for accidents. And this is particularly critical; the claim is not that pedestrians and other cars cannot cause accidents, but that the safe car will not do so.
Given all of that, to get back to the beginning, your initial position was that "RSS seems miles away from anything that one could describe as a formalization of how to avoid an accident." Do you agree that it's close to "a formalization of how to avoid causing an accident"?
Have you reviewed the paper? (It is the first link under "The RSS Concept" in the page which was linked to before, though perhaps I should have linked to it directly.) It seems to lay out the proof, and discusses pedestrians, and deals with most of the objections you're raising, including obstructions and driving off of marked roads. I admit I have not worked through the proof in detail, but I have read through it, and my understanding is that it was accepted, and a large literature has been built that extends it.
And the objections about slippery roads and braking are the set of things I noted under "traditional engineering analysis and failure rates." I agree that guarantees are non-trivial, but they also aren't outside of what is done already in safety analysis, and there is explicit work in the literature on the issue, both from the verification and validation side, and from the perception and sensing weather conditions side.