Finding non-equilibrium quantum states would be evidence of pilot wave theory since they're only possible in a pilot wave theory.
If you can find non-equilibrium quantum states, they are distinguishable. https://en.m.wikipedia.org/wiki/Quantum_non-equilibrium
(Seems pretty unlikely we'd ever be able to definitively say a state was non-equilibrium instead of some other weirdness, though.)
I can help confirm that your blind assumption is false. Source: my undergrad research was with a couple of the people who have tried hardest, which led to me learning a lot about the problem. (Ward Struyve and Samuel Colin.) The problem goes back to Bell and has been the subject of a dedicated subfield of quantum foundations scholars ever since.
This many years distant, I can't give a fair summary of the actual state of things. But a possibly unfair summary based on vague recollections is: it seems like the kind of situation where specialists have something that kind of works, but people outside the field don't find it fully satisfying. (Even people in closely adjacent fields, i.e. other quantum foundations people.) For example, one route I recall abandons using position as the hidden variable, which makes one question what the point was in the first place, since we no longer recover a simple manifest image where there is a "real" notion of particles with positions. And I don't know whether the math fully worked out all the way up to the complexities of the standard model weakly coupled to gravity. (As opposed to, e.g., only working with spin-1/2 particles, or something.)
Now I want to go re-read some of Ward's papers...
This is great, until Spotify is ready this will be the best way to share on social media.
May I suggest adding lyrics, either in the description or as closed captions or both?
If you are willing to share, can you say more about what got you into this line of investigation, and what you were hoping to get out of it?
For my part, I don't feel like I have many issues/baggage/trauma, so while some of the "fundamental debugging" techniques discussed around here (like IFS or meditation) seem kind of interesting, I don't feel too compelled to dive in. Whereas, techniques like TYCS or jhana meditation seem more intriguing, as potential "power ups" from a baseline-fine state.
So I'm wondering if your baseline is more like mine, and you ended up finding fundamental debugging valuable anyway.
It seems we have very different abilities to understand Holtman's work and find it intuitive. That's fair enough! Are you willing to at least engage with my minimal-time-investment challenge?
The agent is indifferent between creating stoppable or unstoppable subagents, but the agent goes back to being corrigible in this way. The "emergent incentive" handwave is only necessary for the subagents working on sub-goals (section 8.4). That is not something that either Soares et al. or the post we're commenting on is prepared to tackle, although it would be interesting follow-up work.
I suggest engaging with the simulator. It very clearly shows that, given the option of creating shutdown-resistant successor agents, the agent does not do so! (Figure 11) If you believe it doesn't work, you must also believe there's a bug in the simulation, or some mis-encoding of the problem in the simulation. Working that out, either by forking his code or by working out an example on paper, would be worthwhile. (Forking his code is not recommended, as it's in Awk; I have an in-progress reimplementation in optimized-for-readability TypeScript which might be helpful if I get around to finishing it. But especially if you simplify the problem to a 2-step setting like your post, computing his correction terms on paper seems very doable.)
I agree with the critique that some patches are unsatisfying. I'm not sure how broadly you are applying your criticism, but to me the ones involving constant offsets (7.2 and 8.2) are not great. However, at least for 7.2, the paper clarifies what's going on reasonably well: the patch is basically environment-dependent, and in the limit where your environment is unboundedly hostile (e.g., an agent controls unbounded utility and is willing to bribe you with it) you're going to need an unbounded offset term.
I found that the paper's proof was pretty intuitive and distilled. I think it might be for you as well if you did a full reading.
At a meta-level, I'd encourage you to be a bit more willing to dive into this work, possibly including the paper series it's part of. Holtman has done some impressive work on formalizing the shutdown problem better than Soares et al., or this post we're commenting on. He's given not only rigorous mathematical proofs, but also a nice toy universe simulation which makes the results concrete and testable. (Notably, the simulation helps make it obvious how Soares et al.'s approach has critical mathematical mistakes and cannot be implemented; see appendix C.) The followup papers, which I'm still working through, port the result to various other paradigms such as causal influence diagrams. Attempting to start this field over as if there's been no progress on the shutdown problem since Soares et al. seems... wasteful at best, and hubristic at worst.
If you want to minimize time investment, then perhaps the following is attractive. Try to create a universe specification similar to that of Holtman's paper, e.g. world state, available actions, and utility function before and after shutdown as a function of the world state, such that you believe that Holtman's safety layer does not prevent the agent from taking the "create an unstoppable sub-agent" action. I'll code it up, apply the correction term, and get back to you.
Are you aware of how Holtman solved MIRI's formulation of the shutdown problem in 2019? https://arxiv.org/abs/1908.01695, my summary notes at https://docs.google.com/document/u/0/d/1Tno_9A5oEqpr8AJJXfN5lVI9vlzmJxljGCe0SRX5HIw/mobilebasic
Skimming through your proposal, I believe Holtman's correctly-constructed utility function correction terms would work for the scenario you describe, but it's not immediately obvious how to apply them once you jump to a subagent model.
This is a well-executed paper that indeed shakes some of my faith in ChatGPT/LLMs/transformers with its negative results.
I'm most intrigued by their negative result for GPT-4 prompted with a scratchpad. (B.5, figure 25.) This is something I would have definitely predicted would work. GPT-4 shows enough intelligence in general that I would expect it to be able to follow and mimic the step-by-step calculation abilities shown in the scratchpad, even if it were unable to figure out the result one- or few-shot (B.2, figure 15).
But, what does this failure mean? I'm not sure I understand the authors' conclusions: they state (3.2.3) this "suggests that models are able to correctly perform single-step reasoning, potentially due to memorizing such single-step operations during training, but fail to plan and compose several of these steps for an overall correct reasoning." I don't see any evidence of that in the paper!
In particular, 3.2.3 and figure 7's categorization of errors, as well as the theoretical results they discuss in section 4, give me the opposite impression. Basically they say that if you make a local error, it'll propagate and screw you up. You can see, e.g., in figure 7's five-shot GPT-4 example, how a single local error at graph layer 1 causes propagation error to start growing immediately. Later more local errors kick in, but to me this is sort of understandable: once the calculation starts going off the rails, the model might not be in a good place to do even local reasoning.
I don't see what any of this has to do with planning and composing! In particular I don't see any measurement of something like "set up a totally wrong plan for multiplying numbers" or "fail to compose all the individual digit computation-steps into the final answer-concatenation step". Such errors might exist, but the paper doesn't give examples or any measurements of them. Its categorization of error types seems to assume that the model always produces a computation graph, which to me is pretty strong evidence of planning and composing abilities!
Stated another way: I suspect that if you eliminated all the local errors, accuracy would be good! So the question is: why is GPT-4 failing to multiply single-digit numbers sometimes, in the middle of these steps?
(It's possible the answer lies in tokenization difficulties, but it seems unlikely.)
OK, now let's look at it from another angle: how different is this from humans? What's impressive to me about this result is that it is quite different. I was expecting to say something like, "oh, not every human will be able to get 100% accuracy on following the multiplication algorithm for 5-digit-by-5-digit numbers; it's OK to expect some mistakes". But, GPT-4 fails to multiply 5x5 digit numbers every time!! Even with a scratchpad! Most educated humans would get better than zero accuracy.
So my overall takeaway is that local errors are still too prevalent in these sorts of tasks. Humans don't always make at least one mistake on { one-digit multiplication, sum, mod 10, carry over, concatenation } when performing 5-digit by 5-digit multiplication, whereas GPT-4 supposedly does.
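To make the "local errors compound" point concrete, here is a minimal Python sketch of schoolbook multiplication that logs every primitive operation from the list above. The decomposition and the step names are my own assumptions, not the paper's exact computation graph, and the final sum of partial products is lumped into a single step for simplicity.

```python
def scratchpad_multiply(a: int, b: int):
    """Schoolbook multiplication of non-negative ints, logging each primitive step.

    The step granularity here is an illustrative assumption, not the
    paper's actual computation graph.
    """
    steps = []
    a_digits = [int(d) for d in str(a)][::-1]  # least significant digit first
    b_digits = [int(d) for d in str(b)][::-1]
    partials = []
    for position, bd in enumerate(b_digits):
        carry = 0
        out = []
        for ad in a_digits:
            prod = ad * bd
            steps.append(("multiply", ad, bd, prod))
            total = prod + carry
            steps.append(("sum", prod, carry, total))
            digit = total % 10
            steps.append(("mod10", total, digit))
            carry = total // 10
            steps.append(("carry", total, carry))
            out.append(digit)
        if carry:
            out.append(carry)
        # Concatenate the digits and shift by the position of the b digit.
        partial = int("".join(str(d) for d in reversed(out))) * 10 ** position
        steps.append(("concatenate", partial))
        partials.append(partial)
    result = sum(partials)
    steps.append(("final_sum", result))
    return result, steps


def chance_all_steps_correct(per_step_accuracy: float, n_steps: int) -> float:
    """Probability of zero local errors, assuming independent steps."""
    return per_step_accuracy ** n_steps


result, steps = scratchpad_multiply(12345, 67890)
print(result == 12345 * 67890, len(steps))
print(chance_all_steps_correct(0.99, len(steps)))
```

Under this decomposition a 5-digit by 5-digit product takes 106 primitive steps, so even a hypothetical 99% per-step accuracy (with independent errors, which is itself an assumption) would leave only roughly a one-in-three chance of a fully clean run. That is low, but still a long way from failing every time.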
Am I understanding this correctly? Well, it'd be nice to reproduce their results to confirm. If they're to be believed, I should be able to ask GPT-4 to do one of these multiplication tasks with a scratchpad, and always find an error in the middle. But when trying to reproduce their results, I ran into an issue of under-documented methodology (how did they compose the prompts?) and non-published data (what inaccurate things did the models actually say?). Filed on GitHub; we'll see if they get back to me.
Regarding grokking, they attempt to test whether GPT-3 finetuned on these sorts of problems will exhibit grokking. However, I'm skeptical of this attempt: they trained for 60 epochs for zero-shot and 40 epochs with scratchpads. Whereas the original grokking paper used between 3,571 epochs and 50,000 epochs.
(I think epochs is probably a relevant measure here, instead of training steps. This paper does 420K and 30K steps whereas the original grokking paper does 100K steps, so if we were comparing steps it seems reasonable. But "number of times you saw the whole data set" seems more relevant for grokking, in my uninformed opinion!)
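As a quick sanity check on the steps-versus-epochs comparison, here is a small Python sketch using only the figures quoted above. It assumes that the 420K-step run is the 60-epoch zero-shot one and the 30K-step run is the 40-epoch scratchpad one (I'm pairing them by the order they're listed, which may be wrong), and that the grokking paper's 100K steps apply across its whole epoch range.

```python
def steps_per_epoch(total_steps: int, epochs: int) -> float:
    """Average number of optimizer steps needed to see the data set once."""
    return total_steps / epochs


# (total steps, epochs) as quoted in the comment above; the pairing is assumed.
runs = {
    "this paper, zero-shot": (420_000, 60),
    "this paper, scratchpad": (30_000, 40),
    "grokking paper, fewest epochs": (100_000, 3_571),
    "grokking paper, most epochs": (100_000, 50_000),
}

for name, (total_steps, epochs) in runs.items():
    print(f"{name}: {steps_per_epoch(total_steps, epochs):.0f} steps per epoch")
```

If the pairing is right, this paper takes hundreds to thousands of steps per pass over the data, while the grokking paper takes between roughly 2 and 28. The step counts are similar, but the number of passes over the data differs by orders of magnitude.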
Has anyone actually seen LLMs (not just transformers) exhibit grokking? A quick search says no.
I've had a hard time connecting John's work to anything real. It's all over Bayes nets, with some (apparently obviously true https://www.lesswrong.com/posts/2WuSZo7esdobiW2mr/the-lightcone-theorem-a-better-foundation-for-natural?commentId=K5gPNyavBgpGNv4m3 ) theorems coming out of it.
In contrast, look at work like Anthropic's superposition solution, or the representation engineering paper from CAIS. If someone told me "I'm interested in identifying the natural abstractions AIs use when producing their output", that is the kind of work I'd expect. It's on actual LLMs! (Or at least "LMs", for the Anthropic paper.) They identified useful concepts like "truth-telling" or "Arabic"!
In John's work, his prose often promises he'll point to useful concepts like different physics models, but the results instead seem to operate on the level of random variables and causal diagrams. I'd love to see any sign this work is applicable toward real-world AI systems, and can, e.g., accurately identify what abstractions GPT-2 or LLaMA are using.
Interesting post. One small area where I might have a useful insight:
A lot of online multiplayer games rest on the appeal of their character design. Think of Smash Bros, Overwatch, or League of Legends. Characters' unique abilities give rise to a dense hypergraph of strategic relationships which players will want to learn the whole of.
But in these games, a character cannot have unique motivations. They'll have a backstory that alludes to some, but in the game, that will be forgotten. Instead, every mind will be turned towards just one game and one goal: Kill the other team, whoever they are. MostPointsWins forbids the expression of the most interesting dimensions of personality.
So imagine how much richer expressions of character could be if you had this whole other dimension of gameplay design to work with. That would be cohabitive.
Role-playing games, including online multiplayer RPGs ("MMORPGs"), often achieve this. In SWTOR, when I play my Empire-hating fallen-to-the-dark-side Jedi Knight, versus my heart-of-gold bounty hunter, or my murderously-insane Sith Inquisitor, I make very different choices even when faced with the same content. This is most obviously manifest in the dialogue tree choices for the main story, but extends to other aspects of gameplay as well: whether I take a stealth route or go on a murderous rampage; whether my character does all the side missions or jumps straight to the main objective; which companions I choose to bring along; and even what clothes I wear.
This role playing mostly does go out the window in the "most intense" parts of the game (PvP and raids). Although sometimes over voice chat we'll briefly slip into character for fun, or compliment each other on a particularly on-point use of abilities. ("Just like a smuggler, to stun all the trash.")
One key ingredient here is strong archetypes from an existing background (e.g. the Star Wars galaxy in my case, or the Warcraft universe in others, or maybe just fantasy in general). Another is the long-lasting investment and relationship one builds with their character, from design onward; I feel a lot more connection to my SWTOR characters than I do to the random personalities I pick up in any given Betrayal at House on the Hill session.
Anyway, maybe there's something to learn here for your game design!
I wonder if more people would join you on this journey if you had more concrete progress to show so far?
If you're trying to start something approximately like a new field, I think you need to be responsible for field-building. The best type of field-building is showing that the new field is not only full of interesting problems, but tractable ones as well.
Compare to some adjacent examples:
- Eliezer had some moderate success building the field of "rationality", mostly through explicit "social" field-building activities like writing the sequences or associated fanfiction, or spinning off groups like CFAR. There isn't much to show in terms of actual results, IMO; we haven't developed a race of Jeffreysai supergeniuses who can solve quantum gravity in a month by sufficiently ridding themselves of cognitive biases. But the social field-building was enough to create a great internet social scene of like-minded people.
- MIRI tried to kickstart a field roughly in the cluster of theoretical alignment research, focused around topics like "how to align AIXI", decision theories, etc. In terms of community, there are a number of researchers who followed in these footsteps, mostly at MIRI itself to my knowledge, but also elsewhere. (E.g. I enjoy @Koen.Holtman's followup work such as Corrigibility with Utility Preservation.) In terms of actual results, I think we see a steady stream of papers/posts showing slow-but-legible progress on various sub-agendas here: infra-bayesianism, agent foundations, corrigibility, natural abstractions, etc. Most (?) results seem to be self-published to miri.org, the Alignment Forum, or the arXiv, and either don't attempt or don't make it past peer review. So those who are motivated to join a field by legible incentives such as academic recognition and acceptance are often not along for the ride. But it's still something.
- "Mainstream" AI alignment research, as seen by the kind of work published by OpenAI, Anthropic, DeepMind, etc., has taken a much more conventional approach. People in this workstream are employed at large organizations that pay well; they publish in peer-reviewed journals and present at popular conferences. Their work often has real-world applications in aligning or advancing the capabilities of products people use.
In contrast, I don't see any of this sort of field-building work from you for meta-philosophy. Your post history doesn't seem to be trying to do social field-building, nor does it contain published results that could make others sit up and take notice of a tractable research agenda they could join. If you'd spent age 35-45 publishing a steady stream of updates and progress reports on meta-philosophy, I think you'd have gathered at least a small following of interested people, in the same way that the theoretical alignment research folk have. And if you'd used that time to write thousands of words of primers and fanfic, maybe you could get a larger following of interested bystanders. Maybe there's even something you could have done to make this a serious academic field, although that seems pretty hard.
In short, I like reading what you write! Consider writing more of it, more often, as a first step toward getting people to join you on this journey.
I continue to be intrigued about the ways modern powerful AIs (LLMs) differ from the Bostrom/Yudkowsky theorycrafted AIs (generally, agents with objective functions, and sometimes specifically approximations of AIXI). One area I'd like to ask about is corrigibility.
From what I understand, various impossibility results on corrigibility have been proven. And yet, GPT-4 is quite corrigible. (At the very least, in the sense that if you go unplug it, it won't stop you.) Has anyone analyzed which preconditions of the impossibility results have been violated by GPT-N? Do doomers have some prediction for how GPT-N for N >= 5 will suddenly start meeting those preconditions?
I am sympathetic to this viewpoint. However, I think there are large-enough gains to be had from "just" an AI that: matches genius-level humans; has N times larger working memory; thinks N times faster; has "tools" (like calculators or Mathematica or Wikipedia) integrated directly into its "brain"; and is infinitely copyable. That gets you to https://www.lesswrong.com/posts/5wMcKNAwB6X4mp9og/that-alien-message territory, which is quite x-risky.
Working memory is a particularly powerful intuition pump here, I think. Given that you can hold 8 items in working memory, it feels clear that something which can hold 800 or 8 million such items would feel qualitatively impressive. Even if it's just a quantitative scale-up.
You can then layer on more speculative things from the ability to edit the neural net. E.g., for humans, we are all familiar with flow states where we are at our best, most well-rested, most insightful. It's possible that if your brain is artificial, reproducing that on demand is easier. The closest thing we have to this today is prompt engineering, which is a very crude way of steering AIs to use the "smarter" path when navigating through their stack of transformer layers. But the existence of such paths likely means we, or an introspective AI, could consciously promote their use.
I don't think the lack of machine translations of AI alignment materials is holding the field back in Japan. Japanese people have an unlimited amount of that already available. "Doing it for them", when their web browser already has the feature built in, seems honestly counterproductive, as it signals how little you're willing to invest in the space.
I think it's possible increasing the amount of human-translated material could make a difference. (Whether machines are good enough to aid such humans or not, is a question I leave to the professional translators.)
Yes, I would really appreciate that. I find this approach compelling in the abstract, but what does it actually cash out in?
My best guess is that it means lots of mechanistic interpretability research, identifying subsystems of LLMs (or similar) and trying to explain them, until eventually they're made of less and less Magic. That sounds good to me! But what directions sound promising there? E.g. the only result in this area I've done a deep dive on, "Transformers learn in-context by gradient descent", is pretty limited as it only gets a clear match for linear (!) single-layer (!!) regression models, not anything like an LLM. How much progress does Conjecture expect to really make? What are other papers our study group should read?
Relatedly, CoEms could be run at potentially high speed-ups, and many copies or variations could be run together. So we could end up in the classic scenario of a smarter-than-average "civilization", with "thousands of years" to plan, that wants to break out of the box.
This still seems less existentially risky, though, if we end up in a world where the CoEms retain something approximating human values. They might want to break out of the box, but probably wouldn't want to commit species-cide on humans.
These are the most compelling-to-me quotes from "Simulators", saved for posterity.
Perhaps it shouldn’t be surprising if the form of the first visitation from mindspace mostly escaped a few years of theory conducted in absence of its object.
…when AI is all of a sudden writing viral blog posts, coding competitively, proving theorems, and passing the Turing test so hard that the interrogator sacrifices their career at Google to advocate for its personhood, a process is clearly underway whose limit we’d be foolish not to contemplate.
GPT-3 does not look much like an agent. It does not seem to have goals or preferences beyond completing text, for example. It is more like a chameleon that can take the shape of many different agents. Or perhaps it is an engine that can be used under the hood to drive many agents. But it is then perhaps these systems that we should assess for agency, consciousness, and so on.
It would not be very dignified of us to gloss over the sudden arrival of artificial agents often indistinguishable from human intelligence just because the policy that generates them “only cares about predicting the next word”.
There is a clear sense in which the network doesn’t “want” what the things that it simulates want, seeing as it would be just as willing to simulate an agent with opposite goals, or throw up obstacles which foil a character’s intentions for the sake of the story. The more you think about it, the more fluid and intractable it all becomes. Fictional characters act agentically, but they’re at least implicitly puppeteered by a virtual author who has orthogonal intentions of their own.
Everything can be trivially modeled as a utility maximizer, but for these reasons, a utility function is not a good explanation or compression of GPT’s training data, and its optimal predictor is not well-described as a utility maximizer. However, just because information isn’t compressed well by a utility function doesn’t mean it can’t be compressed another way.
Gwern has said that anyone who uses GPT for long enough begins to think of it as an agent who only cares about roleplaying a lot of roles.
That framing seems unnatural to me, comparable to thinking of physics as an agent who only cares about evolving the universe accurately according to the laws of physics.
That said, GPT does store a vast amount of knowledge, and its corrigibility allows it to be cajoled into acting as an oracle, like it can be cajoled into acting like an agent.
Treating GPT as an unsupervised implementation of a supervised learner leads to systematic underestimation of capabilities, which becomes a more dangerous mistake as unprobed capabilities scale.
Benchmarks derived from supervised learning test GPT’s ability to produce correct answers, not to produce questions which cause it to produce a correct answer down the line. But GPT is capable of the latter, and that is how it is the most powerful.
Natural language has the property of systematicity: “blocks”, such as words, can be combined to form composite meanings. The number of meanings expressible is a combinatorial function of available blocks. A system which learns natural language is incentivized to learn systematicity; if it succeeds, it gains access to the combinatorial proliferation of meanings that can be expressed in natural language. What GPT lets us do is use natural language to specify any of a functional infinity of configurations, e.g. the mental contents of a person and the physical contents of the room around them, and animate that. That is the terrifying vision of the limit of prediction that struck me when I first saw GPT-3’s outputs.
Autonomous text-processes propagated by GPT, like automata which evolve according to physics in the real world, have diverse values, simultaneously evolve alongside other agents and non-agentic environments, and are sometimes terminated by the disinterested “physics” which governs them.
Calling GPT a simulator gets across that in order to do anything, it has to simulate something, necessarily contingent, and that the thing to do with GPT is to simulate! Most published research about large language models has focused on single-step or few-step inference on closed-ended tasks, rather than processes which evolve through time, which is understandable as it’s harder to get quantitative results in the latter mode. But I think GPT’s ability to simulate text automata is the source of its most surprising and pivotal implications for paths to superintelligence: for how AI capabilities are likely to unfold and for the design-space we can conceive.
"Do stuff that seems legibly valuable" becomes the main currency, rather than "do stuff that is actually valuable."
In my experience, these are aligned quite often, and a good organization/team/manager's job is keeping them aligned. This involves lots of culture-building, vigilance around goodharting, and recurring check-ins and reevaluations to make sure the layer under you is properly aligned. Some of the most effective things I've noticed are rituals like OKR planning and wide-review councils, and having a performance evaluation culture that tries to uphold objective standards.
And of course, the level after this you describe, where people are basically pretending to work, is something effective organizations try to weed out ruthlessly. I'm sure it exists, but competitive pressures are against it persisting for long.
My direct experience is as an increasingly-senior software engineer at Google. It leads me to be much more optimistic about corporations' abilities to be effective than this article. I truly believe that things have been set up such that what gets me and my teammates good ratings or promotions is actually valuable.
Oh, I didn't realize it was such a huge difference! Almost half of the sequences omitted. Wow. I guess I can write a quick program to diff and thus answer my original question.
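For what it's worth, the quick diff program I have in mind is only a few lines of Python. This is a rough sketch: the two title lists are hypothetical placeholders, and it assumes titles match after lowercasing and collapsing whitespace, which may not hold if R:A-Z renamed any posts.

```python
def normalize(title: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't matter."""
    return " ".join(title.lower().split())


def omitted_posts(original_titles, book_titles):
    """Return the original titles, in order, that the book does not include."""
    in_book = {normalize(title) for title in book_titles}
    return [title for title in original_titles if normalize(title) not in in_book]


# Hypothetical placeholder data; the real lists would come from the two indexes.
original_sequences = ["Post A", "Post B", "Post  C", "Post D"]
rationality_a_to_z = ["post a", "Post C"]

print(omitted_posts(original_sequences, rationality_a_to_z))
```

With the placeholder lists above this prints the two titles that appear only in the original list, "Post B" and "Post D".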
Was there any discussion about how and why R:A-Z made the selections it did?
Do you have any recommendations for particularly good sequences or posts that R:A-Z omitted, from the ~430 that I've apparently missed?
This seems like a simulator in the same way the human imagination is a simulator. I could mentally simulate a few chess moves after the ones you prompted. After a while (probably a short while) I'd start losing track of things and start making bad moves. Eventually I'd probably make illegal moves, or maybe just write random move-like character strings if I was given some motivation for doing so and thought I could get away with it.