Non-Disparagement Canaries for OpenAI 2024-05-30T19:20:13.022Z
OMMC Announces RIP 2024-04-01T23:20:00.433Z
Why Are Bacteria So Simple? 2023-02-06T03:00:31.837Z


Comment by aysja on Non-Disparagement Canaries for OpenAI · 2024-06-04T08:43:28.152Z · LW · GW

I have not heard from anyone who wasn’t released, and I think it is reasonably likely I would have heard from them anonymously on Signal. Also, not releasing a bunch of people after saying they would seems like an enormously unpopular, hard to keep secret, and not very advantageous move for OpenAI, which is already taking a lot of flak for this. 

I’m not necessarily imagining that OpenAI failed to release a bunch of people, although that still seems possible to me. I’m more concerned that they haven’t released many key people, and while I agree that you might have received an anonymous Signal message to that effect if it were true, I still feel alarmed that many of these people haven’t publicly stated otherwise.

I also have a model of how people choose whether or not to make public statements where it’s extremely unsurprising most people would not choose to do so.

I do find this surprising. Many people are aware of who former OpenAI employees are, and hence are aware of who was (or is) bound by this agreement. At the very least, if I were in this position, I would want people to know that I was no longer bound. And it does seem strange to me, if the contract has been widely retracted, that so few prominent people have confirmed being released. 

It also seems pretty important to figure out who is under mutual non-disparagement agreements with OpenAI, which would still (imo) pose a problem if it applied to anyone in safety evaluations or policy positions.

Comment by aysja on Non-Disparagement Canaries for OpenAI · 2024-06-03T22:07:04.283Z · LW · GW

I imagine many of the people going into leadership positions were prepared to ignore the contract, or maybe even forgot about the nondisparagement clause

I could imagine it being the case that people are prepared to ignore the contract. But unless they publicly state as much, it wouldn’t ameliorate my concerns—how else is anyone supposed to trust that they will?

The clause is also open to more avenues of legal attack if it's enforced against someone who takes another position which requires disparagement (e.g. if it's argued to be a restriction on engaging in business). 

That seems plausible, but even if this does increase the likelihood that they’d win a legal battle, legal battles still pose huge risk and cost. This still seems like a meaningful deterrent.

I don't think it's fair to view this as a serious breach of trust on behalf of any individual, without clear evidence that it impacted their decisions or communication. 

But how could we even get this evidence? If they’re bound to the agreement their actions just look like an absence of saying disparaging things about OpenAI, or of otherwise damaging their finances or reputation. And it’s hard to tell, from the outside, whether this is a reflection of an obligation, or of a genuine stance. Positions of public responsibility require public trust, and the public doesn’t have access to the inner workings of these people’s minds. So I think it’s reasonable, upon finding out that someone has a huge and previously-undisclosed conflict of interest, to assume that might be influencing their behavior.

Comment by aysja on MIRI 2024 Communications Strategy · 2024-05-31T20:59:53.573Z · LW · GW

I agree this is usually the case, but I think it’s not always true, and I don’t think it’s necessarily true here. E.g., people as early as Da Vinci guessed that we’d be able to fly long before we had planes (or even any flying apparatus which worked). Because birds can fly, we should be able to as well (at least, this was Da Vinci’s and the Wright brothers’ reasoning). That end point was not dependent on details (early flying designs had wings like a bird’s, a design we did not keep :p), but was closer to a laws-of-physics claim (if birds can do it, there isn’t anything fundamentally holding us back from doing it either).

Superintelligence holds a similar place in my mind: intelligence is physically possible, because we exhibit it, and it seems quite arbitrary to assume that we’ve maxed it out. But also, intelligence is obviously powerful, and reality is obviously more manipulable than we currently have the means to manipulate it. E.g., we know that we should be capable of developing advanced nanotech, since cells can, and that space travel/terraforming/etc. is possible. 

These two things together—“we can likely create something much smarter than ourselves” and “reality can be radically transformed”—are enough to make me feel nervous. At some point I expect most of the universe to be transformed by agents; whether this is us, or aligned AIs, or misaligned AIs or what, I don’t know. But looking ahead and noticing that I don’t know how to select the “aligned AI” option from the set “things which will likely be able to radically transform matter” seems enough cause, in my mind, for exercising caution. 

Comment by aysja on OpenAI: Fallout · 2024-05-28T17:53:22.236Z · LW · GW

Bloomberg confirms that OpenAI has promised not to cancel vested equity under any circumstances, and to release all employees from one-directional non-disparagement agreements.

They don't actually say "all" and I haven't seen anyone confirm that all employees received this email. It seems possible (and perhaps likely) to me that many high-profile safety people did not receive this email, especially since it would presumably be in Sam's interest to withhold it from them, and since I haven't seen them claim otherwise. And we wouldn't know: those who are still under the contract can't say anything. If OpenAI only sent the email to some former employees, then they can come away with headlines like "OpenAI releases former staffers from agreement"—which is true—without giving away their whole hand. Perhaps I'm being too pessimistic, but I am under the impression that we're dealing with a quite adversarial player, and until I see hard evidence otherwise this is what I'm assuming. 

Comment by aysja on Maybe Anthropic's Long-Term Benefit Trust is powerless · 2024-05-27T21:36:03.996Z · LW · GW

Why do you think this? The power that I'm primarily concerned about is the power to pause, and I'm quite skeptical that companies like Amazon and Google would be willing to invest billions of dollars in a company which may decide to do something that renders their investment worthless. I.e., I think a serious pause, one on the order of months or years, is essentially equivalent to opting out of the race to AGI. On this question, my strong prior is that investors like Google and Amazon have more power than employees or the trust, else they wouldn't invest. 

Comment by aysja on An explanation of evil in an organized world · 2024-05-02T07:22:22.949Z · LW · GW

"So God can’t make the atoms be arranged one way and the humans be arranged another contradictory way."

But couldn't he have made a different sort of thing than humans, which was less prone to evil? Like, it seems to me that he didn't need to make us evolve through the process of natural selection, such that species were always in competition, status was a big deal, fighting over mates commonplace, etc. I do expect that there's quite a bit of convergence in the space of possible minds—even if one is selecting them from the set of "all possible atomic configurations of minds"—but I would still guess that not all of those are as prone to "evil" as us. I.e., if the laws of physics were held constant, I would think you could get less evil things than us out of it, and probably worlds which were overall more favorable to life (fewer natural disasters, etc.). But perhaps this is even more evidence that God only cares about the laws of physics? Since we seem much more like an afterthought than a priority?

Comment by aysja on The Intentional Stance, LLMs Edition · 2024-05-01T23:27:46.098Z · LW · GW

Secondly, following Dennett, the point of modeling cognitive systems according to the intentional stance is that we evaluate them on a behavioral basis and that is all there is to evaluate.

I am confused on this point. Several people have stated that Dennett believes something like this, e.g., Quintin and Nora argue that Dennett is a goal "reductionist," by which I think they mean something like "goal is the word we use to refer to certain patterns of behavior, but it's not more fundamental than that."

But I don't think Dennett believes this. He's pretty critical of behaviorism, for instance, and his essay Skinner Skinned does a good job, imo, of showing why this orientation is misguided. Dennett believes, I think, that things like "goals," "beliefs," "desires," etc. do exist, just that we haven't found the mechanistic or scientific explanation of them yet. But he doesn't think that explanations of intention will necessarily bottom out in just their outward behavior; he expects such explanations to make reference to internal states as well. Dennett is a materialist, so of course at the end of the day all explanations will be in terms of behavior (inward or outward), on some level, much like any physical explanation is. But that's a pretty different claim from "mental states do not exist." 

I'm also not sure if you're making that claim here or not, but curious if you disagree with the above? 

Comment by aysja on The first future and the best future · 2024-04-27T05:44:12.405Z · LW · GW

I don't know what Katja thinks, but for me at least: I think AI might pose much more lock-in than other technologies. I.e., I expect that we'll have much less of a chance (and perhaps much less time) to redirect course, adapt, learn from trial and error, etc. than we typically do with a new technology. Given this, I think going slower and aiming to get it right on the first try is much more important than it normally is.  

Comment by aysja on Paul Christiano named as US AI Safety Institute Head of AI Safety · 2024-04-25T23:13:40.461Z · LW · GW

I agree there are other problems the EA biosecurity community focuses on, but surely lab escapes are one of those problems, and part of the reason we need biosecurity measures? In any case, this disagreement seems beside the main point that I took Adam to be making, namely that the track record for defining appropriate units of risk for poorly understood, high-attack-surface domains is quite bad (as with BSL). This still seems true to me.

Comment by aysja on Daniel Dennett has died (1942-2024) · 2024-04-23T00:40:08.712Z · LW · GW

Dennett meant a lot to me, in part because he’s shaped my thinking so much, and in part because I think we share a kindred spirit—this ardent curiosity about minds and how they might come to exist in a world like ours. I also think he is an unusually skilled thinker and writer in many respects, as well as being an exceptionally delightful human. I miss him. 

In particular, I found his deep and persistent curiosity beautiful and inspiring, especially since it’s aimed at all the (imo) important questions. He has a clarity of thought which manages to be both soft and precise, and a robust ability to detect and avoid bullshit. His book Intuition Pumps and Other Tools for Thinking is full of helpful cognitive strategies, many of which I’ve benefited from, and many of which have parallels in the Sequences. You can just tell that he’s someone in love with minds and the art of thinking, and that he’s actually trying at it.  

But perhaps the thing I find most inspiring about him, the bit which I most want to emulate, is that he doesn’t shy away from the difficult questions—consciousness, intentionality, what real patterns are, how we can tell if a system understands something, etc—but he does so without any lapse in intellectual rigor. He’s always aiming at operationalization and gears-level understanding, but he’s careful to check whether mechanistic models in fact correspond to the higher level he’s attempting to address. He doesn’t let things be explained away, but he also doesn’t let things remain mysterious. He’s deeply committed to a materialistic understanding of the world which permits of minds. 

In short, he holds the same mysteries that I do, I think, of how thinking things could come to exist in a world made out of atoms, and he’s committed, as I am, to naturalizing such mysteries in a satisfying way. 

He’s also very clear about the role of philosophy in science: it’s the process of figuring out what the right questions even are, such that one can apply the tools of science to answer them. I think he’s right, both that this is the role of good philosophy and that we’re all pretty confused about what the right questions of mind are. I think he did an excellent job of narrowing the confusion, which is a really fucking cool and admirable thing to spend a life on. But the work isn’t done. In many ways, I view my research as picking up where he left off—the quest for a satisfying account of minds in a materialistic, deterministic world. Now that he’s passed, I realize that I really wanted him to see that. I wanted to show him my work. I feel like part of the way I was connected to the world has been severed, and I am feeling grief about that. 

I’ve learned so much from Dennett. How to think better, how to hold my curiosity better, how to love the mind, and how to wonder productively about it. I feel like the world glows dimmer now than it did before, and I feel that grief—the blinking out of this beautiful light. But it is also a good time to reflect on all that he’s done for the world, and all that he’s done for me. He is really a part of me, and I feel the love and the gratitude for what he’s brought into my life. 

Comment by aysja on Express interest in an "FHI of the West" · 2024-04-20T01:27:18.271Z · LW · GW

Aw man, this is so exciting! There’s something really important to me about rationalist virtues having a home in the world. I’m not sure if what I’m imagining is what you’re proposing, exactly, but I think most anything in this vicinity would feel like a huge world upgrade to me.

Apparently I have a lot of thoughts about this. Here are some of them, not sure how applicable they are to this project in particular. I think you can consider this to be my hopes for what such a thing might be like, which I suspect shares some overlap.

It has felt to me for a few years now like something important is dying. I think it stems from the seeming inevitability of what’s before us—the speed of AI progress, our own death, the death of perhaps everything—that looms, shadow-like. And it’s scary to me, and sad, because “inevitability” is a close cousin of “defeat,” and I fear the two inch closer all the time.   

It’s a fatalism that creeps in slow, but settles thick. And it lurks, I think, in the emotional tenor of doom that resides beneath nominally probabilistic estimates of our survival. Lurks as well, although much more plainly, within AI labs: AGI is coming whether we want it to or not, pausing is impossible, the invisible hand holds the reins, or as Claude recently explained to me, “the cat is already out of the bag.” And I think this is sometimes intentional—we are supposed to think about labs in terms of the overwhelming incentives, more than we are supposed to think about them as composed of agents with real choice, because that dispossesses them of responsibility, and dispossesses us of the ability to change them.

There is a similar kind of fatalism that often attaches to the idea of the efficient marketplace—that what is desired has already been done, that if one sits back and lets the machine unfold it will arrive at all the correct conclusions itself. There is no room, in that story, for genuinely novel ideas or progress, all forward movement is the result of incremental accretions on existing structures. This sentiment looms in academia as well—that there is nothing fundamental or new left to uncover, that all low hanging fruit has been plucked. Academic aims rarely push for all that could be—progress is instead judged relatively, the slow inching away from what already is. 

And I worry this mentality is increasingly entrenching itself within AI safety, too. That we are moving away from the sort of ambitious science that I think we need to achieve the world that glows—the sort that aims at absolute progress—and instead moving closer to an incremental machine. After all, MIRI tried and failed to develop agent foundations so maybe we can say, “case closed?” Maybe “solving alignment” was never the right frame in the first place. Maybe it always was that we needed to do the slow inching away from the known, the work that just so happens not to challenge existing social structures. There seems to me, in other words, to be a consensus closing in: new theoretical insights are unlikely to emerge, let alone to have any real impact on engineering. And unlikelier, still, to happen in time. 

I find all of this fatalism terribly confused. Not only because it has, I think, caused people to increasingly depart from the theoretical work which I believe is necessary to reach the world that glows, but because it robs us of our agency. The closer one inches towards inevitability, the further one inches away from the human spirit having any causal effect in the world. What we believe is irrelevant, what is good and right is irrelevant; the grooves have been worn, the structures erected—all that’s left is for the world to follow course. We cannot simply ask people to do what’s right, because they apparently can’t. We cannot succeed at stopping what is wrong, because the incentives are too strong to be opposed. All we can do, it seems, is to meld with the structure itself, making minor adjustments on the margin.  

And there’s a feeling I get, sometimes, when I look at all of this, as if a tidal wave were about to engulf me. The machine has a life of its own; the world is moved by forces outside of my control. And it scares me, and I feel small. But then I remember that it’s wrong. 

There was a real death, I think, that happened when MIRI leadership gave up on solving alignment, but we haven’t yet held the funeral. I think people carry that—the shadow of the fear, unnamed but tangible: that we might be racing towards our inevitable death, that there might not be much hope, that the grooves have been worn, the structures erected, and all that’s left is to give ourselves away as we watch it all unravel. It’s not a particularly inspiring vision, and in my opinion, not a particularly correct one. The future is built out of our choices; they matter, they are real. Not because it would be nice to believe it, but because it is macroscopically true. If one glances at history, it’s obvious that ideas are powerful, that people are powerful. The incentives do not dictate everything, the status quo is never the status quo for very long. The future is still ours to decide. And it’s our responsibility to do so with integrity. 

I have a sense that this spirit has been slipping, with MIRI leadership largely admitting defeat, with CFAR mostly leaving the scene, with AI labs looming increasingly large within the culture and the discourse. I don’t want it to. I want someone to hold the torch of rationality and all its virtues, to stay anchored on what is true and good amidst a landscape of rapidly changing power dynamics, to fight for what’s right with integrity, to hold a positive vision for humanity. I want a space for deep inquiry and intellectual rigor, for aiming at absolute progress, for trying to solve the god damn problem. I think Lightcone has a good shot at doing a fantastic job of bringing something like this to life, and I’m very excited to see what comes of this!

Comment by aysja on Express interest in an "FHI of the West" · 2024-04-19T21:52:14.680Z · LW · GW

Huh, I feel confused. I suppose we just have different impressions. Like, I would say that Oliver is exceedingly good at cutting through the bullshit. E.g., I consider his reasoning around shutting down the Lightcone offices to be of this type, in that it felt like a very straightforward document of important considerations, some of which I imagine were socially and/or politically costly to make. One way to say that is that I think Oliver is very high integrity, and I think this helps with bullshit detection: it's hard to see whether things cut to the core unless you deeply care about the core yourself. In any case, I think this skill carries over to object-level research, e.g., he often seems, to me, to ask cutting-to-the-core type questions there, too. I also think he's great at argument: legible reasoning, identifying the important cruxes in conversations, etc., all of which makes it easier to tell the bullshit from the not. 

I do not think of Oliver as being afraid to be disagreeable, and ime he gets to the heart of things quite quickly, so much so that I found him quite startling to interact with when we first met. And although I have some disagreements over Oliver's past walled-garden taste, from my perspective it's getting better, and I am increasingly excited about him being at the helm of a project such as this. Not sure what to say about his beacon-ness, but I do think that many people respect Oliver, Lightcone, and rationality culture more generally; I wouldn't be that surprised if there were an initial group of independent researcher types who were down and excited for this project as is. 

Comment by aysja on Transformers Represent Belief State Geometry in their Residual Stream · 2024-04-18T05:17:57.197Z · LW · GW

This is very cool! I’m excited to see where it goes :)

A couple questions (mostly me grappling with what the implications of this work might be):

  • Given a dataset of sequences of tokens, how do you find the HMM that could have generated it, and can this be done automatically? Also, is the mapping from dataset to HMM unique?
  • This question is possibly more confused on my end, sorry if so. I’m trying to get at something like “how interpretable will these simplexes be with much larger models?” Like, if I’m imagining that each state is a single token, and the HMM is capable of generating the totality of data the model sees, then I’m imagining something quite unwieldy, i.e., something with about the amount of complexity and interpretability as, e.g., the signaling cascade networks in a cell. Is this imagination wrong? Or is it more like, you start with this unwieldy structure (but which has some nice properties nonetheless), and then from there you try to make the initial structure more parse-able? Maybe a more straightforward way to ask: you say you’re interested in formalizing things like situational awareness with these tools—how might that work?
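To make the first question concrete, here is a minimal sketch of what "an HMM generating a token dataset" means—the states, tokens, and probabilities below are entirely made up for illustration. The inference problem the question is asking about would be the reverse direction: recovering something like `TRANSITIONS` and `EMISSIONS` (and deciding whether they are unique) given only samples like the one this produces.

```python
import random

# A toy 2-state HMM. Each hidden state has a distribution over
# next hidden states, and a distribution over emitted tokens.
TRANSITIONS = {          # P(next_state | state)
    "S0": {"S0": 0.9, "S1": 0.1},
    "S1": {"S0": 0.5, "S1": 0.5},
}
EMISSIONS = {            # P(token | state)
    "S0": {"a": 0.8, "b": 0.2},
    "S1": {"a": 0.1, "b": 0.9},
}

def sample(dist, rng):
    """Draw a key from a {key: probability} dict."""
    r = rng.random()
    cum = 0.0
    for key, p in dist.items():
        cum += p
        if r < cum:
            return key
    return key  # guard against floating-point rounding

def generate(n_tokens, rng=None, state="S0"):
    """Emit a token sequence by walking the hidden Markov chain.

    Note: an observer sees only the tokens, never `state`.
    """
    rng = rng or random.Random(0)
    tokens = []
    for _ in range(n_tokens):
        tokens.append(sample(EMISSIONS[state], rng))
        state = sample(TRANSITIONS[state], rng)
    return tokens

print("".join(generate(40)))
```

A predictor trained on such sequences has to track a belief distribution over the hidden state (here, just P(S0) vs. P(S1)) to predict the next token optimally, which is the belief-state geometry the post finds in the residual stream.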
Comment by aysja on [deleted post] 2024-04-09T09:17:56.495Z

Something feels very off to me about these kinds of speciesist arguments. Like the circle of moral concern hasn’t expanded, but imploded, rooting out the very center from which it grew. Yes, there is a sense in which valuing what I value is arbitrary and selfish, but concluding that I should completely forego what I value seems pretty alarming to me, and I would assume, to most other humans who currently exist.

Comment by aysja on Alexander Gietelink Oldenziel's Shortform · 2024-03-29T09:35:07.084Z · LW · GW

I guess I'm not sure what you mean by "most scientific progress," and I'm missing some of the history here, but my sense is that importance-weighted science happens proportionally more outside of academia. E.g., Einstein did his miracle year outside of academia (and later stated that he wouldn't have been able to do it, had he succeeded at getting an academic position), Darwin figured out natural selection, and Carnot figured out the Carnot cycle, all mostly on their own, outside of academia. Those are three major scientists who arguably started entire fields (quantum mechanics, evolutionary biology, and thermodynamics). I would anti-predict that future scientific progress, of the field-founding sort, comes primarily from people at prestigious universities, since they, imo, typically have some of the most intense gatekeeping dynamics which make it harder to have original thoughts. 

Comment by aysja on Natural Latents: The Concepts · 2024-03-21T18:39:02.371Z · LW · GW

I don’t see how the cluster argument resolves the circularity problem. 

The circularity problem, as I see it, is that your definition of an abstraction shouldn’t be dependent on already having the abstraction. I.e., if the only way to define the abstraction “dog” involves you already knowing the abstraction “dog” well enough to create the set of all dogs, then probably you’re missing some of the explanation for abstraction. But the clusters in thingspace argument also depends on having an abstraction—knowing to look for genomes, or fur, or bark, is dependent on us already understanding what dogs are like. After all, there are nearly infinite “axes” one could look at, but we already know to only consider some of them. In other words, it seems like this has just passed the buck from choice of object to choice of properties, but you’re still making that choice based on the abstraction. 

The fact that choice of axis—from among the axes we already know to be relevant—is stable (i.e., creates the same clusterings) feels like a central and interesting point about abstractions. But it doesn’t seem like it resolves the circularity problem. 

(In retrospect the rest of this comment is thinking-out-loud for myself, mostly :p but you might find it interesting nonetheless). 

I think it’s hard to completely escape this problem—we need to use some of our own concepts when understanding the territory, as we can’t see it directly—but I do think it’s possible to get a bit more objective than this. E.g., I consider thermodynamics/stat mech to be pretty centrally about abstractions, but it approaches them in a way that feels more “territory first,” if that makes any sense. Like, it doesn’t start with the conclusion. It started with the observation that “heat moves stuff” and “what’s up with that” and then eventually landed with an analysis of entropy involving macrostates. Somehow that progression feels more natural to me than starting with “dogs are things” and working backwards. E.g., I think I’m wanting something more like “if we understand these basic facts about the world, we can talk about dogs” rather than “if we start with dogs, we can talk sensibly about dogs.” 

To be clear, I consider some of your work to be addressing this. E.g., I think the telephone theorem is a pretty important step in this direction. Much of the stuff about redundancy and modularity feels pretty tip-of-the-tongue onto something important, to me. But, at the very least, my goal with understanding abstractions is something like “how do we understand the world such that abstractions are natural kinds”? How do we find the joints such that, conditioning on those, there isn’t much room to vary? What are those joints like? The reason I like the telephone theorem is that it gives me one such handle: all else equal, information will dissipate quickly—anytime you see information persisting, it’s evidence of abstraction. 

My own sense is that answering this question will have a lot more to do with how useful abstractions are, rather than how predictive/descriptive they are, which are related questions, but not quite the same. E.g., with the gears example you use to illustrate redundancy, I think the fact that we can predict almost everything about the gear from understanding a part of it is the same reason why the gear is useful. You don’t have to manipulate every atom in the gear to get it to move, you only have to press down on one of the… spokes(?), and the entire thing will turn. These are related properties. But they are not the same. E.g., you can think about the word “stop” as an abstraction in the sense that many sound waves map to the same “concept,” but that’s not very related to why the sound wave is so useful. It’s useful because it fits into the structure of the world: other minds will do things in response to it.

I want better ways to talk about how agents get work out of their environments by leveraging abstractions. I think this is the reason we ultimately care about them ourselves; and why AI will too. I also think it’s a big part of how we should be defining them—that the natural joint is less “what are the aggregate statistics of this set” but more “what does having this information allow us to do”? 

Comment by aysja on Vote on Anthropic Topics to Discuss · 2024-03-18T23:06:45.566Z · LW · GW

I think it’s pretty unlikely that Anthropic’s murky strategy is good. 

In particular, I think that balancing building AGI with building AGI safely only goes well for humanity in a pretty narrow range of worlds. Like, if safety is relatively easy and can roughly keep pace with capabilities, then I think this sort of thing might make sense. But the more the expected world departs from this—the more that you might expect safety to be way behind capabilities, and the more you might expect that it’s hard to notice just how big that gap is and/or how much of a threat capabilities pose—the more this strategy starts seeming pretty worrying to me.  

It’s worrying because I don’t imagine Anthropic gets that many “shots” at playing safety cards, so to speak. Like, implementing RSPs and trying to influence norms is one thing, but what about if they notice something actually-maybe-dangerous-but-they’re-not-sure as they’re building? Now they’re in this position where if they want to be really careful (e.g., taking costly actions like: stop indefinitely until they’re beyond reasonable doubt that it’s safe) they’re most likely kind of screwing their investors, and should probably expect to get less funding in the future. And the more likely it is, from their perspective, that the behavior in question does end up being a false alarm, the more pressure there is to not do due diligence. 

But the problem is that the more ambiguous the situation is—the less we understand about these systems—the less sure we can be about whether any given behavior is or isn’t an indication of something pretty dangerous. And the current situation seems pretty ambiguous to me. I don’t think anyone knows, for instance, whether Claude 3 seeming to notice it’s being tested is something to worry about or not. Probably it isn’t. But really, how do we know? We’re going off of mostly behavioral cues and making educated guesses about what the behavior implies. But that really isn’t very reassuring when we’re building something much smarter than us, with potentially catastrophic consequences. As it stands, I don’t believe we can even assign numbers to things in a very meaningful sense, let alone assign confidence above a remotely acceptable threshold, i.e., some 9’s of assurance that what we’re about to embark on won’t kill everyone.    

The combination of how much uncertainty there is in evaluating these systems, and how much pressure there is for Anthropic to keep scaling seems very worrying to me. Like, if there’s a very obvious sign that a system is dangerous, then I believe Anthropic might be in a good position to pause and “sound the alarm.” But if things remain kind of ambiguous due to our lack of understanding, as they seem to me now, then I’m way less optimistic that the outcome of any maybe-dangerous-but-we’re-not-sure behavior is that Anthropic meaningfully and safely addresses it. In other words, I think that given our current state of understanding, the murky strategy favors “build AGI” more than it does “build AGI safely” and that scares me. 

I also think the prior should be quite strong, here, that the obvious incentives will have the obvious effects. Like, creating AGI is desirable (so long as it doesn’t kill everyone and so on). Not only on the “loads of money” axis, but also along other axes monkeys care about: prestige, status, influence, power, etc. Yes, practically no one wants to die, and I don’t doubt that many people at Anthropic genuinely care and are worried about this. But, also, it really seems like you should a priori expect that with stakes this high, cognition will get distorted around whether or not to pursue the stakes. Maybe all Anthropic staff are perfectly equipped to be epistemically sane in such an environment, but I don’t think that one should on priors expect it. People get convinced of all kinds of things when they have a lot to gain, or a lot to lose. 

Anyway, it seems likely to me that we will continue to live in the world where we don’t understand these systems well enough to be confident in our evaluations of them, and I assign pretty significant probability to the worlds where capabilities far outstrip our alignment techniques, so I am currently not thrilled that Anthropic exists. I expect that their murky strategy is net bad for humanity, given how the landscape currently looks. 

Maybe you really do need to iterate on frontier AI to do meaningful safety work.

This seems like an open question that, to my mind, Anthropic has not fully explored. One way that I sometimes think about this is to ask: if Anthropic were the only leading AI lab, with no possibility of anyone catching up any time soon, should they still be scaling as fast as they are? My guess is no. Like, of course the safety benefit of scaling is not zero. But it’s a question of whether the benefits outweigh the costs. Given how little we understand these systems, I’d be surprised if we were anywhere near hitting diminishing safety returns—as in, I don’t think the safety benefits of scaling vastly outstrip the benefit we might expect out of empirical work on current systems. And I think the potential cost of scaling as recklessly as we currently are is extinction. I don’t doubt that at some point scaling will be necessary and important for safety; I do doubt that the time for that is now.

Maybe you do need to stay on the frontier because the world is accelerating whether Anthropic wants it to or not.

It really feels like if you create an organization which, with some unsettlingly large probability, might directly lead to the extinction of humanity, then you’re doing something wrong. Especially so if the people you’re making the decisions for (i.e., everyone) would—if they fully understood the risks involved—be unhappy about it on reflection. Like, I’m pretty sure that the sentence from Anthropic’s pitch deck, “these models could begin to automate large portions of the economy,” is already enough for many people to be pretty upset. But if they learned that Anthropic also assigned ~33% to a “pessimistic world” which includes the possibility of extinction, then I expect most people would rightly be pretty furious. I think making decisions for people in a way that they would predictably be upset about is unethical, and it doesn’t become okay just because other people would do it anyway.

In any case, I think that Anthropic’s existence has hastened race dynamics, and I think that makes our chances of survival lower. That seems pretty in line with what to expect from this kind of strategy (i.e., that it cashes out to scaling coming before safety where it’s non-obvious what to do), and I think it makes sense to expect things of this type going forward (e.g., I am personally pretty skeptical that Anthropic is going to ~meaningfully pause development unless it’s glaringly obvious that they should do so, at which point I think we’re clearly in a pretty bad situation). And although OpenAI was never claiming as much of a safety vibe as Anthropic currently is, I still think the track record of ambiguous strategies which play to both sides does not inspire that much confidence about Anthropic’s trajectory. 

Does Dario-and-other-leadership have good models of x-risk?

I am worried about this. My read on the situation is that Dario is basically expecting something more like a tool than an agent. Broadly, I get this sense because when I model Anthropic as operating under the assumption that risks mostly stem from misuse, their actions make a lot more sense to me. But also things like this quote seem consistent with that: “I suspect that it may roughly work to think of the model as if it's trained in the normal way, just getting to above human level, it may be a reasonable assumption… that the internal structure of the model is not intentionally optimizing against us.” (Dario on the Dwarkesh podcast). If true, this makes me worried about the choices that Dario is going to make when, again, it’s not clear how to interpret the behavior of these systems. In particular, it makes me worried he’s going to err on the side of “this is probably fine,” since tools seem, all else equal, less dangerous than agents. Dario isn’t the only person Anthropic’s decisions depend on; still, I think his beliefs have a large bearing on what Anthropic does.

But, the way I wish the conversation was playing out was less like "did Anthropic say a particular misleading thing?"

I think it’s pretty important to call attention to misleading things. Both because there is some chance that public focus on inconsistencies might cause them to change their behavior, and because pointing out specific problems in public arenas often causes evidence to come forward in one common space, and then everyone can gain a clearer understanding of what’s going on. 

Comment by aysja on On Claude 3.0 · 2024-03-07T21:59:38.382Z · LW · GW

Things like their RSP rely on being upheld in spirit, not only in letter.

This is something I’m worried about. I think that Anthropic’s current RSP is vague and/or undefined on many crucial points. For instance, I feel pretty nervous about Anthropic’s proposed response to an evaluation threshold triggering. One of the first steps is that they will “conduct a thorough analysis to determine whether the evaluation was overly conservative,” without describing what this “thorough analysis” is, nor who is doing it. 

In other words, they will undertake some currently undefined process involving undefined people to decide whether it was a false alarm. Given how much is riding on this decision—like, you know, all of the potential profit they’d be losing if they concluded that the model was in fact dangerous—it seems pretty important to be clear about how these things will be resolved. 

Instituting a policy like this is only helpful insomuch as it meaningfully constrains the company’s behavior. But when the responses to evaluations are this loosely and vaguely defined, it’s hard for me to trust that the RSP cashes out to more than a vague hope that Anthropic will be careful. It would be nice to feel like the Long Term Benefit Trust provided some kind of assurance against this. But even that seems difficult to trust when they’ve added “failsafe provisions” that allow a “sufficiently large” supermajority of stockholders to make changes to the Trust’s powers (without the Trustees’ consent), and without saying what counts as “sufficiently large.”

Comment by aysja on Anthropic release Claude 3, claims >GPT-4 Performance · 2024-03-07T00:20:29.551Z · LW · GW

It seems plausible that this scenario could happen, i.e., that Anthropic and OpenAI end up in a stable two-player oligopoly. But I would still be pretty surprised if Anthropic's pitch to investors, when asking for billions of dollars in funding, is that they pre-commit to never release a substantially better product than their main competitor. 

Comment by aysja on Anthropic release Claude 3, claims >GPT-4 Performance · 2024-03-07T00:05:45.089Z · LW · GW

I agree that this is a plausible read of their pitch to investors, but I do think it’s a bit of a stretch to consider it the most likely explanation. It’s hard for me to believe that Anthropic would receive billions of dollars in funding if they were explicitly telling investors that they’re committing to only release equivalent or inferior products relative to their main competitor.

Comment by aysja on Vote on Anthropic Topics to Discuss · 2024-03-06T21:21:07.742Z · LW · GW

I assign >50% that Anthropic will at some point pause development for at least six months as a result of safety evaluations. 

Comment by aysja on Vote on Anthropic Topics to Discuss · 2024-03-06T21:07:41.957Z · LW · GW

I believed, prior to the Claude 3 release, that Anthropic had committed to not meaningfully push the frontier. 

Comment by aysja on Vote on Anthropic Topics to Discuss · 2024-03-06T21:07:30.383Z · LW · GW

I believed, prior to the Claude 3 release, that Anthropic had implied they were not going to meaningfully push the frontier.

Comment by aysja on Vote on Anthropic Topics to Discuss · 2024-03-06T20:54:38.652Z · LW · GW

I currently believe that Anthropic is planning to meaningfully push the frontier. 

Comment by aysja on Vote on Anthropic Topics to Discuss · 2024-03-06T20:54:16.039Z · LW · GW

I currently believe that Anthropic previously committed to not meaningfully push the frontier.

Comment by aysja on Vote on Anthropic Topics to Discuss · 2024-03-06T20:53:12.681Z · LW · GW

I assign >10% that Anthropic will at some point completely halt development of AI, and attempt to persuade other organizations to as well (i.e., “sound the alarm.”)

Comment by aysja on Vote on Anthropic Topics to Discuss · 2024-03-06T20:52:34.353Z · LW · GW

I assign >10% that Anthropic will at some point pause development for at least a year as a result of safety evaluations.

Comment by aysja on Vote on Anthropic Topics to Discuss · 2024-03-06T20:52:17.430Z · LW · GW

I assign >10% that Anthropic will at some point pause development for at least six months as a result of safety evaluations. 

Comment by aysja on Vote on Anthropic Topics to Discuss · 2024-03-06T20:52:01.193Z · LW · GW

I assign >10% that Anthropic will at some point pause development as a result of safety evaluations. 

Comment by aysja on Anthropic release Claude 3, claims >GPT-4 Performance · 2024-03-06T01:40:34.169Z · LW · GW

I interpreted you, in your previous comment, as claiming that Anthropic’s RSP is explicit in its compatibility with meaningfully pushing the frontier. Dustin is under the impression that Anthropic verbally committed otherwise. Whether or not Claude 3 pushed the frontier seems somewhat orthogonal to this question—did Anthropic commit and/or heavily imply that they weren’t going to push the frontier, and if so, does the RSP quietly contradict that commitment? My current read is that the answer to both questions is yes. If this is the case, I think that Anthropic has been pretty misleading about a crucial part of their safety plan, and this seems quite bad to me.

Comment by aysja on Anthropic release Claude 3, claims >GPT-4 Performance · 2024-03-06T01:38:05.272Z · LW · GW

If one of the effects of instituting a responsible scaling policy was that Anthropic moved from the stance of not meaningfully pushing the frontier to “it’s okay to push the frontier so long as we deem it safe,” this seems like a pretty important shift that was not well communicated. I, for one, did not interpret Anthropic’s RSP as a statement that they were now okay with advancing the state of the art, nor did many others; I think that’s because the RSP did not make it clear that they were updating this position. Like, with hindsight I can see how the language in the RSP is consistent with pushing the frontier. But I think the language is also consistent with not pushing it. E.g., when I was operating under the assumption that Anthropic had committed to this, I interpreted the RSP as saying “we’re aiming to scale responsibly to the extent that we scale at all, which will remain at or behind the frontier.”

Attempting to be forthright about this would, imo, look like a clear explanation of Anthropic’s previous stance relative to the new one they were adopting, and their reasons for doing so. To the extent that they didn’t feel the need to do this, I worry that it’s because their previous stance was more of a vibe, and therefore non-binding. But if they were using that vibe to gain resources (funding, talent, etc.), then this seems quite bad to me. It shouldn’t both be the case that they benefit from ambiguity but then aren’t held to any of the consequences of breaking it. And indeed, this makes me pretty wary of other trust/deferral based support that people currently give to Anthropic. I think that if the RSP in fact indicates a departure from their previous stance of not meaningfully pushing the frontier, then this is a negative update about Anthropic holding to the spirit of their commitments. 

Comment by aysja on Anthropic release Claude 3, claims >GPT-4 Performance · 2024-03-05T00:39:26.277Z · LW · GW

Several people have pointed out that this post seems to take a different stance on race dynamics than was expressed previously.

I think it clearly does. From my perspective, Anthropic's post is misleading either way—either Claude 3 doesn’t outperform its peers, in which case claiming otherwise is misleading, or they are in fact pushing the frontier, in which case they’ve misled people by suggesting that they would not do this. 

Also, “We do not believe that model intelligence is anywhere near its limits, and we plan to release frequent updates to the Claude 3 model family over the next few months” doesn’t inspire much confidence that they’re not trying to surpass other models in the near future. 

In any case, I don’t see much reason to think that Anthropic is not aiming to push the frontier. For one, to the best of my knowledge they’ve never even publicly stated they wouldn’t; to the extent that people believe it anyway, it is, as best I can tell, mostly just through word of mouth and some vague statements from Dario. Second, it’s hard for me to imagine that they’re pitching investors on a plan that explicitly aims to make an inferior product relative to their competitors. Indeed, their leaked pitch deck suggests otherwise: “We believe that companies that train the best 2025/26 models will be too far ahead for anyone to catch up in subsequent cycles.” I think the most straightforward interpretation of this sentence is that Anthropic is racing to build AGI.

And if they are indeed pushing the frontier, this seems like a negative update about them holding to other commitments about safety. Because while it’s true that Anthropic never, to the best of my knowledge, explicitly stated that they wouldn’t do so, they nevertheless appeared to me to strongly imply it. E.g., in his podcast with Dwarkesh, Dario says: 

I think we've been relatively responsible in the sense that we didn't cause the big acceleration that happened late last year and at the beginning of this year. We weren't the ones who did that. And honestly, if you look at the reaction of Google, that might be ten times more important than anything else. And then once it had happened, once the ecosystem had changed, then we did a lot of things to stay on the frontier. 

And Dario on an FLI podcast:

I think we shouldn't be racing ahead or trying to build models that are way bigger than other orgs are building them. And we shouldn't, I think, be trying to ramp up excitement or hype about giant models or the latest advances. But we should build the things that we need to do the safety work and we should try to do the safety work as well as we can on top of models that are reasonably close to state of the art. 

None of this is Dario saying that Anthropic won’t try to push the frontier, but it certainly heavily suggests that they are aiming to remain at least slightly behind it. And indeed, my impression is that many people expected this from Anthropic, including people who work there, which seems like evidence that this was the implied message. 

If Anthropic is in fact attempting to push the frontier, then I think this is pretty bad. They shouldn’t be this vague and misleading about something this important, especially in a way that caused many people to socially support them (and perhaps make decisions to work there). I perhaps cynically think this vagueness was intentional—it seems implausible to me that Anthropic did not know that people believed this, yet they never tried to correct it, which I would guess benefited them: safety-conscious engineers are more likely to work somewhere that they believe isn’t racing to build AGI. Hopefully I’m wrong about at least some of this.

In any case, whether or not Claude 3 already surpasses the frontier, soon will, or doesn’t, I request that Anthropic explicitly clarify whether their intention is to push the frontier.

Comment by aysja on Why does generalization work? · 2024-02-21T00:26:24.179Z · LW · GW

I think dust theory is wrong in the most permissive sense: there are physical constraints on what computations (and abstractions) can be like. The most obvious one is “things that are not in each other’s lightcones can’t interact,” and interaction is necessary for computation (setting aside acausal trades and stuff, which I think are still causal in a relevant sense, but don't want to get into rn). But there are also things like: information degrades over distance (roughly the number of interactions, i.e., telephone theorem), and so you'd expect "large" computations to take a certain shape, i.e., to have a structure which supports this long-range communication, such as e.g., wires.

More than that, though, I think if you disrespect the natural ordering of the environment you end up paying thermodynamic costs. Like, if you take the spectrum of visible light, ordered from ~400 to 800 nm and you just randomly pick wavelengths and assign them to colors arbitrarily (e.g., "red" is wavelengths 505, 780, 402, etc.), then you have to pay more cost to encode the color. Because, imo, the whole point of abstractions is that they're strategically imprecise. I don't have to model the exact wavelengths of the color red, it's whatever is in the range ~600-800, and I can rely on averages to encode that well enough. But if red is wavelengths 505, 780, 402, etc., now averages won't help, and I need to make more precise measurements. Precision is costly: it uses more bits, and bits have physical cost (e.g., Landauer's limit). 
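To make the description-length point concrete, here's a toy sketch of the encoding cost (the integer-nanometer grid and the specific bit counts are my own illustrative assumptions, not anything claimed above):

```python
import math

# Toy model: visible wavelengths on an integer-nanometer grid, 400-800 nm.
N_WAVELENGTHS = 401
BITS_PER_WAVELENGTH = math.ceil(math.log2(N_WAVELENGTHS))  # 9 bits to name one wavelength

def bits_for_interval():
    # A contiguous band like "red = 600-800 nm" is pinned down by just two endpoints.
    return 2 * BITS_PER_WAVELENGTH

def bits_for_arbitrary_set(members):
    # A scattered category ("red = 505, 780, 402, ...") has to list every member.
    return len(members) * BITS_PER_WAVELENGTH

print(bits_for_interval())                                # 18 bits, regardless of band width
print(bits_for_arbitrary_set([505, 780, 402, 610, 444]))  # 45 bits, growing with each member
```

The asymmetry only gets worse as the scattered category grows, which is (roughly) the sense in which the territory's contiguity is what makes the abstraction cheap.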

I guess you could argue that someone else might go and see the light spectrum differently, i.e., what looks like wavelengths 505 vs 780 to us looks like wavelengths 505 vs 506 to them? But without a particular reason to think so, it seems like a general-purpose counterargument to me. You could always say that someone would see it differently—but why would they?

Comment by aysja on Dreams of AI alignment: The danger of suggestive names · 2024-02-16T09:02:22.414Z · LW · GW

I agree with you that people get sloppy with these terms, and this seems bad. But there’s something important to me about holding space for uncertainty, too. I think that we understand practically every term on this list exceedingly poorly. Yes, we can point to things in the world, and sometimes even the mechanisms underlying them, but we don’t know what we mean in any satisfyingly general way. E.g. “agency” does not seem well described to me as “trained by reinforcement learning.” I don’t really know what it is well described by, and that's the point. Pretending otherwise only precludes us from trying to describe it better.

I think there's a lot of room for improvement in how we understand minds, i.e., I expect science is possible here. So I feel wary of mental moves such as these, e.g., replacing “optimal” with “set of sequential actions which have subjectively maximal expected utility relative to [entity X]'s imputed beliefs,” as if that settled the matter. Because I think it gives a sense that we know what we’re talking about when I don’t think we do. Is a utility function the right way to model an agent? Can we reliably impute beliefs? How do we know we’re doing that right, or that when we say ‘belief’ it maps to something that is in fact like a belief? What is a belief? Why actions instead of world states? And so on.

It seems good to aim for precision and gears-level understanding wherever possible. But I don’t want this to convince us that we aren’t confused. Yes, we could replace the “tool versus agent” debate with things like “was it trained via RL or not,” or what have you, but it wouldn’t be very satisfying because ultimately that isn’t the thing we’re trying to point at. We don’t have good definitions of mind-type things yet, and I don’t want us to forget that. 

Comment by aysja on Critiques of the AI control agenda · 2024-02-15T23:41:27.589Z · LW · GW

At a high level I feel confused about how we get good control evaluations without having good capability evaluations. If we don't understand what the capability is like, how can we be sure that our red-teaming has caught everything? Where "sure" is more like "99.9%" rather than "70%." In general, the eval situation seems crazily offense-advantaged to me: if there's just one thing we haven't looked for that the AI can sneak by, we've lost. The only way this changes, imo, is if we can directly measure the capabilities themselves, and just check "is it there—yes or no?" And if we don't know how to do that, then I don't understand how we can feel confident in control strategies, either.

Comment by aysja on Natural abstractions are observer-dependent: a conversation with John Wentworth · 2024-02-12T23:28:54.915Z · LW · GW

we have some vague intuition that an abstraction like pressure will always be useful, because of some fundamental statistical property of reality (non-dependent on the macrostates we are trying to track), and that's not quite true.

I do actually think this is basically true. It seems to me that when people encounter that maps are not the territory—see that macrostates are relative to our perceptual machinery or what have you—they sometimes assume that this means the territory is arbitrarily permissive of abstractions. But that seems wrong to me: the territory constrains what sorts of things maps are like. The idea of natural abstractions, imo, is to point a bit better at what this “territory constrains the map” thing is. 

Like sure, you could make up some abstraction, some summary statistic like “the center point of America” which is just the point at which half of the population is on one side and half on the other (thanks to Dennett for this example). But that would be horrible, because it’s obviously not very joint-carvey. Where “joint carvy-ness” will end up being, I suspect, related to “gears that move the world,” i.e., the bits of the territory that can do surprisingly much, have surprisingly much reach, etc. (similar to the conserved information sense that John talks about). And I think that’s a territory property that minds pick up on, exploit, etc. That the directionality is shaped more like “territory to map,” rather than “map to territory.” 

Another way to say it is that if you sampled from the space of all minds (whatever that space, um, is), anything trying to model the world would very likely end up at the concept “pressure.” (Although I don’t love this definition because I think it ends up placing too much emphasis on maps, when really I think pressure is more like a territory object, much more so than, e.g., the center point of America is). 

There again I think the correct answer is the intentional stance: an agent is whatever is useful for me to model as intention-driven. 

I think the intentional stance is not the right answer here, and we should be happy it’s not, because it’s approximately the worst sort of knowledge possible. Not just behaviorist (i.e., not gears-level), but also subjective (relative to a map), and arbitrary (relative to my map). In any case, Dennett’s original intention with it was not to be the be-all end-all definition of agency. He was just trying to figure out where the “fact of the matter” resided. His conclusion: in the predictive strategy. Not in the agent itself, nor in the map, but in the interaction between the two.

But Dennett, like me, finds this unsatisfying. The real juice is in the question of why the intentional stance works so well. And the answer to that is, I think, almost entirely a territory question. What is it about the territory, such that this predictive strategy works so well? After all, if one analyzes the world through the logic of the intentional stance, then everything is defined relative to a predictive strategy: oranges, chairs, oceans, planets. And certainly, we have maps. But it seems to me that the way science has proceeded in the past is to treat such objects as “out there” in a fundamental way, and that this has fared pretty well so far. I don’t see much reason to abandon it when it comes to agents. 

I think a science of agency, to the extent it inherits the intentional stance, should focus not on defining agents this way, but on asking why it works at all. 

Comment by aysja on So8res's Shortform · 2024-02-10T22:50:00.864Z · LW · GW

It seems like this is only directionally better if it’s true, and this is still an open question for me. Like, I buy that some of the commitments around securing weights are true, and that seems good. I’m way less sure that companies will in fact pause development pending their assessment of evaluations. And to the extent that they are not, in a meaningful sense, planning to pause, this seems quite bad. It seems potentially worse, to me, to have a structure legitimizing this decision and making it seem more responsible than it is, rather than just openly doing a reckless thing. Not only because it seems dishonest, but also because unambiguous behavior is easier for people to point at and hence to understand, to stop, etc.

I don’t want to stomp on hope, but I’d also like labs not to stomp out existence. AI companies are risking people’s lives without their consent, far more so than is remotely acceptable—their own estimated risk of extinction/catastrophe is sometimes as high as 33%—and this seems unacceptable to me. They should absolutely be getting pushback if their commitments are not up to par. Doing relatively better is not what matters.

Comment by aysja on Leading The Parade · 2024-02-02T21:39:22.152Z · LW · GW

I do think that counterfactual impact is an important thing to track, although two people discovering something at the same time doesn’t seem like especially strong evidence that they were just "leading the parade." It matters how large the set is. I.e., I doubt there were more than ~5 people around Newton’s time who could have come up with calculus. Creating things is just really hard, and I think often a pretty conjunctive set of factors needs to come together to make it happen (some of those are dispositional (ambition, intelligence, etc.), others are more like “was the groundwater there,” and others are like “did they even notice there was something worth doing here in the first place” etc).

Another way to say it is that there’s a reason only two people discovered calculus at the same time, and not tens, or hundreds. Why just two? A similar thing happened with Darwin, where Wallace came up with natural selection around the same time (they actually initially published it together). But having read a bunch about Darwin and that time period, I feel fairly confident that they were the only two people “on the scent,” so to speak. Malthus influenced them both, as did living in England when the industrial revolution really took off (capitalism has a “survival of the fittest” vibe), so there was some groundwater there. But it was only these two who took that groundwater and did something powerful with it, and I don’t think there were that many other people around who could have. (One small piece of evidence of that effect: Origin of Species was published a year and a half after their initial publication, and no one else published anything on natural selection within that timespan, even after the initial idea was out there.)

Also, I mostly agree about Shannon being more independent, although I do think that Turing was “on the scent” of information theory as well. E.g., from The Information: “Turing cared about the data that changed the probability: a probability factor, something like the weight of the evidence. He invented a unit he named a ‘ban.’ He found it convenient to use a logarithmic scale, so that bans would be added rather than multiplied. With a base of ten, a ban was the weight of evidence needed to make a fact ten times as likely.” This seems, to me, to veer pretty close to information theory and I think this is fairly common: a few people “on the scent,” i.e., noticing that there’s something interesting to discover somewhere, having the right questions in the first place, etc.—but only one or two who actually put in the right kind of effort to complete the idea.
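As an aside on the quoted passage: the "ban" is just the base-10 logarithm of a likelihood ratio, which is exactly why independent pieces of evidence add rather than multiply. A minimal sketch (mine, not from The Information):

```python
import math

def bans(likelihood_ratio):
    # Turing's unit: log base 10 of the evidence ratio, so weights of evidence add.
    return math.log10(likelihood_ratio)

# One ban is the weight of evidence that makes a fact ten times as likely:
print(bans(10))              # 1.0
# Two independent observations combine by addition in log space:
print(bans(10) + bans(100))  # ≈ 3.0, the same as a single 1000:1 likelihood ratio
```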

There’s also something important to me about the opposite problem, which is how to assign blame when “someone else would have done it anyway.” E.g., as far as I can tell, much of Anthropic’s reasoning for why they’re not directly responsible for AI risk is because scaling is inevitable, i.e., that other labs would do it anyway. I don’t agree with them on the object-level claim (i.e., it seems possible to cause regulation to institute a pause), but even if I did, I still want to assign them blame for in fact being the ones taking the risky actions. This feels more true for me the fewer actors there are, i.e., at the point when there are only three big labs I think each of them is significantly contributing to risk, whereas if there were hundreds of leading labs I’d be less upset by any individual one. But there’s still a part of me that feels deontological about it, too—a sense that you’re just really not supposed to take actions that risky, no matter how inculpable you are counterfactually speaking.

Likewise, I have similar feelings about scientific discoveries. The people who did them are in fact the ones who did the work, and that matters to me. It matters more the smaller the set of possible people is, of course, but there’s some level upon which I want to be like “look they did an awesome thing here; it in fact wasn’t other people, and I want to assign them credit for that.” It’s related to a sense I have that doing great work is just really hard and that people perpetually underestimate this difficulty. For instance, people sometimes write off any good Musk has done (e.g., the good for climate change by creating Tesla, etc.) by saying “someone else would have made Tesla anyway” and I have to wonder, “really?” I certainly don’t look at the world and expect to see Teslas popping up everywhere. Likewise, I don’t look at the world and expect to see tons of leading AI labs, nor do I expect to see hundreds of people pushing the envelope on understanding what minds are. Few people try to do great things, and I think the set of people who might have done any particular great thing is often quite small.

Comment by aysja on The case for ensuring that powerful AIs are controlled · 2024-01-29T18:32:34.193Z · LW · GW

The incremental approach bakes in a few assumptions, namely that there likely won't be any sharp left turns, that the work you do on smaller systems will apply to the next ones you build, and so on. I think the problem is that we don't know that this will hold, and that there's reason to suspect it won't. And if we do in fact live in a world where our systems undergo sudden and extreme shifts in intelligence, then the iterative approach is imo very risky.

I think “solving alignment” is not exactly the right phrase, but I suspect it’s closer to what we want, here, than “thousands of people incrementally contributing.” When fields are thoroughly confused, thousands of people contributing mostly just amounts to more confusion, especially when everyone is working within a broken paradigm. And often in this situation—when new fields are needed—substantial scientific progress happens because of one or a few individuals. I don’t expect a single person to solve the entire problem by themselves, of course, but I do think it’s plausible that one or a few people will develop a robust underlying theory that marks substantial progress towards “solving alignment.”

Comment by aysja on Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI · 2024-01-26T18:46:29.430Z · LW · GW

Thanks for writing this up! I've been considering writing something in response to AI is easy to control for a while now, in particular arguing against their claim that "If we could observe and modify everything that’s going on in a human brain, we’d be able to use optimization algorithms to calculate the precise modifications to the synaptic weights which would cause a desired change in behavior." I think Section 4 does a good job of explaining why this probably isn't true, with the basic problem being that the space of behaviors consistent with the training data is larger than the space of behaviors you might "desire." 

Like, sure, if you have a mapping from synapses to desired behavior, okay—but the key word there is "desired" and at that point you're basically just describing having solved mechanistic interpretability. In the absence of knowing exactly how synapses/weights/etc map onto the desired behavior, you have to rely on the behavior in the training set to convey the right information. But a) it's hard to know that the desired behavior is "in" the training set in a very robust way and b) even if it were you might still run into problems like deception, not generalizing to out of distribution data, etc. Anyway, thanks for doing such a thorough write-up of it :) 
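A toy sketch of the underdetermination point in (a), using my own construction rather than anything from either post: two models can agree on every training example yet diverge arbitrarily off-distribution, so matching the training data alone does not pin down the "desired" behavior.

```python
# Toy illustration (hypothetical construction, not from either post):
# two polynomial "models" that fit the same training set exactly but
# disagree wildly away from it.
import numpy as np

x_train = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y_train = x_train ** 2  # the behavior we "desire" on the training set

# An exact degree-4 interpolant of the five training points.
base = np.poly1d(np.polyfit(x_train, y_train, deg=4))

# A second model: add a polynomial that vanishes at every training point,
# so the two models are indistinguishable on the training data.
bump = np.poly1d(np.poly(x_train))  # monic, with roots exactly at x_train
alt = base + 0.5 * bump

# Both models fit the training data essentially perfectly...
assert np.allclose(base(x_train), y_train)
assert np.allclose(alt(x_train), y_train)

# ...but disagree far off-distribution.
print(base(5.0), alt(5.0))  # roughly 25 vs. roughly 1285
```

The gap between `base` and `alt` is the gap between "behavior in the training set" and "desired behavior everywhere," which is why knowing the weights doesn't help without knowing which of the many data-consistent behaviors you actually wanted.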

Comment by aysja on "Does your paradigm beget new, good, paradigms?" · 2024-01-25T19:32:37.310Z · LW · GW

I think the guiding principle behind whether or not scientific work is good should probably look something more like “is this getting me closer to understanding what’s happening” where “understanding” is something like “my measurements track the thing in one-to-one lockstep with reality because I know the right typings and I’ve isolated the underlying causes well enough.”

AI control doesn’t seem like it’s making progress on that goal, which is certainly not to say it’s not important—it seems good to me to be putting some attention on locally useful things. Whereas the natural abstractions agenda does feel like progress on that front.

As an aside: I dislike basically all words about scientific progress at this point. I don’t feel like they’re precise enough and it seems easy to get satiated on them and lose track of what’s actually important which is, imo, absolute progress on the problem of understanding what the fuck is going on with minds. Calling this sort of work “science” risks lumping it in with every activity that happens in e.g., academia, and that isn’t right. Calling it “pre-paradigmatic” risks people writing it off as “Okay so people just sit around being confused for years? How could that be good?”

I wish we had better ways of talking about it. I think that more precisely articulating what our goals are with agent foundations/paradigmaticity/etc could be very helpful, not only for people pursuing it, but for others to even have a sense of what it might mean for field founding science to help in solving alignment. As it is, it seems to often get rounded off to “armchair philosophy” or “just being sort of perpetually confused” which seems bad.

Comment by aysja on My research agenda in agent foundations · 2024-01-25T00:47:41.466Z · LW · GW

I worry that overemphasizing fast feedback loops ends up making projects more myopic than is good for novel research. Like, unfortunately for everyone, a bunch of what makes good research is good research taste, and no one really understands what that is or how to get it and so it's tricky to design feedback loops to make it better. Like, I think the sense of "there's something interesting here that no one else is seeing" is sort of inherently hard to get feedback on, at least from other people. Because if you could, or if it were easy to explain from the get-go, then probably other people would have already thought of it. You can maybe get feedback from the world, but it often takes a while to turn that sense into something concrete. E.g., Einstein was already onto the idea that there was something strange about light and relativity from when he was 16, but he didn't have a shippable idea until about ten years later.

I don't think it always takes ten years, but deconfusion work is just... weird, e.g., Einstein was prone to bouts of "psychic tension," and would spend weeks in a "state of confusion." He wasn't part of academia, shared his thoughts with few people, and did very little experimentation. Which isn't to say that there was literally no feedback involved, there was, but it's a complicated story which I think unfortunately involves a bunch of feedback directly from his research taste (i.e., his taste was putting constraints like "unity," "logical," "physical meaning" on the hypothesis space). And this certainly isn't to say "let's not try to make it go faster," obviously, if one could make research faster that seems great. But I think it's a careful balancing act, and I worry that putting too much pressure on speed and legibility is going to end up causing people to do science under the streetlight. I really do not want this to happen. Field founding science is a bunch weirder than normal science, and I want to take care in giving research taste enough space to find its feet. 

Comment by aysja on My research agenda in agent foundations · 2024-01-25T00:26:46.695Z · LW · GW

My impression is that Alex is trying to figure out what things like "optimization" are actually like, and that this analysis will apply to a wider variety of systems than just ML. Which makes sense to me—imo, anchoring too much on current systems seems unlikely to produce general, robust solutions to alignment. 

Comment by aysja on A model of research skill · 2024-01-25T00:03:44.195Z · LW · GW

Seconded! I love Holden's posts on wicked problems, I revisit them like once a week or whenever I'm feeling down about my work :p

I've also found it incredibly useful to read historical accounts of great scientists. There's just all kinds of great thinking tips scattered among biographies, many of which I've encountered on LessWrong before, but somehow seeing them in the context of one particular intellectual journey is very helpful. 

Reading Einstein's biography (by Walter Isaacson) was by far my favorite. I felt like I got a really good handle for his style of thinking (e.g., how obsessed he was with unity—like how he felt it “unbearable” that there should be an essential difference between a magnet moving through a conducting coil and a coil moving around a magnet, although the theories at the time posited such a difference; his insistence on figuring out the physical meaning of things—with special relativity, this was the operationalization of "time," with quanta this was giving meaning to an otherwise mathematical curiosity that Planck had discovered; his specific style of thought experiments; and just a sense of how wonderful and visceral his curiosity about the world was, like how as a very young child his father brought him a compass, and as he watched the needle align due to some apparently hidden force field he trembled and grew cold at the prospect of non-mechanical causes). He's so cool! 

Comment by aysja on The Plan - 2023 Version · 2024-01-24T23:45:43.468Z · LW · GW

Yeah, I think I misspoke a bit. I do think that controllability is related to understanding, but also I’m trying to gesture at something more like “controllability implies simpleness.”

I think what I’m tracking with “controllability implies simpleness” is that the ease with which we can control things is a function of how many causal factors there are in creating it, i.e., “conjunctive things are less likely” in some Occam’s Razor sense, but also conjunctive things “cost more.” At the very least, they cost more from the agent’s point of view. Like, if I have to control every step in some horror graph in order to get the outcome I want, that’s a lot of stuff that has to go perfectly, and that’s hard. If “I,” on the other hand, as a virus, or cancer, or my brain, or whatever, only have to trigger a particular action to set off a cascade which reliably results in what I want, this is easier.

There is a sense, of course, in which the spaghetti code thing is still happening underneath it all, but I think it matters what level of abstraction you’re taking with respect to the system. Like, with enough zoom, even very simple activities start looking complex. If you looked at a ball following a parabolic arc at the particle level, it’d be pretty wild. And yet, one might reasonably assume that the regularity with which the ball follows the arc is meaningful. I am suspicious that much of the “it’s all ad-hoc” intuitions about biology/neuroscience/etc are making a zoom error (also typing errors, but that’s another problem). Even simple things can look complicated if you don’t look at them right, and I suspect that the ease with which we can operate in our world should be some evidence that this simpleness “exists,” much like the movement of the ball is “simple” even though a whole world of particles underlies its behavior.

To your point, I do think that the ability to control something is related to understanding it, but not exactly the same. Like, there indeed might be things in the environment that we don’t have much control over, even if we can clearly see what their effects are. Although, I’d be a little surprised if that were true for, e.g., obesity. Like, it seems likely that hormones (semaglutide) can control weight loss, which makes sense to me. A bunch of global stuff in bodies is regulated by hormones, in a similar way, I think, to how viruses “hook into” the cell’s machinery, i.e., it reliably triggers the right cascades. And controlling something doesn’t necessarily mean that it's possible to understand it completely. But I do suspect that the ability to control implies the existence of some level of analysis upon which the activity is pretty simple.

Comment by aysja on Recursive Middle Manager Hell · 2024-01-20T20:35:10.522Z · LW · GW

Are those things that good? I don't feel like I notice a huge quality of life difference from the pens I used ten years ago versus the pens I use now. Same with laptops and smartphones (although I care unusually little about that kind of thing so maybe I'm just not tracking it). Medicines have definitely improved although it seems worth noting that practically everyone I know has some terrible health problem they can't fix and we all still die. 

I feel like pushing the envelope on feature improvements is way easier than pushing the envelope on fundamental progress, and progress on the former seems compatible, to me, with pretty broken institutions. In some respects, small feature improvements are what you'd expect from middle manager hell, kind of like the lowest common denominator of a legible signal that you're doing something. It's true that these companies probably wouldn't exist if they were all around terrible. But imo it's more that they become pretty myopic and lame relative to what they could be. I think academia has related problems, too.

Comment by aysja on Recursive Middle Manager Hell · 2024-01-20T20:23:02.195Z · LW · GW

Google Brain was developed as part of X (Google's "moonshot factory"), which is their way of trying to create startups/startup culture within a large corporation. So was Waymo. 

Comment by aysja on Against most, but not all, AI risk analogies · 2024-01-14T06:42:51.870Z · LW · GW

They establish almost nothing of importance about the behavior and workings of real AIs, but nonetheless give the impression of a model for how we should think about AIs. 

How do you know that they establish nothing of importance? 

Many proponents of AI risk seem happy to critique analogies when they don't support the desired conclusion, such as the anthropomorphic analogy. 

At the very least, this seems to go both ways. Like, afaict, one of Quintin and Nora’s main points in “AI is Easy to Control” is that aligning AI is pretty much just like aligning humans, with the exception that we (i.e., backpropagation) have full access to the weights which makes aligning AI easier. But is aligning a human pretty much like aligning an AI? Can we count on the AI to internalize our concepts in the same way? Do humans come with different priors that make them much easier to “align”? Is the dissimilarity “AI might be vastly more intelligent and powerful than us” not relevant at all, on this question? Etc. But I don’t see them putting much rigor into that analogy—it’s just something that they assume and then move on. 

My point is that we should stop relying on analogies in the first place. Use detailed object-level arguments instead!

It seems reasonable, to me, to request more rigor when using analogies. It seems pretty wild to request that we stop relying on them altogether, almost as if you were asking us to stop thinking. Analogies seem so core to me when developing thought in novel domains, that it’s hard to imagine life without them. Yes, there are many ways AI might be. That doesn’t mean that our present world has nothing to say about it. E.g., I agree that evolution differs from ML in some meaningful ways. But it also seems like a mistake to completely throw out a major source of evidence we have about how intelligence was produced. Of course there will be differences. But no similarities? And do those similarities tell us nothing about the intelligences we might create? That seems like an exceedingly strong claim. 

Comment by aysja on Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training · 2024-01-14T04:45:35.371Z · LW · GW

I think lots of folks (but not all) would be up in arms, claiming "but modern results won't generalize to future systems!" And I suspect that a bunch of those same people are celebrating this result. I think one key difference is that this paper claims pessimistic results, and it's socially OK to make negative updates but not positive ones; and this result fits in with existing narratives and memes. Maybe I'm being too cynical, but that's my reaction.

Fwiw, my reaction to something like “we can finetune the AI to be nice in a stable way” is more like—but is it actually “nice”? I.e., I don’t feel like we’re at all clear on what “niceness” is, and behavioral proxies to that effect feel like some, but only pretty weak evidence about it.

This is my basic concern with evaluations, too. At the moment they just don’t seem at all robust enough for me to feel confident about what the results mean. But I see the sleeper agents work as progress towards the goal of “getting better at thinking about deception” and I feel pretty excited about that.

I think it’s reasonable to question whether these systems are in fact deceptive (just as I would with “niceness”). But when I evaluate results it’s not like “is this an optimistic or pessimistic update” and more like “does it seem like we understand more about X than we did before?” I think we understand more about deception because of this work, and I think that’s cool and important.

Comment by aysja on Gentleness and the artificial Other · 2024-01-13T03:52:16.923Z · LW · GW

This post is so wonderful, thank you for writing it. I’ve gone back to re-read many paragraphs over and over.

A few musings of my own:

“It’s just” … something. Oh? So eager, the urge to deflate. And so eager, too, the assumption that our concepts carve, and encompass, and withstand scrutiny. It’s simple, you see. Some things, like humans, are “sentient.” But Bing Sydney is “just” … you know. Actually, I don’t. What were you going to say? A machine? Software? A simulator? “Statistics?”

This has long driven me crazy. And I think you’re right about the source of the eagerness, although I suspect that mundanity is playing a role here, too. I suspect, in other words, that people often mistake the familiar for the understood—that no matter how strange some piece of reality is, if it happens frequently enough people come to find it normal; and hence, on some basic level, explained.

Like you, I have felt mesmerized by ctenophores at the Monterey Aquarium. I remember sitting there for an hour, staring at these curious creatures, watching their bioluminescent LED strips flicker as they gently floated in the blackness. It was so surreal. And every few minutes, this psychedelic experience would be interrupted by screaming children. Most of them would run up to the exhibit for a second, point, and then run on as their parents snapped a few pictures. Some would say “Mom, I’m bored, can we look at the otters?” And occasionally a couple would murmur to each other “That’s so weird.” But most people seemed unfazed.  

I’ve been unfazed at times, too. And when I am, it’s usually because I’m rounding off my experience to known concepts. “Oh, a fish-type thing? I know what that’s like, moving on.” As if “fish-type thing” could encompass the piece of reality behind the glass. Whereas when I have these ethereal moments of wonder—this feeling of brushing up against something that’s too huge to hold—I am dropping all of that. And it floods in, the insanity of it all—that “I” am a thing, watching this strange, flickering creature in front of me, made out of similar substances and yet so wildly different. So gossamer, the jellies are—and containing, presumably, experience. What could that be like? 

“Justs” are all too often a tribute to mundanity—the sentiment that the things around us are normal and hence, explained? And it’s so easy for things to seem normal when your experience of the world is smooth. I almost always feel myself mundane, for instance. Like a natural kind. I go to the store, call my friends, make dinner. All of it is so seamless—so regular, so simple—that it’s hard to believe any strangeness could be lurking beneath. But then, sometimes, the wonder catches me, and I remember how glaringly obvious it is that minds are the most fascinating phenomenon in the universe. I remember how insane it is—that some lumps of matter are capable of experience, of thought, of desire, of making reality bend to those desires. Are they? What does that mean? How could I be anything at all?

Minds are so weird. Not weird in the “things don’t add up to normality” way—they do. Just that, being a lump of matter like this is a deeply strange endeavor. And I fear that our familiarity with our selves blinds us to this fact. Just as it blinds us to how strange these new minds—this artificial Other, might be. And how tempting, it is, to take the thing that is too huge to hold and to paper over it with a “just” so that we may feel lighter. To mistake our blindness for understanding. How tragic a thing, to forego the wonder.