Non-Disparagement Canaries for OpenAI 2024-05-30T19:20:13.022Z
OMMC Announces RIP 2024-04-01T23:20:00.433Z
Why Are Bacteria So Simple? 2023-02-06T03:00:31.837Z


Comment by aysja on Non-Disparagement Canaries for OpenAI · 2024-06-04T08:43:28.152Z · LW · GW

I have not heard from anyone who wasn’t released, and I think it is reasonably likely I would have heard from them anonymously on Signal. Also, not releasing a bunch of people after saying they would seems like an enormously unpopular, hard to keep secret, and not very advantageous move for OpenAI, which is already taking a lot of flak for this. 

I’m not necessarily imagining that OpenAI failed to release a bunch of people, although that still seems possible to me. I’m more concerned that they haven’t released many key people, and while I agree that you might have received an anonymous Signal message to that effect if it were true, I still feel alarmed that many of these people haven’t publicly stated otherwise.

I also have a model of how people choose whether or not to make public statements where it’s extremely unsurprising most people would not choose to do so.

I do find this surprising. Many people are aware of who former OpenAI employees are, and hence are aware of who was (or is) bound by this agreement. At the very least, if I were in this position, I would want people to know that I was no longer bound. And it does seem strange to me, if the contract has been widely retracted, that so few prominent people have confirmed being released. 

It also seems pretty important to figure out who is under mutual non-disparagement agreements with OpenAI, which would still (imo) pose a problem if it applied to anyone in safety evaluations or policy positions.

Comment by aysja on Non-Disparagement Canaries for OpenAI · 2024-06-03T22:07:04.283Z · LW · GW

I imagine many of the people going into leadership positions were prepared to ignore the contract, or maybe even forgot about the nondisparagement clause

I could imagine it being the case that people are prepared to ignore the contract. But unless they publicly state as much, it wouldn’t ameliorate my concerns—how else is anyone supposed to trust that they will?

The clause is also open to more avenues of legal attack if it's enforced against someone who takes another position which requires disparagement (e.g. if it's argued to be a restriction on engaging in business). 

That seems plausible, but even if this does increase the likelihood that they’d win a legal battle, legal battles still pose huge risk and cost. This still seems like a meaningful deterrent.

I don't think it's fair to view this as a serious breach of trust on behalf of any individual, without clear evidence that it impacted their decisions or communication. 

But how could we even get this evidence? If they’re bound to the agreement their actions just look like an absence of saying disparaging things about OpenAI, or of otherwise damaging their finances or reputation. And it’s hard to tell, from the outside, whether this is a reflection of an obligation, or of a genuine stance. Positions of public responsibility require public trust, and the public doesn’t have access to the inner workings of these people’s minds. So I think it’s reasonable, upon finding out that someone has a huge and previously-undisclosed conflict of interest, to assume that might be influencing their behavior.

Comment by aysja on MIRI 2024 Communications Strategy · 2024-05-31T20:59:53.573Z · LW · GW

I agree this is usually the case, but I think it’s not always true, and I don’t think it’s necessarily true here. E.g., people as early as Da Vinci guessed that we’d be able to fly long before we had planes (or even any flying apparatus which worked). Because birds can fly, we should be able to as well (at least, this was Da Vinci’s and the Wright brothers’ reasoning). That end point was not dependent on details (early flying designs had wings like a bird’s, a design we did not keep :p), but was closer to a laws-of-physics claim (if birds can do it, there isn’t anything fundamentally holding us back from doing it either).

Superintelligence holds a similar place in my mind: intelligence is physically possible, because we exhibit it, and it seems quite arbitrary to assume that we’ve maxed it out. But also, intelligence is obviously powerful, and reality is obviously more manipulable than we currently have the means to manipulate it. E.g., we know that we should be capable of developing advanced nanotech, since cells can, and that space travel/terraforming/etc. is possible. 

These two things together—“we can likely create something much smarter than ourselves” and “reality can be radically transformed”—are enough to make me feel nervous. At some point I expect most of the universe to be transformed by agents; whether this is us, or aligned AIs, or misaligned AIs or what, I don’t know. But looking ahead and noticing that I don’t know how to select the “aligned AI” option from the set “things which will likely be able to radically transform matter” seems enough cause, in my mind, for exercising caution. 

Comment by aysja on OpenAI: Fallout · 2024-05-28T17:53:22.236Z · LW · GW

Bloomberg confirms that OpenAI has promised not to cancel vested equity under any circumstances, and to release all employees from one-directional non-disparagement agreements.

They don't actually say "all" and I haven't seen anyone confirm that all employees received this email. It seems possible (and perhaps likely) to me that many high-profile safety people did not receive this email, especially since it would presumably be in Sam's interest to withhold it from them, and since I haven't seen them claim otherwise. And we wouldn't know: those who are still under the contract can't say anything. If OpenAI only sent the email to some former employees, then they can come away with headlines like "OpenAI releases former staffers from agreement"—which is true—without giving away their whole hand. Perhaps I'm being too pessimistic, but I am under the impression that we're dealing with a quite adversarial player, and until I see hard evidence otherwise this is what I'm assuming. 

Comment by aysja on Maybe Anthropic's Long-Term Benefit Trust is powerless · 2024-05-27T21:36:03.996Z · LW · GW

Why do you think this? The power that I'm primarily concerned about is the power to pause, and I'm quite skeptical that companies like Amazon and Google would be willing to invest billions of dollars in a company which may decide to do something that renders their investment worthless. I.e., I think a serious pause, one on the order of months or years, is essentially equivalent to opting out of the race to AGI. On this question, my strong prior is that investors like Google and Amazon have more power than employees or the trust, else they wouldn't invest. 

Comment by aysja on An explanation of evil in an organized world · 2024-05-02T07:22:22.949Z · LW · GW

"So God can’t make the atoms be arranged one way and the humans be arranged another contradictory way."

But couldn't he have made a different sort of thing than humans, which was less prone to evil? Like, it seems to me that he didn't need to make us evolve through the process of natural selection, such that species were always in competition, status was a big deal, fighting over mates commonplace, etc. I do expect that there's quite a bit of convergence in the space of possible minds—even if one is selecting them from the set of "all possible atomic configurations of minds"—but I would still guess that not all of those are as prone to "evil" as us. I.e., if the laws of physics were held constant, I would think you could get less evil things than us out of it, and probably worlds which were overall more favorable to life (fewer natural disasters, etc.). But perhaps this is even more evidence that God only cares about the laws of physics? Since we seem much more like an afterthought than a priority?

Comment by aysja on The Intentional Stance, LLMs Edition · 2024-05-01T23:27:46.098Z · LW · GW

Secondly, following Dennett, the point of modeling cognitive systems according to the intentional stance is that we evaluate them on a behavioral basis and that is all there is to evaluate.

I am confused on this point. Several people have stated that Dennett believes something like this, e.g., Quintin and Nora argue that Dennett is a goal "reductionist," by which I think they mean something like "goal is the word we use to refer to certain patterns of behavior, but it's not more fundamental than that."

But I don't think Dennett believes this. He's pretty critical of behaviorism, for instance, and his essay Skinner Skinned does a good job, imo, of showing why this orientation is misguided. Dennett believes, I think, that things like "goals," "beliefs," "desires," etc. do exist, just that we haven't found the mechanistic or scientific explanation of them yet. But he doesn't think that explanations of intention will necessarily bottom out in just their outward behavior; he expects such explanations to make reference to internal states as well. Dennett is a materialist, so of course at the end of the day all explanations will be in terms of behavior (inward or outward), on some level, much like any physical explanation is. But that's a pretty different claim from "mental states do not exist." 

I'm also not sure if you're making that claim here or not, but curious if you disagree with the above? 

Comment by aysja on The first future and the best future · 2024-04-27T05:44:12.405Z · LW · GW

I don't know what Katja thinks, but for me at least: I think AI might pose much more lock-in than other technologies. I.e., I expect that we'll have much less of a chance (and perhaps much less time) to redirect course, adapt, learn from trial and error, etc. than we typically do with a new technology. Given this, I think going slower and aiming to get it right on the first try is much more important than it normally is.  

Comment by aysja on Paul Christiano named as US AI Safety Institute Head of AI Safety · 2024-04-25T23:13:40.461Z · LW · GW

I agree there are other problems the EA biosecurity community focuses on, but surely lab escapes are one of those problems, and part of the reason we need biosecurity measures? In any case, this disagreement seems beside the main point that I took Adam to be making, namely that the track record for defining appropriate units of risk for poorly understood, high-attack-surface domains is quite bad (as with BSL). This still seems true to me.

Comment by aysja on Daniel Dennett has died (1942-2024) · 2024-04-23T00:40:08.712Z · LW · GW

Dennett meant a lot to me, in part because he’s shaped my thinking so much, and in part because I think we share a kindred spirit—this ardent curiosity about minds and how they might come to exist in a world like ours. I also think he is an unusually skilled thinker and writer in many respects, as well as being an exceptionally delightful human. I miss him. 

In particular, I found his deep and persistent curiosity beautiful and inspiring, especially since it’s aimed at all the (imo) important questions. He has a clarity of thought which manages to be both soft and precise, and a robust ability to detect and avoid bullshit. His book Intuition Pumps and Other Tools for Thinking is full of helpful cognitive strategies, many of which I’ve benefited from, and many of which have parallels in the Sequences. You can just tell that he’s someone in love with minds and the art of thinking, and that he’s actually trying at it.  

But perhaps the thing I find most inspiring about him, the bit which I most want to emulate, is that he doesn’t shy away from the difficult questions—consciousness, intentionality, what real patterns are, how we can tell if a system understands something, etc—but he does so without any lapse in intellectual rigor. He’s always aiming at operationalization and gears-level understanding, but he’s careful to check whether mechanistic models in fact correspond to the higher level he’s attempting to address. He doesn’t let things be explained away, but he also doesn’t let things remain mysterious. He’s deeply committed to a materialistic understanding of the world which permits of minds. 

In short, he holds the same mysteries that I do, I think, of how thinking things could come to exist in a world made out of atoms, and he’s committed, as I am, to naturalizing such mysteries in a satisfying way. 

He’s also very clear about the role of philosophy in science: it’s the process of figuring out what the right questions even are, such that one can apply the tools of science to answer them. I think he’s right, both that this is the role of good philosophy and that we’re all pretty confused about what the right questions of mind are. I think he did an excellent job of narrowing the confusion, which is a really fucking cool and admirable thing to spend a life on. But the work isn’t done. In many ways, I view my research as picking up where he left off—the quest for a satisfying account of minds in a materialistic, deterministic world. Now that he’s passed, I realize that I really wanted him to see that. I wanted to show him my work. I feel like part of the way I was connected to the world has been severed, and I am feeling grief about that. 

I’ve learned so much from Dennett. How to think better, how to hold my curiosity better, how to love the mind, and how to wonder productively about it. I feel like the world glows dimmer now than it did before, and I feel that grief—the blinking out of this beautiful light. But it is also a good time to reflect on all that he’s done for the world, and all that he’s done for me. He is really a part of me, and I feel the love and the gratitude for what he’s brought into my life. 

Comment by aysja on Express interest in an "FHI of the West" · 2024-04-20T01:27:18.271Z · LW · GW

Aw man, this is so exciting! There’s something really important to me about rationalist virtues having a home in the world. I’m not sure if what I’m imagining is what you’re proposing, exactly, but I think most anything in this vicinity would feel like a huge world upgrade to me.

Apparently I have a lot of thoughts about this. Here are some of them, not sure how applicable they are to this project in particular. I think you can consider this to be my hopes for what such a thing might be like, which I suspect shares some overlap.

It has felt to me for a few years now like something important is dying. I think it stems from the seeming inevitability of what’s before us—the speed of AI progress, our own death, the death of perhaps everything—that looms, shadow-like. And it’s scary to me, and sad, because “inevitability” is a close cousin of “defeat,” and I fear the two inch closer all the time.   

It’s a fatalism that creeps in slow, but settles thick. And it lurks, I think, in the emotional tenor of doom that resides beneath nominally probabilistic estimates of our survival. Lurks as well, although much more plainly, within AI labs: AGI is coming whether we want it to or not, pausing is impossible, the invisible hand holds the reins, or as Claude recently explained to me, “the cat is already out of the bag.” And I think this is sometimes intentional—we are supposed to think about labs in terms of the overwhelming incentives, more than we are supposed to think about them as composed of agents with real choice, because that dispossesses them of responsibility, and dispossesses us of the ability to change them.

There is a similar kind of fatalism that often attaches to the idea of the efficient marketplace—that what is desired has already been done, that if one sits back and lets the machine unfold it will arrive at all the correct conclusions itself. There is no room, in that story, for genuinely novel ideas or progress, all forward movement is the result of incremental accretions on existing structures. This sentiment looms in academia as well—that there is nothing fundamental or new left to uncover, that all low hanging fruit has been plucked. Academic aims rarely push for all that could be—progress is instead judged relatively, the slow inching away from what already is. 

And I worry this mentality is increasingly entrenching itself within AI safety, too. That we are moving away from the sort of ambitious science that I think we need to achieve the world that glows—the sort that aims at absolute progress—and instead moving closer to an incremental machine. After all, MIRI tried and failed to develop agent foundations so maybe we can say, “case closed?” Maybe “solving alignment” was never the right frame in the first place. Maybe it always was that we needed to do the slow inching away from the known, the work that just so happens not to challenge existing social structures. There seems to me, in other words, to be a consensus closing in: new theoretical insights are unlikely to emerge, let alone to have any real impact on engineering. And unlikelier, still, to happen in time. 

I find all of this fatalism terribly confused. Not only because it has, I think, caused people to increasingly depart from the theoretical work which I believe is necessary to reach the world that glows, but because it robs us of our agency. The closer one inches towards inevitability, the further one inches away from the human spirit having any causal effect in the world. What we believe is irrelevant, what is good and right is irrelevant; the grooves have been worn, the structures erected—all that’s left is for the world to follow course. We cannot simply ask people to do what’s right, because they apparently can’t. We cannot succeed at stopping what is wrong, because the incentives are too strong to be opposed. All we can do, it seems, is to meld with the structure itself, making minor adjustments on the margin.  

And there’s a feeling I get, sometimes, when I look at all of this, as if a tidal wave were about to engulf me. The machine has a life of its own; the world is moved by forces outside of my control. And it scares me, and I feel small. But then I remember that it’s wrong. 

There was a real death, I think, that happened when MIRI leadership gave up on solving alignment, but we haven’t yet held the funeral. I think people carry that—the shadow of the fear, unnamed but tangible: that we might be racing towards our inevitable death, that there might not be much hope, that the grooves have been worn, the structures erected, and all that’s left is to give ourselves away as we watch it all unravel. It’s not a particularly inspiring vision, and in my opinion, not a particularly correct one. The future is built out of our choices; they matter, they are real. Not because it would be nice to believe it, but because it is macroscopically true. If one glances at history, it’s obvious that ideas are powerful, that people are powerful. The incentives do not dictate everything, the status quo is never the status quo for very long. The future is still ours to decide. And it’s our responsibility to do so with integrity. 

I have a sense that this spirit has been slipping, with MIRI leadership largely admitting defeat, with CFAR mostly leaving the scene, with AI labs looming increasingly large within the culture and the discourse. I don’t want it to. I want someone to hold the torch of rationality and all its virtues, to stay anchored on what is true and good amidst a landscape of rapidly changing power dynamics, to fight for what’s right with integrity, to hold a positive vision for humanity. I want a space for deep inquiry and intellectual rigor, for aiming at absolute progress, for trying to solve the god damn problem. I think Lightcone has a good shot at doing a fantastic job of bringing something like this to life, and I’m very excited to see what comes of this!

Comment by aysja on Express interest in an "FHI of the West" · 2024-04-19T21:52:14.680Z · LW · GW

Huh, I feel confused. I suppose we just have different impressions. Like, I would say that Oliver is exceedingly good at cutting through the bullshit. E.g., I consider his reasoning around shutting down the Lightcone offices to be of this type, in that it felt like a very straightforward document of important considerations, some of which I imagine were socially and/or politically costly to make. One way to say that is that I think Oliver is very high integrity, and I think this helps with bullshit detection: it's hard to see whether things cut to the core unless you deeply care about the core yourself. In any case, I think this skill carries over to object-level research, e.g., he often seems, to me, to ask cutting-to-the-core type questions there, too. I also think he's great at argument: legible reasoning, identifying the important cruxes in conversations, etc., all of which makes it easier to tell the bullshit from the not. 

I do not think of Oliver as being afraid to be disagreeable, and ime he gets to the heart of things quite quickly, so much so that I found him quite startling to interact with when we first met. And although I have some disagreements over Oliver's past walled-garden taste, from my perspective it's getting better, and I am increasingly excited about him being at the helm of a project such as this. Not sure what to say about his beacon-ness, but I do think that many people respect Oliver, Lightcone, and rationality culture more generally; I wouldn't be that surprised if there were an initial group of independent researcher types who were down and excited for this project as is. 

Comment by aysja on Transformers Represent Belief State Geometry in their Residual Stream · 2024-04-18T05:17:57.197Z · LW · GW

This is very cool! I’m excited to see where it goes :)

A couple questions (mostly me grappling with what the implications of this work might be):

  • Given a dataset of sequences of tokens, how do you find the HMM that could have generated it, and can this be done automatically? Also, is the mapping from dataset to HMM unique?
  • This question is possibly more confused on my end, sorry if so. I’m trying to get at something like “how interpretable will these simplexes be with much larger models?” Like, if I’m imagining that each state is a single token, and the HMM is capable of generating the totality of data the model sees, then I’m imagining something quite unwieldy, i.e., something with about the amount of complexity and interpretability as, e.g., the signaling cascade networks in a cell. Is this imagination wrong? Or is it more like, you start with this unwieldy structure (but which has some nice properties nonetheless), and then from there you try to make the initial structure more parse-able? Maybe a more straightforward way to ask: you say you’re interested in formalizing things like situational awareness with these tools—how might that work?
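To make the first question concrete, here is a minimal sketch of what "an HMM generating a token dataset" means—the states, tokens, and probabilities below are entirely made up for illustration. The inference problem the question is asking about would be the reverse direction: recovering something like `TRANSITIONS` and `EMISSIONS` (and deciding whether they are unique) given only samples like the one this produces.

```python
import random

# A toy 2-state HMM. Each hidden state has a distribution over
# next hidden states, and a distribution over emitted tokens.
TRANSITIONS = {          # P(next_state | state)
    "S0": {"S0": 0.9, "S1": 0.1},
    "S1": {"S0": 0.5, "S1": 0.5},
}
EMISSIONS = {            # P(token | state)
    "S0": {"a": 0.8, "b": 0.2},
    "S1": {"a": 0.1, "b": 0.9},
}

def sample(dist, rng):
    """Draw a key from a {key: probability} dict."""
    r = rng.random()
    cum = 0.0
    for key, p in dist.items():
        cum += p
        if r < cum:
            return key
    return key  # guard against floating-point rounding

def generate(n_tokens, rng=None, state="S0"):
    """Emit a token sequence by walking the hidden Markov chain.

    Note: an observer sees only the tokens, never `state`.
    """
    rng = rng or random.Random(0)
    tokens = []
    for _ in range(n_tokens):
        tokens.append(sample(EMISSIONS[state], rng))
        state = sample(TRANSITIONS[state], rng)
    return tokens

print("".join(generate(40)))
```

A predictor trained on such sequences has to track a belief distribution over the hidden state (here, just P(S0) vs. P(S1)) to predict the next token optimally, which is the belief-state geometry the post finds in the residual stream.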
Comment by aysja on [deleted post] 2024-04-09T09:17:56.495Z

Something feels very off to me about these kinds of speciesist arguments. Like the circle of moral concern hasn’t expanded, but imploded, rooting out the very center from which it grew. Yes, there is a sense in which valuing what I value is arbitrary and selfish, but concluding that I should completely forego what I value seems pretty alarming to me, and I would assume, to most other humans who currently exist.

Comment by aysja on Alexander Gietelink Oldenziel's Shortform · 2024-03-29T09:35:07.084Z · LW · GW

I guess I'm not sure what you mean by "most scientific progress," and I'm missing some of the history here, but my sense is that importance-weighted science happens proportionally more outside of academia. E.g., Einstein did his miracle year outside of academia (and later stated that he wouldn't have been able to do it, had he succeeded at getting an academic position), Darwin figured out natural selection, and Carnot figured out the Carnot cycle, all mostly on their own, outside of academia. Those are three major scientists who arguably started entire fields (quantum mechanics, evolutionary biology, and thermodynamics). I would anti-predict that future scientific progress, of the field-founding sort, comes primarily from people at prestigious universities, since they, imo, typically have some of the most intense gatekeeping dynamics which make it harder to have original thoughts. 

Comment by aysja on Natural Latents: The Concepts · 2024-03-21T18:39:02.371Z · LW · GW

I don’t see how the cluster argument resolves the circularity problem. 

The circularity problem, as I see it, is that your definition of an abstraction shouldn’t be dependent on already having the abstraction. I.e., if the only way to define the abstraction “dog” involves you already knowing the abstraction “dog” well enough to create the set of all dogs, then probably you’re missing some of the explanation for abstraction. But the clusters in thingspace argument also depends on having an abstraction—knowing to look for genomes, or fur, or bark, is dependent on us already understanding what dogs are like. After all, there are nearly infinite “axes” one could look at, but we already know to only consider some of them. In other words, it seems like this has just passed the buck from choice of object to choice of properties, but you’re still making that choice based on the abstraction. 

The fact that choice of axis—from among the axes we already know to be relevant—is stable (i.e., creates the same clusterings) feels like a central and interesting point about abstractions. But it doesn’t seem like it resolves the circularity problem. 

(In retrospect the rest of this comment is thinking-out-loud for myself, mostly :p but you might find it interesting nonetheless). 

I think it’s hard to completely escape this problem—we need to use some of our own concepts when understanding the territory, as we can’t see it directly—but I do think it’s possible to get a bit more objective than this. E.g., I consider thermodynamics/stat mech to be pretty centrally about abstractions, but it approaches them in a way that feels more “territory first,” if that makes any sense. Like, it doesn’t start with the conclusion. It started with the observation that “heat moves stuff” and “what’s up with that” and then eventually landed with an analysis of entropy involving macrostates. Somehow that progression feels more natural to me than starting with “dogs are things” and working backwards. E.g., I think I’m wanting something more like “if we understand these basic facts about the world, we can talk about dogs” rather than “if we start with dogs, we can talk sensibly about dogs.” 

To be clear, I consider some of your work to be addressing this. E.g., I think the telephone theorem is a pretty important step in this direction. Much of the stuff about redundancy and modularity feels pretty tip-of-the-tongue onto something important, to me. But, at the very least, my goal with understanding abstractions is something like “how do we understand the world such that abstractions are natural kinds”? How do we find the joints such that, conditioning on those, there isn’t much room to vary? What are those joints like? The reason I like the telephone theorem is that it gives me one such handle: all else equal, information will dissipate quickly—anytime you see information persisting, it’s evidence of abstraction. 

My own sense is that answering this question will have a lot more to do with how useful abstractions are, rather than how predictive/descriptive they are, which are related questions, but not quite the same. E.g., with the gears example you use to illustrate redundancy, I think the fact that we can predict almost everything about the gear from understanding a part of it is the same reason why the gear is useful. You don’t have to manipulate every atom in the gear to get it to move, you only have to press down on one of the… spokes(?), and the entire thing will turn. These are related properties. But they are not the same. E.g., you can think about the word “stop” as an abstraction in the sense that many sound waves map to the same “concept,” but that’s not very related to why the sound wave is so useful. It’s useful because it fits into the structure of the world: other minds will do things in response to it.

I want better ways to talk about how agents get work out of their environments by leveraging abstractions. I think this is the reason we ultimately care about them ourselves; and why AI will too. I also think it’s a big part of how we should be defining them—that the natural joint is less “what are the aggregate statistics of this set” but more “what does having this information allow us to do”? 

Comment by aysja on Vote on Anthropic Topics to Discuss · 2024-03-18T23:06:45.566Z · LW · GW

I think it’s pretty unlikely that Anthropic’s murky strategy is good. 

In particular, I think that balancing building AGI with building AGI safely only goes well for humanity in a pretty narrow range of worlds. Like, if safety is relatively easy and can roughly keep pace with capabilities, then I think this sort of thing might make sense. But the more the expected world departs from this—the more that you might expect safety to be way behind capabilities, and the more you might expect that it’s hard to notice just how big that gap is and/or how much of a threat capabilities pose—the more this strategy starts seeming pretty worrying to me.  

It’s worrying because I don’t imagine Anthropic gets that many “shots” at playing safety cards, so to speak. Like, implementing RSPs and trying to influence norms is one thing, but what about if they notice something actually-maybe-dangerous-but-they’re-not-sure as they’re building? Now they’re in this position where if they want to be really careful (e.g., taking costly actions like: stop indefinitely until they’re beyond reasonable doubt that it’s safe) they’re most likely kind of screwing their investors, and should probably expect to get less funding in the future. And the more likely it is, from their perspective, that the behavior in question does end up being a false alarm, the more pressure there is to not do due diligence. 

But the problem is that the more ambiguous the situation is—the less we understand about these systems—the less sure we can be about whether any given behavior is or isn’t an indication of something pretty dangerous. And the current situation seems pretty ambiguous to me. I don’t think anyone knows, for instance, whether Claude 3 seeming to notice it’s being tested is something to worry about or not. Probably it isn’t. But really, how do we know? We’re going off of mostly behavioral cues and making educated guesses about what the behavior implies. But that really isn’t very reassuring when we’re building something much smarter than us, with potentially catastrophic consequences. As it stands, I don’t believe we can even assign numbers to things in a very meaningful sense, let alone assign confidence above a remotely acceptable threshold, i.e., some 9’s of assurance that what we’re about to embark on won’t kill everyone.    

The combination of how much uncertainty there is in evaluating these systems, and how much pressure there is for Anthropic to keep scaling seems very worrying to me. Like, if there’s a very obvious sign that a system is dangerous, then I believe Anthropic might be in a good position to pause and “sound the alarm.” But if things remain kind of ambiguous due to our lack of understanding, as they seem to me now, then I’m way less optimistic that the outcome of any maybe-dangerous-but-we’re-not-sure behavior is that Anthropic meaningfully and safely addresses it. In other words, I think that given our current state of understanding, the murky strategy favors “build AGI” more than it does “build AGI safely” and that scares me. 

I also think the prior should be quite strong, here, that the obvious incentives will have the obvious effects. Like, creating AGI is desirable (so long as it doesn’t kill everyone and so on). Not only on the “loads of money” axis, but also along other axes monkeys care about: prestige, status, influence, power, etc. Yes, practically no one wants to die, and I don’t doubt that many people at Anthropic genuinely care and are worried about this. But, also, it really seems like you should a priori expect that with stakes this high, cognition will get distorted around whether or not to pursue the stakes. Maybe all Anthropic staff are perfectly equipped to be epistemically sane in such an environment, but I don’t think that one should on priors expect it. People get convinced of all kinds of things when they have a lot to gain, or a lot to lose. 

Anyway, it seems likely to me that we will continue to live in the world where we don’t understand these systems well enough to be confident in our evaluations of them, and I assign pretty significant probability to the worlds where capabilities far outstrip our alignment techniques, so I am currently not thrilled that Anthropic exists. I expect that their murky strategy is net bad for humanity, given how the landscape currently looks. 

Maybe you really do need to iterate on frontier AI to do meaningful safety work.

This seems like an open question that, to my mind, Anthropic has not fully explored. One way that I sometimes think about this is to ask: if Anthropic were the only leading AI lab, with no possibility of anyone catching up any time soon, should they still be scaling as fast as they are? My guess is no. Like, of course the safety benefit of scaling is not zero. But it’s a question of whether the benefits outweigh the costs. Given how little we understand these systems, I’d be surprised if we were anywhere near hitting diminishing safety returns—as in, I don’t think the safety benefits of scaling vastly outstrip the benefit we might expect out of empirical work on current systems. And I think the potential cost of scaling as recklessly as we currently are is extinction. I don’t doubt that at some point scaling will be necessary and important for safety; I do doubt that the time for that is now.

Maybe you do need to stay on the frontier because the world is accelerating whether Anthropic wants it to or not.

It really feels like if you create an organization which, with some unsettlingly large probability, might directly lead to the extinction of humanity, then you’re doing something wrong. Especially so if the people you’re making the decisions for (i.e., everyone) would—if they fully understood the risks involved—be unhappy about it on reflection. Like, I’m pretty sure that the sentence from Anthropic’s pitch deck, “these models could begin to automate large portions of the economy,” is already enough for many people to be pretty upset. But if they learned that Anthropic also assigned ~33% to a “pessimistic world” which includes the possibility of extinction, then I expect most people would rightly be pretty furious. I think making decisions for people in a way that they would predictably be upset about is unethical, and it doesn’t become okay just because other people would do it anyway.

In any case, I think that Anthropic’s existence has hastened race dynamics, and I think that makes our chances of survival lower. That seems pretty in line with what to expect from this kind of strategy (i.e., that it cashes out to scaling coming before safety where it’s non-obvious what to do), and I think it makes sense to expect things of this type going forward (e.g., I am personally pretty skeptical that Anthropic is going to ~meaningfully pause development unless it’s glaringly obvious that they should do so, at which point I think we’re clearly in a pretty bad situation). And although OpenAI was never claiming as much of a safety vibe as Anthropic currently is, I still think the track record of ambiguous strategies which play to both sides does not inspire that much confidence about Anthropic’s trajectory. 

Does Dario-and-other-leadership have good models of x-risk?

I am worried about this. My read on the situation is that Dario is basically expecting something more like a tool than an agent. Broadly, I get this sense because when I model Anthropic as operating under the assumption that risks mostly stem from misuse, their actions make a lot more sense to me. But also things like this quote seem consistent with that: “I suspect that it may roughly work to think of the model as if it's trained in the normal way, just getting to above human level, it may be a reasonable assumption… that the internal structure of the model is not intentionally optimizing against us.” (Dario on the Dwarkesh podcast). If true, this makes me worried about the choices that Dario is going to make when, again, it’s not clear how to interpret the behavior of these systems. In particular, it makes me worried he’s going to err on the side of “this is probably fine,” since tools seem, all else equal, less dangerous than agents. Dario isn’t the only person Anthropic’s decisions depend on; still, I think his beliefs have a large bearing on what Anthropic does.

But, the way I wish the conversation was playing out was less like "did Anthropic say a particular misleading thing?"

I think it’s pretty important to call attention to misleading things. Both because there is some chance that public focus on inconsistencies might cause them to change their behavior, and because pointing out specific problems in public arenas often causes evidence to come forward in one common space, and then everyone can gain a clearer understanding of what’s going on. 

Comment by aysja on On Claude 3.0 · 2024-03-07T21:59:38.382Z · LW · GW

Things like their RSP rely on being upheld in spirit, not only in letter.

This is something I’m worried about. I think that Anthropic’s current RSP is vague and/or undefined on many crucial points. For instance, I feel pretty nervous about Anthropic’s proposed response to an evaluation threshold triggering. One of the first steps is that they will “conduct a thorough analysis to determine whether the evaluation was overly conservative,” without describing what this “thorough analysis” is, nor who is doing it. 

In other words, they will undertake some currently undefined process involving undefined people to decide whether it was a false alarm. Given how much is riding on this decision—like, you know, all of the potential profit they’d be losing if they concluded that the model was in fact dangerous—it seems pretty important to be clear about how these things will be resolved. 

Instituting a policy like this is only helpful insomuch as it meaningfully constrains the company’s behavior. But when the responses to evaluations are this loosely and vaguely defined, it’s hard for me to trust that the RSP cashes out to more than a vague hope that Anthropic will be careful. It would be nice to feel like the Long Term Benefit Trust provided some kind of assurance against this. But even that seems difficult to trust when they’ve added “failsafe provisions” that allow a “sufficiently large” supermajority of stockholders to make changes to the Trust’s powers (without the Trustees’ consent), and without saying what counts as “sufficiently large.”

Comment by aysja on Anthropic release Claude 3, claims >GPT-4 Performance · 2024-03-07T00:20:29.551Z · LW · GW

It seems plausible that this scenario could happen, i.e., that Anthropic and OpenAI end up in a stable two-player oligopoly. But I would still be pretty surprised if Anthropic's pitch to investors, when asking for billions of dollars in funding, is that they pre-commit to never release a substantially better product than their main competitor. 

Comment by aysja on Anthropic release Claude 3, claims >GPT-4 Performance · 2024-03-07T00:05:45.089Z · LW · GW

I agree that this is a plausible read of their pitch to investors, but I do think it’s a bit of a stretch to consider it the most likely explanation. It’s hard for me to believe that Anthropic would receive billions of dollars in funding if they were explicitly telling investors that they’re committing to only release equivalent or inferior products relative to their main competitor.

Comment by aysja on Vote on Anthropic Topics to Discuss · 2024-03-06T21:21:07.742Z · LW · GW

I assign >50% that Anthropic will at some point pause development for at least six months as a result of safety evaluations. 

Comment by aysja on Vote on Anthropic Topics to Discuss · 2024-03-06T21:07:41.957Z · LW · GW

I believed, prior to the Claude 3 release, that Anthropic had committed to not meaningfully push the frontier. 

Comment by aysja on Vote on Anthropic Topics to Discuss · 2024-03-06T21:07:30.383Z · LW · GW

I believed, prior to the Claude 3 release, that Anthropic had implied they were not going to meaningfully push the frontier.

Comment by aysja on Vote on Anthropic Topics to Discuss · 2024-03-06T20:54:38.652Z · LW · GW

I currently believe that Anthropic is planning to meaningfully push the frontier. 

Comment by aysja on Vote on Anthropic Topics to Discuss · 2024-03-06T20:54:16.039Z · LW · GW

I currently believe that Anthropic previously committed to not meaningfully push the frontier.

Comment by aysja on Vote on Anthropic Topics to Discuss · 2024-03-06T20:53:12.681Z · LW · GW

I assign >10% that Anthropic will at some point completely halt development of AI, and attempt to persuade other organizations to as well (i.e., “sound the alarm.”)

Comment by aysja on Vote on Anthropic Topics to Discuss · 2024-03-06T20:52:34.353Z · LW · GW

I assign >10% that Anthropic will at some point pause development for at least a year as a result of safety evaluations.

Comment by aysja on Vote on Anthropic Topics to Discuss · 2024-03-06T20:52:17.430Z · LW · GW

I assign >10% that Anthropic will at some point pause development for at least six months as a result of safety evaluations. 

Comment by aysja on Vote on Anthropic Topics to Discuss · 2024-03-06T20:52:01.193Z · LW · GW

I assign >10% that Anthropic will at some point pause development as a result of safety evaluations. 

Comment by aysja on Anthropic release Claude 3, claims >GPT-4 Performance · 2024-03-06T01:40:34.169Z · LW · GW

I interpreted you, in your previous comment, as claiming that Anthropic’s RSP is explicit in its compatibility with meaningfully pushing the frontier. Dustin is under the impression that Anthropic verbally committed otherwise. Whether or not Claude 3 pushed the frontier seems somewhat orthogonal to this question—did Anthropic commit and/or heavily imply that they weren’t going to push the frontier, and if so, does the RSP quietly contradict that commitment? My current read is that the answer to both questions is yes. If this is the case, I think that Anthropic has been pretty misleading about a crucial part of their safety plan, and this seems quite bad to me.

Comment by aysja on Anthropic release Claude 3, claims >GPT-4 Performance · 2024-03-06T01:38:05.272Z · LW · GW

If one of the effects of instituting a responsible scaling policy was that Anthropic moved from the stance of not meaningfully pushing the frontier to “it’s okay to push the frontier so long as we deem it safe,” this seems like a pretty important shift that was not well communicated. I, for one, did not interpret Anthropic’s RSP as a statement that they were now okay with advancing the state of the art, nor did many others; I think that’s because the RSP did not make it clear that they were updating this position. Like, with hindsight I can see how the language in the RSP is consistent with pushing the frontier. But I think the language is also consistent with not pushing it. E.g., when I was operating under the assumption that Anthropic had committed to this, I interpreted the RSP as saying “we’re aiming to scale responsibly to the extent that we scale at all, which will remain at or behind the frontier.”

Attempting to be forthright about this would, imo, look like a clear explanation of Anthropic’s previous stance relative to the new one they were adopting, and their reasons for doing so. To the extent that they didn’t feel the need to do this, I worry that it’s because their previous stance was more of a vibe, and therefore non-binding. But if they were using that vibe to gain resources (funding, talent, etc.), then this seems quite bad to me. It shouldn’t both be the case that they benefit from ambiguity but then aren’t held to any of the consequences of breaking it. And indeed, this makes me pretty wary of other trust/deferral based support that people currently give to Anthropic. I think that if the RSP in fact indicates a departure from their previous stance of not meaningfully pushing the frontier, then this is a negative update about Anthropic holding to the spirit of their commitments. 

Comment by aysja on Anthropic release Claude 3, claims >GPT-4 Performance · 2024-03-05T00:39:26.277Z · LW · GW

Several people have pointed out that this post seems to take a different stance on race dynamics than was expressed previously.

I think it clearly does. From my perspective, Anthropic's post is misleading either way—either Claude 3 doesn’t outperform its peers, in which case claiming otherwise is misleading, or they are in fact pushing the frontier, in which case they’ve misled people by suggesting that they would not do this. 

Also, “We do not believe that model intelligence is anywhere near its limits, and we plan to release frequent updates to the Claude 3 model family over the next few months” doesn’t inspire much confidence that they’re not trying to surpass other models in the near future. 

In any case, I don’t see much reason to think that Anthropic is not aiming to push the frontier. For one, to the best of my knowledge they’ve never even publicly stated they wouldn’t; to the extent that people believe it anyway, it is, as best I can tell, mostly just through word of mouth and some vague statements from Dario. Second, it’s hard for me to imagine that they’re pitching investors on a plan that explicitly aims to make an inferior product relative to their competitors. Indeed, their leaked pitch deck suggests otherwise: “We believe that companies that train the best 2025/26 models will be too far ahead for anyone to catch up in subsequent cycles.” I think the most straightforward interpretation of this sentence is that Anthropic is racing to build AGI.

And if they are indeed pushing the frontier, this seems like a negative update about them holding to other commitments about safety. Because while it’s true that Anthropic never, to the best of my knowledge, explicitly stated that they wouldn’t do so, they nevertheless appeared to me to strongly imply it. E.g., in his podcast with Dwarkesh, Dario says: 

I think we've been relatively responsible in the sense that we didn't cause the big acceleration that happened late last year and at the beginning of this year. We weren't the ones who did that. And honestly, if you look at the reaction of Google, that might be ten times more important than anything else. And then once it had happened, once the ecosystem had changed, then we did a lot of things to stay on the frontier. 

And Dario on an FLI podcast:

I think we shouldn't be racing ahead or trying to build models that are way bigger than other orgs are building them. And we shouldn't, I think, be trying to ramp up excitement or hype about giant models or the latest advances. But we should build the things that we need to do the safety work and we should try to do the safety work as well as we can on top of models that are reasonably close to state of the art. 

None of this is Dario saying that Anthropic won’t try to push the frontier, but it certainly heavily suggests that they are aiming to remain at least slightly behind it. And indeed, my impression is that many people expected this from Anthropic, including people who work there, which seems like evidence that this was the implied message. 

If Anthropic is in fact attempting to push the frontier, then I think this is pretty bad. They shouldn’t be this vague and misleading about something this important, especially in a way that caused many people to socially support them (and perhaps make decisions to work there). I perhaps cynically think this vagueness was intentional—it seems implausible to me that Anthropic did not know that people believed this, yet they never tried to correct it, which I would guess benefited them: safety-conscious engineers are more likely to work somewhere that they believe isn’t racing to build AGI. Hopefully I’m wrong about at least some of this.

In any case, whether or not Claude 3 already surpasses the frontier, soon will, or doesn’t, I request that Anthropic explicitly clarify whether their intention is to push the frontier.

Comment by aysja on Why does generalization work? · 2024-02-21T00:26:24.179Z · LW · GW

I think dust theory is wrong in the most permissive sense: there are physical constraints on what computations (and abstractions) can be like. The most obvious one is “things that are not in each other’s lightcones can’t interact,” and interaction is necessary for computation (setting aside acausal trades and stuff, which I think are still causal in a relevant sense, but don't want to get into rn). But there are also things like: information degrades over distance (roughly the number of interactions, i.e., telephone theorem), and so you'd expect "large" computations to take a certain shape, i.e., to have a structure which supports this long-range communication, such as e.g., wires.

More than that, though, I think if you disrespect the natural ordering of the environment you end up paying thermodynamic costs. Like, if you take the spectrum of visible light, ordered from ~400 to 800 nm and you just randomly pick wavelengths and assign them to colors arbitrarily (e.g., "red" is wavelengths 505, 780, 402, etc.), then you have to pay more cost to encode the color. Because, imo, the whole point of abstractions is that they're strategically imprecise. I don't have to model the exact wavelengths of the color red, it's whatever is in the range ~600-800, and I can rely on averages to encode that well enough. But if red is wavelengths 505, 780, 402, etc., now averages won't help, and I need to make more precise measurements. Precision is costly: it uses more bits, and bits have physical cost (e.g., Landauer's limit). 
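To make the description-length point concrete, here's a toy sketch of the encoding cost (the integer-nanometer grid and the specific bit counts are my own illustrative assumptions, not anything claimed above):

```python
import math

# Toy model: visible wavelengths on an integer-nanometer grid, 400-800 nm.
N_WAVELENGTHS = 401
BITS_PER_WAVELENGTH = math.ceil(math.log2(N_WAVELENGTHS))  # 9 bits to name one wavelength

def bits_for_interval():
    # A contiguous band like "red = 600-800 nm" is pinned down by just two endpoints.
    return 2 * BITS_PER_WAVELENGTH

def bits_for_arbitrary_set(members):
    # A scattered category ("red = 505, 780, 402, ...") has to list every member.
    return len(members) * BITS_PER_WAVELENGTH

print(bits_for_interval())                                # 18 bits, regardless of band width
print(bits_for_arbitrary_set([505, 780, 402, 610, 444]))  # 45 bits, growing with each member
```

The asymmetry only gets worse as the scattered category grows, which is (roughly) the sense in which the territory's contiguity is what makes the abstraction cheap.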

I guess you could argue that someone else might go and see the light spectrum differently, i.e., what looks like wavelengths 505 vs 780 to us looks like wavelengths 505 vs 506 to them? But without a particular reason to think so, it seems like a general-purpose counterargument to me. You could always say that someone would see it differently—but why would they?

Comment by aysja on Dreams of AI alignment: The danger of suggestive names · 2024-02-16T09:02:22.414Z · LW · GW

I agree with you that people get sloppy with these terms, and this seems bad. But there’s something important to me about holding space for uncertainty, too. I think that we understand practically every term on this list exceedingly poorly. Yes, we can point to things in the world, and sometimes even the mechanisms underlying them, but we don’t know what we mean in any satisfyingly general way. E.g. “agency” does not seem well described to me as “trained by reinforcement learning.” I don’t really know what it is well described by, and that's the point. Pretending otherwise only precludes us from trying to describe it better.

I think there's a lot of room for improvement in how we understand minds, i.e., I expect science is possible here. So I feel wary of mental moves such as these, e.g., replacing “optimal” with “set of sequential actions which have subjectively maximal expected utility relative to [entity X]'s imputed beliefs,” as if that settled the matter. Because I think it gives a sense that we know what we’re talking about when I don’t think we do. Is a utility function the right way to model an agent? Can we reliably impute beliefs? How do we know we’re doing that right, or that when we say ‘belief’ it maps to something that is in fact like a belief? What is a belief? Why actions instead of world states? And so on.

It seems good to aim for precision and gears-level understanding wherever possible. But I don’t want this to convince us that we aren’t confused. Yes, we could replace the “tool versus agent” debate with things like “was it trained via RL or not,” or what have you, but it wouldn’t be very satisfying because ultimately that isn’t the thing we’re trying to point at. We don’t have good definitions of mind-type things yet, and I don’t want us to forget that. 

Comment by aysja on Critiques of the AI control agenda · 2024-02-15T23:41:27.589Z · LW · GW

At a high level I feel confused about how we get good control evaluations without having good capability evaluations. If we don't understand what the capability is like, how can we be sure that our red-teaming has caught everything? Where "sure" is more like "99.9%" rather than "70%." In general, the eval situation seems crazily offense-advantaged to me: if there's just one thing we haven't looked for that the AI can sneak by, we've lost. The only way this changes, imo, is if we can directly measure the capabilities themselves, and just check "is it there—yes or no?" And if we don't know how to do that, then I don't understand how we can feel confident in control strategies, either.

Comment by aysja on Natural abstractions are observer-dependent: a conversation with John Wentworth · 2024-02-12T23:28:54.915Z · LW · GW

we have some vague intuition that an abstraction like pressure will always be useful, because of some fundamental statistical property of reality (non-dependent on the macrostates we are trying to track), and that's not quite true.

I do actually think this is basically true. It seems to me that when people encounter that maps are not the territory—see that macrostates are relative to our perceptual machinery or what have you—they sometimes assume that this means the territory is arbitrarily permissive of abstractions. But that seems wrong to me: the territory constrains what sorts of things maps are like. The idea of natural abstractions, imo, is to point a bit better at what this “territory constrains the map” thing is. 

Like sure, you could make up some abstraction, some summary statistic like “the center point of America” which is just the point at which half of the population is on one side and half on the other (thanks to Dennett for this example). But that would be horrible, because it’s obviously not very joint-carvey. Where “joint carvy-ness” will end up being, I suspect, related to “gears that move the world,” i.e., the bits of the territory that can do surprisingly much, have surprisingly much reach, etc. (similar to the conserved information sense that John talks about). And I think that’s a territory property that minds pick up on, exploit, etc. That the directionality is shaped more like “territory to map,” rather than “map to territory.” 

Another way to say it is that if you sampled from the space of all minds (whatever that space, um, is), anything trying to model the world would very likely end up at the concept “pressure.” (Although I don’t love this definition because I think it ends up placing too much emphasis on maps, when really I think pressure is more like a territory object, much more so than, e.g., the center point of America is). 

There again I think the correct answer is the intentional stance: an agent is whatever is useful for me to model as intention-driven. 

I think the intentional stance is not the right answer here, and we should be happy it’s not, because it’s approximately the worst sort of knowledge possible. Not just behaviorist (i.e., not gears-level), but also subjective (relative to a map), and arbitrary (relative to my map). In any case, Dennett’s original intention with it was not to be the be-all end-all definition of agency. He was just trying to figure out where the “fact of the matter” resided. His conclusion: in the predictive strategy. Not in the agent itself, nor in the map, but in the interaction between the two.

But Dennett, like me, finds this unsatisfying. The real juice is in the question of why the intentional stance works so well. And the answer to that is, I think, almost entirely a territory question. What is it about the territory, such that this predictive strategy works so well? After all, if one analyzes the world through the logic of the intentional stance, then everything is defined relative to a predictive strategy: oranges, chairs, oceans, planets. And certainly, we have maps. But it seems to me that the way science has proceeded in the past is to treat such objects as “out there” in a fundamental way, and that this has fared pretty well so far. I don’t see much reason to abandon it when it comes to agents. 

I think a science of agency, to the extent it inherits the intentional stance, should focus not on defining agents this way, but on asking why it works at all. 

Comment by aysja on So8res's Shortform · 2024-02-10T22:50:00.864Z · LW · GW

It seems like this is only directionally better if it’s true, and this is still an open question for me. Like, I buy that some of the commitments around securing weights are true, and that seems good. I’m way less sure that companies will in fact pause development pending their assessment of evaluations. And to the extent that they are not, in a meaningful sense, planning to pause, this seems quite bad. It seems potentially worse, to me, to have a structure legitimizing this decision and making it seem more responsible than it is, rather than just openly doing a reckless thing. Not only because it seems dishonest, but also because unambiguous behavior is easier for people to point at and hence to understand, to stop, etc.

I don’t want to stomp on hope, but I’d also like labs not to stomp out existence. AI companies are risking people’s lives without their consent, far more so than is remotely acceptable—their own estimated risk of extinction/catastrophe is sometimes as high as 33%—and this seems unacceptable to me. They should absolutely be getting pushback if their commitments are not up to par. Doing relatively better is not what matters.

Comment by aysja on Leading The Parade · 2024-02-02T21:39:22.152Z · LW · GW

I do think that counterfactual impact is an important thing to track, although two people discovering something at the same time doesn’t seem like especially strong evidence that they were just "leading the parade." It matters how large the set is. I.e., I doubt there were more than ~5 people around Newton’s time who could have come up with calculus. Creating things is just really hard, and I think often a pretty conjunctive set of factors needs to come together to make it happen (some of those are dispositional (ambition, intelligence, etc.), others are more like “was the groundwater there,” and others are like “did they even notice there was something worth doing here in the first place” etc).

Another way to say it is that there’s a reason only two people discovered calculus at the same time, and not tens, or hundreds. Why just two? A similar thing happened with Darwin, where Wallace came up with natural selection around the same time (they actually initially published it together). But having read a bunch about Darwin and that time period, I feel fairly confident that they were the only two people “on the scent,” so to speak. Malthus influenced them both, as did living in England when the industrial revolution really took off (capitalism has a “survival of the fittest” vibe), so there was some groundwater there. But it was only these two who took that groundwater and did something powerful with it, and I don’t think there were that many other people around who could have. (One small piece of evidence of that effect: Origin of Species was published a year and a half after their initial publication, and no one else published anything on natural selection within that timespan, even after the initial idea was out there.)

Also, I mostly agree about Shannon being more independent, although I do think that Turing was “on the scent” of information theory as well. E.g., from The Information: “Turing cared about the data that changed the probability: a probability factor, something like the weight of the evidence. He invented a unit he named a ‘ban.’ He found it convenient to use a logarithmic scale, so that bans would be added rather than multiplied. With a base of ten, a ban was the weight of evidence needed to make a fact ten times as likely.” This seems, to me, to veer pretty close to information theory and I think this is fairly common: a few people “on the scent,” i.e., noticing that there’s something interesting to discover somewhere, having the right questions in the first place, etc.—but only one or two who actually put in the right kind of effort to complete the idea.
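As an aside on the quoted passage: the "ban" is just the base-10 logarithm of a likelihood ratio, which is exactly why independent pieces of evidence add rather than multiply. A minimal sketch (mine, not from The Information):

```python
import math

def bans(likelihood_ratio):
    # Turing's unit: log base 10 of the evidence ratio, so weights of evidence add.
    return math.log10(likelihood_ratio)

# One ban is the weight of evidence that makes a fact ten times as likely:
print(bans(10))              # 1.0
# Two independent observations combine by addition in log space:
print(bans(10) + bans(100))  # ≈ 3.0, the same as a single 1000:1 likelihood ratio
```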

There’s also something important to me about the opposite problem, which is how to assign blame when “someone else would have done it anyway.” E.g., as far as I can tell, much of Anthropic’s reasoning for why they’re not directly responsible for AI risk is because scaling is inevitable, i.e., that other labs would do it anyway. I don’t agree with them on the object-level claim (i.e., it seems possible to cause regulation to institute a pause), but even if I did, I still want to assign them blame for in fact being the ones taking the risky actions. This feels more true for me the fewer actors there are, i.e., at the point when there are only three big labs I think each of them is significantly contributing to risk, whereas if there were hundreds of leading labs I’d be less upset by any individual one. But there’s still a part of me that feels deontological about it, too—a sense that you’re just really not supposed to take actions that risky, no matter how inculpable you are counterfactually speaking.

Likewise, I have similar feelings about scientific discoveries. The people who did them are in fact the ones who did the work, and that matters to me. It matters more the smaller the set of possible people is, of course, but there’s some level upon which I want to be like “look they did an awesome thing here; it in fact wasn’t other people, and I want to assign them credit for that.” It’s related to a sense I have that doing great work is just really hard and that people perpetually underestimate this difficulty. For instance, people sometimes write off any good Musk has done (e.g., the good for climate change by creating Tesla, etc.) by saying “someone else would have made Tesla anyway” and I have to wonder, “really?” I certainly don’t look at the world and expect to see Teslas popping up everywhere. Likewise, I don’t look at the world and expect to see tons of leading AI labs, nor do I expect to see hundreds of people pushing the envelope on understanding what minds are. Few people try to do great things, and I think the set of people who might have done any particular great thing is often quite small.

Comment by aysja on The case for ensuring that powerful AIs are controlled · 2024-01-29T18:32:34.193Z · LW · GW

The incremental approach bakes in a few assumptions, namely that there likely won't be any sharp left turns, that the work you do on smaller systems will apply to the next ones you build, and so on. I think the problem is that we don't know that this will hold, and that there's reason to suspect it won't. And if we do in fact live in a world where our systems undergo sudden and extreme shifts in intelligence, then the iterative approach is imo very risky.

I think “solving alignment” is not exactly the right phrase, but I suspect it’s closer to what we want, here, than “thousands of people incrementally contributing.” When fields are thoroughly confused, thousands of people contributing mostly just amounts to more confusion, especially when everyone is working within a broken paradigm. And often in this situation—when new fields are needed—substantial scientific progress happens because of one or a few individuals. I don’t expect a single person to solve the entire problem by themselves, of course, but I do think it’s plausible that one or a few people will develop a robust underlying theory that marks substantial progress towards “solving alignment.”

Comment by aysja on Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI · 2024-01-26T18:46:29.430Z · LW · GW

Thanks for writing this up! I've been considering writing something in response to AI is easy to control for a while now, in particular arguing against their claim that "If we could observe and modify everything that’s going on in a human brain, we’d be able to use optimization algorithms to calculate the precise modifications to the synaptic weights which would cause a desired change in behavior." I think Section 4 does a good job of explaining why this probably isn't true, with the basic problem being that the space of behaviors consistent with the training data is larger than the space of behaviors you might "desire." 

Like, sure, if you have a mapping from synapses to desired behavior, okay—but the key word there is "desired" and at that point you're basically just describing having solved mechanistic interpretability. In the absence of knowing exactly how synapses/weights/etc map onto the desired behavior, you have to rely on the behavior in the training set to convey the right information. But a) it's hard to know that the desired behavior is "in" the training set in a very robust way and b) even if it were you might still run into problems like deception, not generalizing to out of distribution data, etc. Anyway, thanks for doing such a thorough write-up of it :) 
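A toy sketch of the underdetermination point in (a), using my own construction rather than anything from either post: two models can agree on every training example yet diverge arbitrarily off-distribution, so matching the training data alone does not pin down the "desired" behavior.

```python
# Toy illustration (hypothetical construction, not from either post):
# two polynomial "models" that fit the same training set exactly but
# disagree wildly away from it.
import numpy as np

x_train = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y_train = x_train ** 2  # the behavior we "desire" on the training set

# An exact degree-4 interpolant of the five training points.
base = np.poly1d(np.polyfit(x_train, y_train, deg=4))

# A second model: add a polynomial that vanishes at every training point,
# so the two models are indistinguishable on the training data.
bump = np.poly1d(np.poly(x_train))  # monic, with roots exactly at x_train
alt = base + 0.5 * bump

# Both models fit the training data essentially perfectly...
assert np.allclose(base(x_train), y_train)
assert np.allclose(alt(x_train), y_train)

# ...but disagree far off-distribution.
print(base(5.0), alt(5.0))  # roughly 25 vs. roughly 1285
```

The gap between `base` and `alt` is the gap between "behavior in the training set" and "desired behavior everywhere," which is why knowing the weights doesn't help without knowing which of the many data-consistent behaviors you actually wanted.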

Comment by aysja on "Does your paradigm beget new, good, paradigms?" · 2024-01-25T19:32:37.310Z · LW · GW

I think the guiding principle behind whether or not scientific work is good should probably look something more like “is this getting me closer to understanding what’s happening” where “understanding” is something like “my measurements track the thing in one-to-one lockstep with reality because I know the right typings and I’ve isolated the underlying causes well enough.”

AI control doesn’t seem like it’s making progress on that goal, which is certainly not to say it’s not important—it seems good to me to be putting some attention on locally useful things. Whereas the natural abstractions agenda does feel like progress on that front.

As an aside: I dislike basically all words about scientific progress at this point. I don’t feel like they’re precise enough and it seems easy to get satiated on them and lose track of what’s actually important which is, imo, absolute progress on the problem of understanding what the fuck is going on with minds. Calling this sort of work “science” risks lumping it in with every activity that happens in e.g., academia, and that isn’t right. Calling it “pre-paradigmatic” risks people writing it off as “Okay so people just sit around being confused for years? How could that be good?”

I wish we had better ways of talking about it. I think that more precisely articulating what our goals are with agent foundations/paradigmaticity/etc could be very helpful, not only for people pursuing it, but for others to even have a sense of what it might mean for field founding science to help in solving alignment. As it is, it seems to often get rounded off to “armchair philosophy” or “just being sort of perpetually confused” which seems bad.

Comment by aysja on My research agenda in agent foundations · 2024-01-25T00:47:41.466Z · LW · GW

I worry that overemphasizing fast feedback loops ends up making projects more myopic than is good for novel research. Like, unfortunately for everyone, a bunch of what makes good research is good research taste, and no one really understands what that is or how to get it and so it's tricky to design feedback loops to make it better. Like, I think the sense of "there's something interesting here that no one else is seeing" is sort of inherently hard to get feedback on, at least from other people. Because if you could, or if it were easy to explain from the get-go, then probably other people would have already thought of it. You can maybe get feedback from the world, but it often takes a while to turn that sense into something concrete. E.g., Einstein was already onto the idea that there was something strange about light and relativity from when he was 16, but he didn't have a shippable idea until about ten years later.

I don't think it always takes ten years, but deconfusion work is just... weird, e.g., Einstein was prone to bouts of "psychic tension," and would spend weeks in a "state of confusion." He wasn't part of academia, shared his thoughts with few people, and did very little experimentation. Which isn't to say that there was literally no feedback involved, there was, but it's a complicated story which I think unfortunately involves a bunch of feedback directly from his research taste (i.e., his taste was putting constraints like "unity," "logical," "physical meaning" on the hypothesis space). And this certainly isn't to say "let's not try to make it go faster," obviously, if one could make research faster that seems great. But I think it's a careful balancing act, and I worry that putting too much pressure on speed and legibility is going to end up causing people to do science under the streetlight. I really do not want this to happen. Field founding science is a bunch weirder than normal science, and I want to take care in giving research taste enough space to find its feet. 

Comment by aysja on My research agenda in agent foundations · 2024-01-25T00:26:46.695Z · LW · GW

My impression is that Alex is trying to figure out what things like "optimization" are actually like, and that this analysis will apply to a wider variety of systems than just ML. Which makes sense to me—imo, anchoring too much on current systems seems unlikely to produce general, robust solutions to alignment. 

Comment by aysja on A model of research skill · 2024-01-25T00:03:44.195Z · LW · GW

Seconded! I love Holden's posts on wicked problems, I revisit them like once a week or whenever I'm feeling down about my work :p

I've also found it incredibly useful to read historical accounts of great scientists. There's just all kinds of great thinking tips scattered among biographies, many of which I've encountered on LessWrong before, but somehow seeing them in the context of one particular intellectual journey is very helpful. 

Reading Einstein's biography (by Walter Isaacson) was by far my favorite. I felt like I got a really good handle for his style of thinking (e.g., how obsessed he was with unity—like how he felt it “unbearable” that there should be an essential difference between a magnet moving through a conducting coil and a coil moving around a magnet, although the theories at the time posited such a difference; his insistence on figuring out the physical meaning of things—with special relativity, this was the operationalization of "time," with quanta this was giving meaning to an otherwise mathematical curiosity that Planck had discovered; his specific style of thought experiments; and just a sense of how wonderful and visceral his curiosity about the world was, like how as a very young child his father brought him a compass, and as he watched the needle align due to some apparently hidden force field he trembled and grew cold at the prospect of non-mechanical causes). He's so cool! 

Comment by aysja on The Plan - 2023 Version · 2024-01-24T23:45:43.468Z · LW · GW

Yeah, I think I misspoke a bit. I do think that controllability is related to understanding, but also I’m trying to gesture at something more like “controllability implies simpleness.”

I think what I’m tracking with “controllability implies simpleness” is that the ease with which we can control things is a function of how many causal factors there are in creating it, i.e., “conjunctive things are less likely” in some Occam’s Razor sense, but also conjunctive things “cost more.” At the very least, they cost more from the agent’s point of view. Like, if I have to control every step in some horror graph in order to get the outcome I want, that’s a lot of stuff that has to go perfectly, and that’s hard. If “I,” on the other hand, as a virus, or cancer, or my brain, or whatever, only have to trigger a particular action to set off a cascade which reliably results in what I want, this is easier.

There is a sense, of course, in which the spaghetti code thing is still happening underneath it all, but I think it matters what level of abstraction you’re taking with respect to the system. Like, with enough zoom, even very simple activities start looking complex. If you looked at a ball following a parabolic arc at the particle level, it’d be pretty wild. And yet, one might reasonably assume that the regularity with which the ball follows the arc is meaningful. I am suspicious that much of the “it’s all ad-hoc” intuitions about biology/neuroscience/etc are making a zoom error (also typing errors, but that’s another problem). Even simple things can look complicated if you don’t look at them right, and I suspect that the ease with which we can operate in our world should be some evidence that this simpleness “exists,” much like the movement of the ball is “simple” even though a whole world of particles underlies its behavior.

To your point, I do think that the ability to control something is related to understanding it, but not exactly the same. Like, there indeed might be things in the environment that we don’t have much control over, even if we can clearly see what their effects are. Although, I’d be a little surprised if that were true for, e.g., obesity. Like, it seems likely that hormones (semaglutide) can control weight loss, which makes sense to me. A bunch of global stuff in bodies is regulated by hormones, in a similar way, I think, to how viruses “hook into” the cell’s machinery, i.e., it reliably triggers the right cascades. And controlling something doesn’t necessarily mean that it's possible to understand it completely. But I do suspect that the ability to control implies the existence of some level of analysis upon which the activity is pretty simple.

Comment by aysja on Recursive Middle Manager Hell · 2024-01-20T20:35:10.522Z · LW · GW

Are those things that good? I don't feel like I notice a huge quality of life difference from the pens I used ten years ago versus the pens I use now. Same with laptops and smartphones (although I care unusually little about that kind of thing so maybe I'm just not tracking it). Medicines have definitely improved although it seems worth noting that practically everyone I know has some terrible health problem they can't fix and we all still die. 

I feel like pushing the envelope on feature improvements is way easier than pushing the envelope on fundamental progress, and progress on the former seems compatible, to me, with pretty broken institutions. In some respects, small feature improvements are what you'd expect from middle manager hell, kind of like the lowest common denominator of a legible signal that you're doing something. It's true that these companies probably wouldn't exist if they were all around terrible. But imo it's more that they become pretty myopic and lame relative to what they could be. I think academia has related problems, too.

Comment by aysja on Recursive Middle Manager Hell · 2024-01-20T20:23:02.195Z · LW · GW

Google Brain was developed as part of X (Google's "moonshot factory"), which is their way of trying to create startups/startup culture within a large corporation. So was Waymo. 

Comment by aysja on Against most, but not all, AI risk analogies · 2024-01-14T06:42:51.870Z · LW · GW

They establish almost nothing of importance about the behavior and workings of real AIs, but nonetheless give the impression of a model for how we should think about AIs. 

How do you know that they establish nothing of importance? 

Many proponents of AI risk seem happy to critique analogies when they don't support the desired conclusion, such as the anthropomorphic analogy. 

At the very least, this seems to go both ways. Like, afaict, one of Quintin and Nora’s main points in “AI is Easy to Control” is that aligning AI is pretty much just like aligning humans, with the exception that we (i.e., backpropagation) have full access to the weights which makes aligning AI easier. But is aligning a human pretty much like aligning an AI? Can we count on the AI to internalize our concepts in the same way? Do humans come with different priors that make them much easier to “align”? Is the dissimilarity “AI might be vastly more intelligent and powerful than us” not relevant at all, on this question? Etc. But I don’t see them putting much rigor into that analogy—it’s just something that they assume and then move on. 

My point is that we should stop relying on analogies in the first place. Use detailed object-level arguments instead!

It seems reasonable, to me, to request more rigor when using analogies. It seems pretty wild to request that we stop relying on them altogether, almost as if you were asking us to stop thinking. Analogies seem so core to me when developing thought in novel domains, that it’s hard to imagine life without them. Yes, there are many ways AI might be. That doesn’t mean that our present world has nothing to say about it. E.g., I agree that evolution differs from ML in some meaningful ways. But it also seems like a mistake to completely throw out a major source of evidence we have about how intelligence was produced. Of course there will be differences. But no similarities? And do those similarities tell us nothing about the intelligences we might create? That seems like an exceedingly strong claim. 

Comment by aysja on Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training · 2024-01-14T04:45:35.371Z · LW · GW

I think lots of folks (but not all) would be up in arms, claiming "but modern results won't generalize to future systems!" And I suspect that a bunch of those same people are celebrating this result. I think one key difference is that this paper claims pessimistic results, and it's socially OK to make negative updates but not positive ones; and this result fits in with existing narratives and memes. Maybe I'm being too cynical, but that's my reaction.

Fwiw, my reaction to something like “we can finetune the AI to be nice in a stable way” is more like—but is it actually “nice”? I.e., I don’t feel like we’re at all clear on what “niceness” is, and behavioral proxies to that effect feel like some, but only pretty weak evidence about it.

This is my basic concern with evaluations, too. At the moment they just don’t seem at all robust enough for me to feel confident about what the results mean. But I see the sleeper agents work as progress towards the goal of “getting better at thinking about deception” and I feel pretty excited about that.

I think it’s reasonable to question whether these systems are in fact deceptive (just as I would with “niceness”). But when I evaluate results it’s not like “is this an optimistic or pessimistic update” and more like “does it seem like we understand more about X than we did before?” I think we understand more about deception because of this work, and I think that’s cool and important.

Comment by aysja on Gentleness and the artificial Other · 2024-01-13T03:52:16.923Z · LW · GW

This post is so wonderful, thank you for writing it. I’ve gone back to re-read many paragraphs over and over.

A few musings of my own:

“It’s just” … something. Oh? So eager, the urge to deflate. And so eager, too, the assumption that our concepts carve, and encompass, and withstand scrutiny. It’s simple, you see. Some things, like humans, are “sentient.” But Bing Sydney is “just” … you know. Actually, I don’t. What were you going to say? A machine? Software? A simulator? “Statistics?”

This has long driven me crazy. And I think you’re right about the source of the eagerness, although I suspect that mundanity is playing a role here, too. I suspect, in other words, that people often mistake the familiar for the understood—that no matter how strange some piece of reality is, if it happens frequently enough people come to find it normal; and hence, on some basic level, explained.

Like you, I have felt mesmerized by ctenophores at the Monterey Aquarium. I remember sitting there for an hour, staring at these curious creatures, watching their bioluminescent LED strips flicker as they gently floated in the blackness. It was so surreal. And every few minutes, this psychedelic experience would be interrupted by screaming children. Most of them would run up to the exhibit for a second, point, and then run on as their parents snapped a few pictures. Some would say “Mom, I’m bored, can we look at the otters?” And occasionally a couple would murmur to each other “That’s so weird.” But most people seemed unfazed.  

I’ve been unfazed at times, too. And when I am, it’s usually because I’m rounding off my experience to known concepts. “Oh, a fish-type thing? I know what that’s like, moving on.” As if “fish-type thing” could encompass the piece of reality behind the glass. Whereas when I have these ethereal moments of wonder—this feeling of brushing up against something that’s too huge to hold—I am dropping all of that. And it floods in, the insanity of it all—that “I” am a thing, watching this strange, flickering creature in front of me, made out of similar substances and yet so wildly different. So gossamer, the jellies are—and containing, presumably, experience. What could that be like? 

“Justs” are all too often a tribute to mundanity—the sentiment that the things around us are normal and hence, explained? And it’s so easy for things to seem normal when your experience of the world is smooth. I almost always feel myself mundane, for instance. Like a natural kind. I go to the store, call my friends, make dinner. All of it is so seamless—so regular, so simple—that it’s hard to believe any strangeness could be lurking beneath. But then, sometimes, the wonder catches me, and I remember how glaringly obvious it is that minds are the most fascinating phenomenon in the universe. I remember how insane it is—that some lumps of matter are capable of experience, of thought, of desire, of making reality bend to those desires. Are they? What does that mean? How could I be anything at all?

Minds are so weird. Not weird in the “things don’t add up to normality” way—they do. Just that, being a lump of matter like this is a deeply strange endeavor. And I fear that our familiarity with our selves blinds us to this fact. Just as it blinds us to how strange these new minds—this artificial Other, might be. And how tempting, it is, to take the thing that is too huge to hold and to paper over it with a “just” so that we may feel lighter. To mistake our blindness for understanding. How tragic a thing, to forego the wonder.