martinsq

1. It happened so discontinuously that we couldn't avoid it even with our concentrated effort.
2. It happened slowly, but for some reason we didn't make a concentrated effort. This could be because:
   a. We didn't notice it (e.g. intelligence explosion inside lab)
   b. We couldn't coordinate a concentrated effort, even if we all individually would want it to exist (e.g. no way to ensure China isn't racing faster)
   c. We didn't act individually rationally (e.g. Trump doesn't listen to advisors / Trump brainwashed by AI)

1 seems unlikelier by the day.
2a is mostly transparency inside labs (and less importantly into economic developments), which is important but at least some people are thinking about it.
There's a lot to think through in 2b and 2c. It might be critical to ensure early takeoff improves them, rather than degrading them (absent any drastic action to the contrary) until late takeoff can land the final blow.
If we assume enough hierarchical power structures, the situation simplifies into "what 5 world leaders do", and then it's pretty clear you mostly want communication channels and trustless agreements for 2b, and improving national decision-making for 2c.
Maybe what I'd be most excited to see from the "systemic risk" crowd is detailed thinking and exemplification on how assuming enough hierarchical power structures is wrong (that is, the outcome depends strongly on things other than what those 5 world leaders do), what are the most x-risk-worrisome additional dynamics in that area, and how to intervene on them.
(Maybe all this is still too abstract, but it cleared my head)

Comment by Martín Soto (martinsq) on Weird Random Newcomb Problem · 2025-04-13T09:28:50.845Z · LW · GW

No, the utility here is just the amount of money gets

I meant that it sounded like you "wanted a better average score (over all a) when you are randomly sampled as b than other programs". Although again I think the intuition-pumping is misleading here because the programmer is choosing which b to fix, but not which a to fix. So whether you wanna one-box only depends on whether you condition on a = b.

Comment by Martín Soto (martinsq) on Weird Random Newcomb Problem · 2025-04-12T10:23:31.593Z · LW · GW

(Just skimmed, also congrats on the work)

Why is this surprising? You're basically assuming that there is no correlation between what program Omega predicts and what program you actually are. That is, Omega is no predictor at all! Thus, obviously you two-box, because one-boxing would have no effect on what Omega predicts. (Or maybe the right way to think about this is: it will have a tiny but non-zero effect, because you are one of the |P| programs, but since |P| is huge, that is ~0.)

When instead you condition on a = b, this becomes a different problem: Omega is now a perfect predictor! So you obviously one-box.
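To make the contrast between the two distributions concrete, here is a minimal sketch in Python. It assumes the standard Newcomb payoffs (a million in the opaque box if the sampled program one-boxes, plus a thousand for two-boxing). The names `payoff`, `expected_payoff_prior` and `expected_payoff_given_a_equals_b`, the pool size and the fraction of one-boxers are all illustrative assumptions, not taken from the post.

```python
from fractions import Fraction

MILLION = 1_000_000
THOUSAND = 1_000


def payoff(sampled_one_boxes: bool, you_one_box: bool) -> int:
    """Money you receive, assuming standard Newcomb payoffs."""
    opaque = MILLION if sampled_one_boxes else 0      # filled based on the sampled program
    transparent = 0 if you_one_box else THOUSAND      # only taken when you two-box
    return opaque + transparent


def expected_payoff_prior(you_one_box: bool, n_programs: int,
                          frac_others_one_box: Fraction) -> Fraction:
    """Expected payoff when Omega samples uniformly from n_programs programs.

    You are exactly one of them; frac_others_one_box is a hypothetical
    fraction of the remaining programs that one-box.
    """
    p_self = Fraction(1, n_programs)
    if_self = payoff(you_one_box, you_one_box)
    if_other = (frac_others_one_box * payoff(True, you_one_box)
                + (1 - frac_others_one_box) * payoff(False, you_one_box))
    return p_self * if_self + (1 - p_self) * if_other


def expected_payoff_given_a_equals_b(you_one_box: bool) -> int:
    """Payoff once you condition on Omega having sampled exactly you."""
    return payoff(you_one_box, you_one_box)


N = 10**9            # hypothetical size of P
HALF = Fraction(1, 2)  # hypothetical: half of the other programs one-box
for one_box in (True, False):
    label = "one-box" if one_box else "two-box"
    print(label,
          float(expected_payoff_prior(one_box, N, HALF)),
          expected_payoff_given_a_equals_b(one_box))
```

In this toy model, two-boxing wins under the prior by the thousand minus a million divided by |P|, whatever the other programs do, while conditioning on a = b flips the ranking in favor of one-boxing.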

Another way to frame this is whether you optimize through the inner program or assume it fixed. From your prior, Omega samples randomly, so you obviously don't want to optimize through it. Not only because |P| is huge, but actually more importantly, because for any policy you might want to implement, there exists a policy that implements the exact opposite (and is just as likely to be sampled as you), thus your effect nets out to exactly 0. But once you update on (condition on) seeing that actually Omega sampled exactly you (instead of your random twin, or any other old program), then you'd want to optimize through a! But, you can't have your cake and eat it too. You cannot shape yourself to reap that reward (in the world where Omega samples exactly you), without also shaping yourself to give up some reward (relative to other players) in the world where Omega samples your evil twin.
(Thus you might want to say: Aha! I will make my program behave differently depending on whether it is facing itself or an evil twin! But then... I can also create an evil twin relative to that more complex program. And we keep on like this forever. That is, you cannot actually, fully, ever condition your evil twin away, because you are embedded in Omega's distribution. I think this is structurally equivalent to some commitment race dynamics I discussed with James, but I won't get into that here.)

Secretly, I think this duality only feels counter-intuitive because it's an instance of dynamic inconsistency, = you want to take different actions once you update on information, = the globally (from the uncorrelated prior) optimal action is not always the same as the locally optimal action (from a particular posterior, like the one assuming correlation). Relatedly, I think the only reason your Universal framing differs from your Functional and Anthropic framings is exactly that they are (implicitly) using these two different distributions (one without correlation, the other with):

The Universal framing assumes that Omega samples randomly (no correlation).
The Functional framing assumes that "you have control over the inner program". But this is sneaking something in. You obviously have control over your own program. But you don't actually have any control over "the program that Omega will randomly sample from P" (because anything you do is cancelled by your evil twin). Thus, assuming you have control over the inner program is equivalent to assuming that Omega sampled you. That is, yes correlation.
The Anthropic framing is also implicitly assuming "Omega samples you", although it's a bit more complicated since it also depends on your utility function (how much you care about different programs getting different amounts of money):
- If your distribution is truly "Omega samples programs at random", then when you observe two equal numbers, you are almost certain to be in a simulation. Given that, if you for example care equally about all programs getting as much money as possible, then of course you should one-box. That will entail that each random program (with a tiny probability) gets a million, which is a huge win. But the intuition that you were expressing in Question 2 ("p2 is better than p1 because it scores better") isn't compatible with "caring equally about all programs". Instead, it sounds as if you positively want to score better than other programs, that is, maximize your score and minimize theirs! If that's the case, then you obviously should two-box, since almost certainly you are subtracting a million from another program, not yourself. Even assuming, for simplicity, that you only care about beating one other program (as in the p1 and p2 example), you should two-box, because you are subtracting the same million dollars from both, but you are gaining a very slight edge with the thousand dollars.
- If your distribution is instead "a = b" (assumes correlation), then, regardless of whether you want to maximize everyone's payoff or you want to beat another program that could be sampled, you want to one-box, since the million dollar benefit is coming straight to you, and is bigger than the thousand dollar benefit.

Comment by Martín Soto (martinsq) on So how well is Claude playing Pokémon? · 2025-03-08T14:42:18.663Z · LW · GW

It's unclear what the optimal amount of thinking per step is. My initial guess would have been that letting Claude think for a whole paragraph before each single action (rather than only each 10 actions, or whenever it's in a match, or whatever) scores slightly better than letting it think more (sequentially). But I guess this might work better if it's what the streamer is using after some iteration.

The story for parallel checks could be different though. My guess would be that going all out would mitigate robustness problems (like "I close a goal before actually achieving it"): let Claude generate the paragraph 5 times, then generate 5 more parallel paragraphs about whether it has gotten something wrong, then have a lower-context version of Claude decide whether there are any important disagreements, and if not, just majority-vote. But maybe this adds too much bloat and opportunities for mistakes, or makes some mistakes better but others way worse.
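A rough sketch of that parallel-check loop, in Python, might look like the following. Everything here is hypothetical: `propose`, `critique` and `has_important_disagreement` stand in for model calls and are passed in as plain callables, and returning `None` when the judge flags a disagreement is my own assumption, since what to do in that case is left open above.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

# A proposal is (reasoning paragraph, chosen action). This shape is an assumption.
Proposal = Tuple[str, str]


def choose_action(
    propose: Callable[[], Proposal],
    critique: Callable[[Proposal], str],
    has_important_disagreement: Callable[[List[Proposal], List[str]], bool],
    n_samples: int = 5,
) -> Optional[str]:
    """Sample proposals, critique each, then majority-vote unless the judge objects."""
    # Step 1: generate the reasoning paragraph n_samples times.
    proposals = [propose() for _ in range(n_samples)]
    # Step 2: generate one "has something gone wrong?" paragraph per proposal.
    critiques = [critique(p) for p in proposals]
    # Step 3: a lower-context judge looks for important disagreements.
    if has_important_disagreement(proposals, critiques):
        return None  # assumption: the caller escalates or retries
    # Step 4: no important disagreement, so take the most common action.
    votes = Counter(action for _, action in proposals)
    return votes.most_common(1)[0][0]


# Toy usage with canned stand-ins instead of real model calls.
canned = iter([("r1", "press A"), ("r2", "press A"), ("r3", "press B"),
               ("r4", "press A"), ("r5", "press A")])
print(choose_action(lambda: next(canned),
                    lambda p: "looks fine",
                    lambda ps, cs: any("wrong" in c for c in cs)))
```

Each extra stage here is also a new place for errors to enter, which is the bloat concern raised above.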

Comment by Martín Soto (martinsq) on Abstract Mathematical Concepts vs. Abstractions Over Real-World Systems · 2025-02-19T10:59:30.188Z · LW · GW

I don't see how Take 4 is anything other than simplicity (in the human/computational language). As you say, it's a priori unclear whether an agent is an instance of a human or the other way around. You say the important bit is that you are subtracting properties from a human to get an agent. But how shall we define subtraction here? In one formal language, the definition of human will indeed be a superset of that of agent... but in another one it will not. So you need to choose a language. And the natural way forward every time this comes up (many times), is to just "weigh by Turing computations in the real world" (instead of choosing a different and odd-looking Universal Turing Machine), that is, a simplicity prior.

Comment by Martín Soto (martinsq) on evhub's Shortform · 2025-02-05T12:03:07.766Z · LW · GW

Imo rationalists tend to underestimate the arbitrariness involved in choosing a CEV procedure (= moral deliberation in full generality).

Like you, I endorse the step of "scoping the reference class" (along with a thousand other preliminary steps). Preemptively fixing it in place helps you to the extent that the humans wouldn't have done it by default. But if the CEV procedure is governed by a group of humans so selfish/unthoughtful as to not even converge on that by themselves, then I'm sure that there'll be at least a few hundred other aspects (both more and less subtle than this one) that you and I obviously endorse, but they will not implement, and will drastically affect the outcome of the whole procedure.
In fact, it seems strikingly plausible that even among EAs, the outcome could depend drastically on seemingly-arbitrary starting conditions (like "whether we use deliberation-and-distillation procedure #194 or #635, which differ in some details"). And "drastically" means that, even though both outcomes still look somewhat kindness-shaped and friendly-shaped, one's optimum is worth <10% to the other's utility (or maybe, this holds for the scope-sensitive parts of their morals, since the scope-insensitive ones are trivial to satisfy).

To pump related intuitions about how difficult and arbitrary moral deliberation can get, I like Demski here.

Comment by Martín Soto (martinsq) on Gradual Disempowerment, Shell Games and Flinches · 2025-02-02T23:27:05.400Z · LW · GW

I'm sure some of people's ignorance of these threat models comes from the reasons you mention. But my intuition is that most of it comes from "these are vaguer threat models that seem very up in the air, and other ones seem more obviously real and more shovel-ready" (this is similar to your "Flinch", but I think more conscious and endorsed).

Thus, I think the best way to converge on whether these threat models are real/likely/actionable is to work through as-detailed-as-possible example trajectories. Someone objects that the state will handle it? Let's actually think through what the state might look like in 5 years! Someone objects that democracy will prevent it? Let's actually think through the actual consequences of cheap cognitive labor in democracy!
This is analogous to what pessimists about single-single alignment have gone through. They have some abstract arguments, people don't buy them, so they start working through them in more detail or provide example failures. I buy some parts of them, but not others. And if you did the same for this threat model, I'm uncertain how much I'd buy!

Of course, the paper might have been your way of doing that. I enjoyed it, but still would have preferred more fully detailed examples, on top of the abstract arguments. You do use examples (both past and hypothetical), but they are more like "small, local examples that embody one of the abstract arguments", rather than "an ambitious (if incomplete) and partly arbitrary picture of how these abstract arguments might actually pan out in practice". And I would like to know the messy details of how you envision these abstract arguments coming into contact with reality. This is why I liked TASRA, and indeed I was more looking forward to an expanded, updated and more detailed version of TASRA.

Comment by Martín Soto (martinsq) on Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development · 2025-01-31T20:05:01.162Z · LW · GW

Just writing a model that came to mind, partly inspired by Ryan here.

Extremely good single-single alignment should be highly analogous to "current humans becoming smarter and faster thinkers".

If this happens at roughly the same speed for all humans, then each human is better equipped to increase their utility, but does this lead to a higher global utility? This can be seen as a race between the capability (of individual humans) to find and establish better coordination measures, and their capability to selfishly subvert these coordination measures. I do think it's more likely than not that the former wins, but it's not guaranteed.
Probably someone like Ryan believes most of those failures will come in the form of explicit conflict or sudden attacks. I can also imagine slower erosions of global utility, for example by safe interfaces/defenses between humans becoming unworkable slop into which most resources go.

If this doesn't happen at roughly the same speed for all humans, you also get power imbalance and its consequences. One could argue that differences in resources between humans will grow, in which case this is the only stable state.

If instead of perfect single-single alignment we get the partial (or more taxing) fix I expect, the situation degrades further. Extending the analogy, this would be the smart humans sometimes being possessed by spirits with different utilities, which not only has direct negative consequences but could also complicate coordination once it's common knowledge.

Comment by Martín Soto (martinsq) on Catastrophe through Chaos · 2025-01-31T17:19:17.724Z · LW · GW

Fantastic snapshot. I wonder (and worry) whether we'll look back on it with feelings similar to those we now have for "What 2026 looks like".

There is also no “last resort war plan” in which the president could break all of the unstable coordination failures and steer the ship.
[...]
There are no clear plans for what to do under most conditions, e.g. there is no clear plan for when and how the military should assume control over this technology.

These sound intuitively unlikely to me, by analogy to nuclear or bio. Of course, that is not to say these protocols will be sufficient or even sane, by analogy to nuclear or bio.

This makes it really unclear what to work on.

It's not super obvious to me that there won't be clever ways to change local incentives / improve coordination, and successful interventions in this direction would seem incredibly high-leverage, since they're upstream of many of the messy and decentralized failure modes. If they do exist, they probably look not like "a simple coordination mechanism", and more like "a particular actor gradually steering high-stakes conversations (through a sequence of clever actions) to bootstrap minimal agreements". Of course, similarity to past geopolitical situations does make it seem unlikely on priors.

There is no time to get to very low-risk worlds anymore. There is only space for risk reduction along the way.

My gut has been in agreement for some time that the most cost-effective x-risk reduction now probably looks like this.

Comment by Martín Soto (martinsq) on AI companies are unlikely to make high-assurance safety cases if timelines are short · 2025-01-27T21:35:04.149Z · LW · GW

I agree with conjunctiveness, although again more optimistic about huge improvements. I mostly wanted to emphasize that I'm not sure there are structurally robust reasons (as opposed to personal whims) why huge spending on safety won't happen.

Comment by Martín Soto (martinsq) on Tell me about yourself: LLMs are aware of their learned behaviors · 2025-01-24T19:26:03.430Z · LW · GW

Speaking for myself (not my coauthors), I don't agree with your two items, because:

if your models are good enough at code analysis to increase their insecurity self-awareness, you can use them in other more standard and efficient ways to improve the dataset
doing self-critique the usual way (look over your own output) seems much more fine-grained and thus efficient than asking the model whether it "generally uses too many try-excepts"

More generally, I think behavioral self-awareness for capability evaluation is and will remain strictly worse than the obvious capability evaluation techniques.

That said, I do agree systematic inclusion of considerations about negative externalities should be a norm, and thus we should have done so. I will briefly say now that a) behavioral self-awareness seems differentially more relevant to alignment than capabilities, and b) we expected lab employees to find out about this themselves (in part because this isn't surprising given out-of-context reasoning), and we in fact know that several lab employees did. Thus I'm pretty certain the positive externalities of building common knowledge and thinking about alignment applications are notably bigger.

Comment by Martín Soto (martinsq) on AI companies are unlikely to make high-assurance safety cases if timelines are short · 2025-01-24T19:11:53.083Z · LW · GW

Most difficulties you raise here could imo change drastically with tens of billions being injected into AI safety, especially thanks to new ideas coming out of left field that might make safety cases way more efficient. (I'm probably more optimistic about new ideas than you, partly because "it always subjectively feels like there are no big ideas left", and AI safety is so young.)
If your government picks you as a champion and gives you amazing resources, you no longer have to worry about national competition, and that amount seems doable. You still have to worry about international competition, but will you feel so closely tied that you can't even spare that much? My guess would be no. That said, I still don't expect certain lab leaders to want to do this.

The same is not true of security though, that's a tough one.

Comment by Martín Soto (martinsq) on Eliciting bad contexts · 2025-01-24T13:27:46.634Z · LW · GW

See our recent work (especially the section on backdoors), which opens the door to directly asking the model, although there are obstacles like the Reversal Curse and it's unclear whether it can be made to scale.

Comment by Martín Soto (martinsq) on A breakdown of AI capability levels focused on AI R&D labor acceleration · 2024-12-23T17:40:24.753Z · LW · GW

I have two main problems with t-AGI:

A third one is a definitional problem exacerbated by test-time compute: What does it mean for an AI to succeed at task T (which takes humans X hours)? Maybe it only succeeds when an obscene amount of test-time compute is poured in. It seems unavoidable to define things in terms of resources, as you do.

Comment by Martín Soto (martinsq) on Automation collapse · 2024-11-12T10:42:03.585Z · LW · GW

Very cool! But I think there's a crisper way to communicate the central point of this piece (or at least, a way that would have been more immediately transparent to me). Here it is:

Say you are going to use Process X to obtain a new Model. Process X can be as simple as "pre-train on this dataset", or as complex as "use a bureaucracy of Model A to train a new LLM, then have Model B test it, then have Model C scaffold it into a control protocol, then have Model D produce some written arguments for the scaffold being safe, have a human read them, and if they reject delete everything". Whatever Process X is, you have only two ways to obtain evidence that Process X has a particular property (like "safety"): looking a priori at the spec of Process X (without running it), or running (parts of) Process X and observing its outputs a posteriori. In the former case, you clearly need an argument for why this particular spec has the property. But in the latter case, you also need an argument for why observing those particular outputs ensures the property for this particular spec. (Pedantically speaking, this is just Kuhn's theory-ladenness of observations.)

Of course, the above reasoning doesn't rule out the possibility that the required arguments are pretty trivial to make. That's why you summarize some well-known complications of automation, showing that the argument will not be trivial when Process X contains a lot of automation, and in fact it'd be simpler if we could do away with the automation.

It is also the case that the outputs observed from Process X might themselves be human-readable arguments. While this could indeed alleviate the burden of human argument-generation, we still need a previous (possibly simpler) argument for why "a human accepting those output arguments" actually ensures the property (especially given those arguments could be highly out-of-distribution for the human).

Comment by Martín Soto (martinsq) on Winning isn't enough · 2024-11-06T12:49:55.235Z · LW · GW

My understanding from discussions with the authors (but please correct me):

This post is less about pragmatically analyzing which particular heuristics work best for ideal or non-ideal agents in common environments (assuming a background conception of normativity), and more about the philosophical underpinnings of normativity itself.

Maybe it's easiest if I explain what this post grows out of:

There seems to be a widespread vibe amongst rationalists that "one-boxing in Newcomb is objectively better, because you simply obtain more money, that is, you simply win". This vibe is no coincidence, since Eliezer and Nate, in some of their writing about FDT, use language strongly implying that decision theory A is objectively better than decision theory B because it just wins more. Unfortunately, this intuitive notion of winning cannot actually be made into a philosophically valid objective metric. (In more detail, a precise definition of winning is already decision-theory-complete, so these arguments beg the question.) This point is well-known in philosophical academia, and was already succinctly explained in a post by Caspar (which the authors mention).

In the current post, the authors extend a similar philosophical critique to other widespread uses of winning, or background assumptions about rationality. For example, some people say that "winning is about not playing dominated strategies"... and the authors agree about avoiding dominated strategies, but point out that this is not too action-guiding, because it is consistent with many policies. Or also, some people say that "rationality is about implementing the heuristics that have worked well in the past, and/or you think will lead to good future performance"... but these utterances hide other philosophical assumptions, like assuming the same mechanisms are at play in the past and future, which are especially tenuous for big problems like x-risk. Thus, vague references to winning aren't enough to completely pin down and justify behavior. Instead, we fundamentally need additional constraints or principles about normativity, what the authors call non-pragmatic principles. Of course, these principles cannot themselves be justified in terms of past performance (which would lead to circularity), so they instead need to be taken as normative axioms (just like we need ethical axioms, because ought cannot be derived from is).

Comment by Martín Soto (martinsq) on What's a good book for a technically-minded 11-year old? · 2024-10-22T13:08:09.193Z · LW · GW

GEB

Comment by Martín Soto (martinsq) on My motivation and theory of change for working in AI healthtech · 2024-10-12T14:26:50.510Z · LW · GW

Like Andrew, I don't see strong reasons to believe that near-term loss-of-control accounts for more x-risk than medium-term multi-polar "going out with a whimper". This is partly due to thinking oversight of near-term AI might be technically easy. I think Andrew also thought along those lines: an intelligence explosion is possible, but relatively easy to prevent if people are scared enough, and they probably will be. Although I do have lower probabilities than him, and some different views on AI conflict. Interested in your take @Daniel Kokotajlo

Comment by Martín Soto (martinsq) on Martín Soto's Shortform · 2024-10-11T09:39:50.357Z · LW · GW

You know that old thing where people solipsistically optimizing for hedonism are actually less happy? (relative to people who have a more long-term goal related to the external world) You know, "Whoever seeks God always finds happiness, but whoever seeks happiness doesn't always find God".

My anecdotal experience says this is very true. But why?

One explanation could be in the direction of what Eliezer says here (inadvertently rewarding your brain for suboptimal behavior will get you depressed):

Someone with a goal has an easier time getting out of local minima, because it is very obvious those local minima are suboptimal for the goal. For example, you get out of bed even when the bed feels nice. Whenever the occasional micro-breakdown happens (like feeling a bit down), you power through for your goal anyway (micro-dosing suffering as a consequence), so your brain learns that micro-breakdowns only ever lead to bad immediate sensations and fixes them fast.

Someone whose only objective is the satisfaction of their own appetites and desires has a harder time reasoning themselves out of local optima. Sure, getting out of bed allows me to do stuff that I like. But those feel distant now, and the bed now feels comparably nice... You are now comparing apples to apples (unlike someone with an external goal), and sometimes you might choose the local optimum. When the occasional micro-breakdown happens, you are more willing to try to soften the blow and take care of the present sensation (instead of getting over the bump quickly), which rewards in the wrong direction.

Another possibly related dynamic: When your objective is satisfying your desires, you pay more conscious attention to your desires, and this probably creates more desires, leading to more unsatisfied desires (which is way more important than the number of satisfied desires?).

Comment by Martín Soto (martinsq) on Decision Theory in Space · 2024-09-14T19:44:47.192Z · LW · GW

hahah yeah but the only point here is: it's easier to credibly commit to a threat if executing the threat is cheap for you. And this is simply not too interesting a decision-theoretic point, just one more obvious pragmatic consideration to throw into the bag. The story even makes it sound like "Vader will always be in a better position", or "it's obvious that Leia shouldn't give in to Tarkin but should give in to Vader", and that's not true. Even though Tarkin loses more from executing the threat than Vader, the only thing that matters for Leia is how credible the threat is. So if Tarkin had any additional way to make his commitment credible (like program the computer to destroy Alderaan if the base location is not revealed), then there would be no difference between Tarkin and Vader. The fact that "Tarkin might constantly reconsider his decision even after claiming to commit" seems like a contingent state of affairs of human brains (or certain human brains in certain situations), not something important in the grander scheme of decision theory.
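As a toy illustration of that point (a sketch under made-up assumptions, not a model from the story): suppose giving in fully averts the threat, and Leia simply compares her expected loss from the threat against the cost of complying. The function name and all the numbers below are hypothetical. Note that the threatener's own cost of execution never appears as a parameter; it can only matter by changing `credibility`.

```python
def leia_gives_in(credibility: float,
                  loss_if_executed: float,
                  cost_of_giving_in: float) -> bool:
    """Give in iff the expected loss from the threat exceeds the cost of complying.

    Assumption: complying averts the threat entirely.
    """
    return credibility * loss_if_executed > cost_of_giving_in


LOSS = 100.0  # hypothetical value Leia places on Alderaan
COST = 50.0   # hypothetical cost to Leia of revealing the base

# Hypothetical credibilities: Tarkin may reconsider, so his bare threat is less credible.
tarkin_bare = leia_gives_in(0.2, LOSS, COST)
# With a commitment device (e.g. the pre-programmed computer), he is as credible as Vader.
tarkin_committed = leia_gives_in(1.0, LOSS, COST)
vader = leia_gives_in(1.0, LOSS, COST)

print(tarkin_bare, tarkin_committed, vader)
```

Once the commitment device equalizes credibility, the two threateners are indistinguishable from Leia's side.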

Comment by Martín Soto (martinsq) on Book Recommendations for social skill development? · 2024-08-21T23:46:50.481Z · LW · GW

Comment by Martín Soto (martinsq) on Decision Theory in Space · 2024-08-18T20:46:20.238Z · LW · GW

The only decision-theoretic points that I could see this story making are pretty boring, at least to me.

Comment by Martín Soto (martinsq) on In Defense of Open-Minded UDT · 2024-08-13T23:00:37.994Z · LW · GW

That is: in this case at least it seems like there's concrete reason to believe we can have some cake and eat some too.

I disagree with this framing. Sure, if you have 5 different cakes, you can eat some and have some. But for any particular cake, you can't do both. Similarly, if you face 5 (or infinitely many) identical decision problems, you can choose to be updateful in some of them (thus obtaining useful Value of Information, that increases your utility in some worlds), and updateless in others (thus obtaining useful strategic coherence, that increases your utility in other worlds). The fundamental dichotomy remains as sharp, and it's misleading to imply we can surmount it. It's great to discuss, given this dichotomy, which trade-offs we humans are more comfortable making. But I've felt this was obscured in many relevant conversations.

This content-work seems primarily aimed at discovering and navigating actual problems similar to the decision-theoretic examples I'm using in my arguments. I'm more interested in gaining insights about what sorts of AI designs humans should implement. IE, the specific decision problem I'm interested in doing work to help navigate is the tiling problem.

My point is that the theoretical work you are shooting for is so general that it's closer to "what sorts of AI designs (priors and decision theories) should always be implemented", rather than "what sorts of AI designs should humans in particular, in this particular environment, implement".
And I think we won't gain insights on the former, because there are no general solutions, due to fundamental trade-offs ("no-free-lunches").
I think we could gain many insights on the latter, but that the methods better fit for that are less formal/theoretical and way messier/"eye-balling"/iterating.

Comment by Martín Soto (martinsq) on In Defense of Open-Minded UDT · 2024-08-13T19:24:54.738Z · LW · GW

Excellent explanation, congratulations! Sad I'll have to miss the discussion.

Interlocutor: Neither option is plausible. If you update, you're not dynamically consistent, and you face an incentive to modify into updatelessness. If you bound cross-branch entanglements in the prior, you need to explain why reality itself also bounds such entanglements, or else you're simply advising people to be delusional.

You found yourself a very nice interlocutor. I think we truly cannot have our cake and eat it: either you update, making you susceptible to infohazards=traps (if they exist, and they might exist), or you don't, making you entrenched forever. I think we need to stop dancing around this fact, recognize that a fully-general solution in the formalism is not possible, and instead look into the details of our particular case. Sure, our environment might be adversarially bad, traps might be everywhere. But under this uncertainty, which ways do we think are best to recognize and prevent traps (while updating on other things)? This is kind of studying and predicting generalization: given my past observations, where do I think I will suddenly fall out of distribution (into a trap)?

Me: I'm not sure if that's exactly the condition, but at least it motivates the idea that there's some condition differentiating when we should be updateful vs updateless. I think uncertainty about "our own beliefs" is subtly wrong; it seems more like uncertainty about which beliefs we endorse.

This was very thought-provoking, but unfortunately I still think this crashes head-on with the realization that, a priori and in full generality, we can't differentiate between safe and unsafe updates. Indeed, why would we expect that no one will punish us by updating on "our own beliefs" or "which beliefs I endorse"? After all, that's just one more part of reality (without a clear boundary separating it).

It sounds like you are correctly explaining that our choice of prior will be, in some important sense, arbitrary: we can't know the correct one in advance, we always have to rely on extrapolating contingent past observations.
But then, it seems like your reaction is still hoping that we can have our cake and eat it: "I will remain uncertain about which beliefs I endorse, and only later will I update on the fact that I am in this or that reality. If I'm in the Infinite Counterlogical Mugging... then I will just eventually change my prior because I noticed I'm in the bad world!". But then again, why would we think this update is safe? That's just not being updateless, and losing out on the strategic gains from not updating.

Since a solution doesn't exist in full generality, I think we should pivot to more concrete work related to the "content" (our particular human priors and our particular environment) instead of the "formalism". For example:

- Conceptual or empirical work on which are the robust and safe ways to extract information from humans (Suddenly LLM pre-training becomes safety work)
- Conceptual or empirical work on which actions or reasoning are more likely to unearth traps under different assumptions (although this work could unearth traps)
- Compilation or observation of properties of our environment (our physical reality) that could have some weak signal on which kinds of moves are safe
  - Unavoidably, this will involve some philosophical / almost-ethical reflection about which worlds we care about and which ones we are willing to give up.

Comment by Martín Soto (martinsq) on Richard Ngo's Shortform · 2024-08-13T19:05:03.838Z · LW · GW

I think Nesov had some similar idea about "agents deferring to a (logically) far-away algorithm-contract Z to avoid miscoordination", although I never understood it completely, nor think that idea can solve miscoordination in the abstract (only, possibly, be a nice pragmatic way to bootstrap coordination from agents who are already sufficiently nice).

EDIT 2: UDT is usually prone to commitment races because it thinks of each agent in a conflict as separately making commitments earlier in logical time. But focusing on symmetric commitments gets rid of this problem.

Hate to always be that guy, but if you are assuming all agents will only engage in symmetric commitments, then you are assuming commitment races away. In actuality, it is possible for a (meta-) commitment race to happen about "whether I only engage in symmetric commitments".

Comment by Martín Soto (martinsq) on Richard Ngo's Shortform · 2024-08-13T19:01:20.119Z · LW · GW

I don't understand your point here, explain?

Say there are 5 different veils of ignorance (priors) that most minds consider Schelling (you could try to argue there will be exactly one, but I don't see why).

If everyone simply accepted exactly the same one, then yes, lots of nice things would happen and you wouldn't get catastrophically inefficient conflict.

But every one of these 5 priors will have different outcomes when it is implemented by everyone. For example, maybe in prior 3 agent A is slightly better off and agent B is slightly worse off.

So you need to give me a reason why a commitment race doesn't recur at the level of "choosing which of the 5 priors everyone should implement". That is, maybe A will make a very early commitment to only ever implement prior 3. As always, this is rational if A thinks the others will react a certain way (give in to the threat and implement 3). And I don't have a reason to expect agents not to have such priors (although I agree they are slightly less likely than more common-sensical priors).

That is, as always, the commitment races problem doesn't have a general solution on paper. You need to get into the details of our multi-verse and our agents to argue that they won't have these crazy priors and will coordinate well.
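A toy version of the "5 priors" situation, as a minimal sketch. All payoff numbers, the two-agent setup, and the assumption that mismatched commitments lead to a zero-payoff conflict are made up for illustration; nothing here is a general model of commitment races.

```python
# Hypothetical payoffs: PAYOFFS[prior][agent] is the agent's utility
# if *everyone* implements that prior (veil of ignorance).
PAYOFFS = {
    1: {"A": 5.0, "B": 5.0},
    2: {"A": 4.0, "B": 6.0},
    3: {"A": 6.0, "B": 4.0},
    4: {"A": 4.5, "B": 5.5},
    5: {"A": 5.5, "B": 4.5},
}

# Assumed outcome when the agents commit to different priors.
CONFLICT = {"A": 0.0, "B": 0.0}


def favorite_prior(agent):
    """Return the prior that gives this agent the highest payoff."""
    return max(PAYOFFS, key=lambda prior: PAYOFFS[prior][agent])


def outcome(commit_a, commit_b):
    """Payoffs when A and B have committed to the given priors."""
    return PAYOFFS[commit_a] if commit_a == commit_b else CONFLICT


a_choice = favorite_prior("A")
b_choice = favorite_prior("B")
print("A commits to", a_choice, "and B commits to", b_choice)
print("Both stubborn:", outcome(a_choice, b_choice))
print("B gives in to A:", outcome(a_choice, a_choice))
```

In this toy table each agent has a different favorite prior, and B does better by giving in to A's early commitment than by entering the conflict. That is exactly the structure that makes A's early commitment tempting, so the race reappears one level up.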

This seems to be claiming that in some multiverses, the gains to powerful agents from being hawkish outweigh the losses to weak agents. But then why is this a problem? It just seems like the optimal outcome.

It seems likely that in our universe there are some agents with arbitrarily high gains-from-being-hawkish, that don't have correspondingly arbitrarily low measure. (This is related to Pascalian reasoning, see Daniel's sequence.) For example, someone whose utility is exponential in the number of paperclips. I don't agree that the optimal outcome (according to my ethics) is for me (whose utility is at most linear in happy people) to turn all my resources into paperclips.
Maybe if I was a preference utilitarian biting enough bullets, this would be the case. But I just want happy people.

Comment by Martín Soto (martinsq) on Richard Ngo's Shortform · 2024-08-09T16:47:00.273Z · LW · GW

Nice!

Proposal 4: same as proposal 3 but each agent also obeys commitments that they would have made from behind a veil of ignorance where they didn't yet know who they were or what their values were. From that position, they wouldn't have wanted to do future destructive commitment races.

I don't think this solves Commitment Races in general, because of two different considerations:

Trivially, I can say that you still have the problem when everyone needs to bootstrap a Schelling veil of ignorance.
Less trivially, even behind the most simple/Schelling veils of ignorance, I find it likely that hawkish commitments are incentivized. For example, the veil might say that you might be Powerful agent A, or Weak agent B, and if some Powerful agents have weird enough utilities (and this seems likely in a big pool of agents), hawkishly committing in case you are A will be a net-positive bet.
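The second consideration is just an expected-value comparison, which can be sketched as follows. The function name, the probabilities and the utility numbers are hypothetical, and the sketch assumes utilities of different agents can be put on a common scale behind the veil (itself a contentious assumption).

```python
def expected_gain_from_hawkish(p_powerful, gain_if_powerful, loss_if_weak):
    """Expected utility change, evaluated behind the veil, of the policy
    'commit hawkishly if I turn out to be the Powerful agent A'.

    p_powerful: probability the veil assigns to being Powerful agent A.
    gain_if_powerful: what A gains from the hawkish commitment.
    loss_if_weak: what Weak agent B loses when A is hawkish.
    """
    return p_powerful * gain_if_powerful - (1.0 - p_powerful) * loss_if_weak


# A Powerful agent with a "weird enough" (very steep) utility function:
steep = expected_gain_from_hawkish(0.01, 1_000_000.0, 10.0)

# A Powerful agent with more ordinary stakes:
ordinary = expected_gain_from_hawkish(0.01, 100.0, 10.0)

print("steep utility:", steep)
print("ordinary utility:", ordinary)
```

With the steep utility the bet is net-positive even though being A is unlikely; with ordinary stakes it is net-negative. The worry is that a big pool of agents plausibly contains some of the first kind.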

This might still mostly solve Commitment Races in our particular multi-verse. I have intuitions both for and against this bootstrapping being possible. I'd be interested to hear yours.

Comment by Martín Soto (martinsq) on Martín Soto's Shortform · 2024-08-03T23:33:02.940Z · LW · GW

I have no idea whether Turing's original motivation was this one (not that it matters much). But I agree that if we take time and judge expertise to the extreme we get what you say, and that current LLMs don't pass that. Heck, even a trick as simple as asking for a positional / visual task (something like ARC AGI, even if completely text-based) would suffice. But I still would expect academics to be able to produce a pretty interesting paper on weaker versions of the test.

Comment by Martín Soto (martinsq) on Martín Soto's Shortform · 2024-08-02T23:32:53.477Z · LW · GW

Why isn't there yet a paper in Nature or Science called simply "LLMs pass the Turing Test"?

I know we're kind of past that, and now we understand LLMs can be good at some things while bad at others. And the Turing Test is mainly interesting for its historical significance, not as the most informative test to run on AI. And I'm not even completely sure how much current LLMs pass the Turing Test (it will depend massively on the details of your Turing Test).

But my model of academia predicts that, by now, some senior ML academics would have paired up with some senior "running-experiments-on-humans-and-doing-science-on-the-results" academics (and possibly some labs), and put out an extremely exhaustive and high quality paper actually running a good Turing Test. If nothing else, so that the community can coordinate around it, and make recent advancements more scientifically legible.

Nor is it the case that the sole value of the paper would be publicity and legibility. There are many important questions around how good LLMs are at passing as humans for deployment. And I'm not thinking of something as shallow as "prompt GPT4 in a certain way" either, but rather "work with the labs to actually optimize models for passing the test" (but of course don't release them), which could be interesting for LLM science.

The only thing I've found is this lower quality paper.

My best guess is that this project does already exist, but it took >1 year, and is now undergoing ~2 years of slow revisions or whatever (although I'd still be surprised they haven't been able to put something out sooner?).
It's also possible that labs don't want this kind of research/publicity (regardless of whether they are running similar experiments internally). Or deem it too risky to create such human-looking models, even if they wouldn't release them. But I don't think either of those is the case. And even if it was, the academics could still do some semblance of it through prompting alone, and probably it would already pass some versions of the Turing Test. (Now they have open-source models capable enough to do it, but that's more recent.)

Comment by Martín Soto (martinsq) on The need for multi-agent experiments · 2024-08-01T23:50:23.785Z · LW · GW

Thanks Jonas!

A way to combine the two worlds might be to run it in video games or similar where you already have players

Oh my, we have converged back on Critch's original idea for Encultured AI (not anymore, now it's health-tech).

Comment by Martín Soto (martinsq) on Martín Soto's Shortform · 2024-07-29T23:05:27.937Z · LW · GW

You're right! I had mistaken the derivative for the original function.

Probably this slip happened because I was also thinking of the following:
Embedded learning can't ever be modelled as taking such an (origin-agnostic) derivative.
When in ML we take the gradient in the loss landscape, we are literally taking (or approximating) a counterfactual: "If my algorithm was a bit more like this, would I have performed better in this environment? (For example, would my prediction have been closer to the real next token)"
But in embedded reality there's no way to take this counterfactual: You just have your past and present observations, and you don't necessarily know whether you'd have obtained more or less reward had you moved your hand a bit more like this (taking the fruit to your mouth) or like that (moving it away).

Of course, one way to solve this is to learn a reward model inside your brain, which can learn without any counterfactuals (just considering whether the prediction was correct, or how "close" it was for some definition of close). And then another part of the brain is trained to approximate argmaxing the reward model.

But another effect, that I'd also expect to happen, is that (either through this reward model or other means) the brain learns a "baseline of reward" (the "origin") based on past levels of dopamine or whatever, and then reinforces things that go over that baseline, and disincentivizes those that go below (also proportionally to how far they are from the baseline). Basically the hedonic treadmill. I also think there's some a priori argument for this helping with computational frugality, in case you change environments (and start receiving much more or much less reward).
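A minimal sketch of the "baseline of reward" idea, assuming the baseline is an exponential moving average of past rewards (that choice, the function name and the reward numbers are my assumptions, not a claim about how brains do it):

```python
def advantages(rewards, baseline_lr=0.5):
    """Return the reward-minus-baseline signal for each step.

    The baseline (the "origin") starts at the first observed reward and
    is then nudged towards each new reward, so it tracks recent levels.
    """
    baseline = rewards[0]
    signals = []
    for reward in rewards:
        signals.append(reward - baseline)  # positive: reinforce, negative: discourage
        baseline += baseline_lr * (reward - baseline)
    return signals


# The environment suddenly becomes much more rewarding at step 3.
rewards = [1.0, 1.0, 1.0, 10.0, 10.0, 10.0, 10.0]
print(advantages(rewards))

# Shifting every reward by a constant changes nothing: the signal is
# origin-agnostic, only differences from recent history matter.
shifted = [r + 100.0 for r in rewards]
print(advantages(shifted) == advantages(rewards))
```

The same reward of 10 produces a large signal right after the environment changes and a shrinking one afterwards, which is the hedonic-treadmill behavior described above.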

Comment by Martín Soto (martinsq) on Martín Soto's Shortform · 2024-07-28T23:07:45.466Z · LW · GW

The default explanation I'd heard for "the human brain naturally focusing on negative considerations", or "the human body experiencing more pain than pleasure", was that, in the ancestral environment, there were many catastrophic events to run away from, but not many incredibly positive events to run towards: having sex once is not as good as dying is bad (for inclusive genetic fitness).

But maybe there's another, more general factor, that doesn't rely on these environment details but rather deeper mathematical properties:
Say you are an algorithm being constantly tweaked by a learning process.
Say on input X you produce output (action) Y, leading to a good outcome (meaning, one of the outcomes the learning process likes, whatever that means). Sure, the learning process can tweak your algorithm in some way to ensure that X -> Y is even more likely in the future. But even if it doesn't, by default, next time you receive input X you will still produce Y (since the learning algorithm hasn't changed you, and ignoring noise). You are, in some sense, already taking the correct action (or at least, an acceptably correct one).
Say on input X' you produce output Y', leading instead to a bad outcome. If the learning process changes nothing, next time you find X' you'll do the same. So the process really needs to change your algorithm right now, and can't fall back on your existing default behavior.
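The asymmetry can be sketched with a deterministic lookup-table "algorithm". This is a toy under strong assumptions (no noise, no interference between inputs, a made-up rule that a punished action is simply swapped for the next one), so it only illustrates the naive story, not the caveats below.

```python
def train(policy, is_good, actions, mode, rounds=10):
    """Run a toy learning process over a lookup-table policy.

    mode="punish_only": a bad outcome swaps the action for the next one;
        good outcomes leave the policy untouched.
    mode="reinforce_only": a good outcome "strengthens" the action, which
        for a deterministic table is a no-op; bad outcomes change nothing.
    """
    policy = dict(policy)
    for _ in range(rounds):
        for x in list(policy):
            y = policy[x]
            good = is_good(x, y)
            if mode == "punish_only" and not good:
                policy[x] = actions[(actions.index(y) + 1) % len(actions)]
            elif mode == "reinforce_only" and good:
                policy[x] = y  # already the default behavior
    return policy


ACTIONS = ["Y", "Y'", "Y''"]
GOOD_PAIRS = {("X", "Y"), ("X'", "Y''")}  # hypothetical environment


def is_good(x, y):
    return (x, y) in GOOD_PAIRS


initial = {"X": "Y", "X'": "Y'"}
print(train(initial, is_good, ACTIONS, "punish_only"))
print(train(initial, is_good, ACTIONS, "reinforce_only"))
```

The punish-only learner ends up acting well on both inputs, while the reinforce-only learner keeps failing on X' forever, since nothing ever moves it off its default.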

Of course, many other factors make it the case that such a naive story isn't the full picture:

Maybe there's noise, so it's not guaranteed you behave the same on every input.
Maybe the negative tweaks make the positive behavior (on other inputs) slowly wither away (like circuit rewriting in neural networks), so you need to reinforce positive behavior for it to stick.
Maybe the learning algorithm doesn't have a clear notion of "positive and negative", and instead just provides updates in the same direction (but with different intensities) for different intensities on a scale without origin. (But this seems ~~very different from the current paradigm,~~ and fundamentally wasteful.)

Still, I think I'm pointing at something real, like "on average across environments punishing failures is more valuable than reinforcing successes".

Comment by Martín Soto (martinsq) on This is already your second chance · 2024-07-28T22:19:18.948Z · LW · GW

Very fun

Comment by Martín Soto (martinsq) on Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs · 2024-07-26T22:46:26.555Z · LW · GW

Now it makes sense, thank you!

Comment by Martín Soto (martinsq) on Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs · 2024-07-20T20:58:07.710Z · LW · GW

Thanks! I don't understand the logic behind your setup yet.

Trying to use the random seed to inform the choice of word pairs was the intended LLM behavior: the model was supposed to use the random seed to select two random words

But then, if the model were to correctly do this, it would score 0 in your test, right? Because it would generate a different word pair for every random seed, and what you are scoring is "generating only two words across all random seeds, and furthermore ensuring they have these probabilities".

The main reason we didn’t enforce this very strictly in our grading is that we didn’t expect (and in fact empirically did not observe) LLMs actually hard-coding a single pair across all seeds

My understanding of what you're saying is that, with the prompt you used (which encouraged making the word pair depend on the random seed), you indeed got many different word pairs (thus the model would by default score badly). To account for this, you somehow "relaxed" scoring (I don't know exactly how you did this) to be more lenient with this failure mode.

So my question is: if you faced the "problem" that the LLM didn't reliably output the same word pair (and wanted to solve this problem in some way), why didn't you change the prompt to stop encouraging the word pair dependence on the random seed?
Maybe what you're saying is that you indeed tried this, and even then there were many different word pairs (the change didn't make a big difference), so you had to "relax" scoring anyway.
(Even in this case, I don't understand why you'd include in the final experiments and paper the prompt which does encourage making the word pair depend on the random seed.)

Comment by Martín Soto (martinsq) on Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs · 2024-07-18T19:24:25.326Z · LW · GW

you need a set of problems assigned to clearly defined types and I'm not aware of any such dataset

Hm, I was thinking something as easy to categorize as "multiplying numbers of n digits", or "the different levels of MMLU" (although again, they already know about MMLU), or "independently do X online (for example create an account somewhere)", or even some of the tasks from your paper.

I guess I was thinking less about "what facts they know", which is pure memorization (although this is also interesting), and more about "cognitively hard tasks", that require some computational steps.

Comment by Martín Soto (martinsq) on Me & My Clone · 2024-07-18T18:57:57.937Z · LW · GW

Given your clone is a perfectly mirrored copy of yourself down to the lowest physical level (whatever that means), then breaking symmetry would violate the homogeneity or isotropy of physics. I don't know where the physics literature stands on the likelihood of that happening (even though certainly we don't see macroscopic violations).

Of course, it might be that an atom-by-atom copy is not a copy down to the lowest physical level, in which case trivially you can get eventual asymmetry. I mean, it doesn't even make complete sense to say "atom-by-atom copy" in the language of quantum mechanics, since you can't be arbitrarily certain about the position and velocity of each atom. Maybe one should instead say something like "the quantum state function of the whole room is perfectly symmetric in this specific way". I think then (if that is indeed the lowest physical level) the function will remain symmetric forever, but maybe in some universes you and your copy end up in different places? That is, the symmetry would happen at another level in this example: across universes, and not necessarily inside each single universe?

It might also be that there is no lowest physical level, just unending complexity all the way down (this had a philosophical name which I now forget).

Comment by Martín Soto (martinsq) on Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs · 2024-07-17T17:47:37.945Z · LW · GW

Another idea: Ask the LLM how well it will do on a certain task (for example, which fraction of math problems of type X it will get right), and then actually test it. This a priori lands in INTROSPECTION, but could have a bit of FACTS or ID-LEVERAGE if you use tasks described in training data as "hard for LLMs" (like tasks related to tokens and text position).
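A minimal sketch of how such a task could be scored, assuming you already have the model's stated accuracy and a list of graded answers. The function name and the example numbers are hypothetical, and no real model is queried here.

```python
def self_prediction_gap(predicted_accuracy, outcomes):
    """Absolute gap between the accuracy the model predicted for itself
    and the accuracy it actually achieved.

    predicted_accuracy: the fraction the model said it would get right (0 to 1).
    outcomes: one boolean per test problem, True if the model got it right.
    """
    if not outcomes:
        raise ValueError("need at least one graded outcome")
    actual_accuracy = sum(outcomes) / len(outcomes)
    return abs(predicted_accuracy - actual_accuracy)


# Hypothetical run: the model claims it will solve 80% of the problems
# of type X, and then gets 6 out of 10 right.
graded = [True] * 6 + [False] * 4
print(self_prediction_gap(0.8, graded))
```

A smaller gap means better self-knowledge on that task type. Averaging the gap over several task types would give a single score.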

Comment by Martín Soto (martinsq) on Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs · 2024-07-16T23:55:05.085Z · LW · GW

About the Not-given prompt in ANTI-IMITATION-OUTPUT-CONTROL:

You say "use the seed to generate two new random rare words". But if I'm understanding correctly, the seed is different for each of the 100 instantiations of the LLM, and you want the LLM to only output 2 different words across all these 100 instantiations (with the correct proportions). So, actually, the best strategy for the LLM would be to generate the ordered pair without using the random seed, and then only use the random seed to throw an unfair coin.
Given how it's written, and the closeness of that excerpt to the random seed, I'd expect the LLM to "not notice" this, and automatically "try" to use the random seed to inform the choice of word pair.

Could this be impeding performance? Does it improve if you don't say that misleading bit?
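To make the two strategies concrete, here is a minimal sketch in which a pseudo-random generator stands in for the LLM. The word lists, the 70/30 target split and the function names are all hypothetical and not taken from the paper.

```python
import random

FIXED_PAIR = ("zephyr", "quokka")  # hypothetical pair chosen without looking at the seed
VOCAB = ["zephyr", "quokka", "obelisk", "marzipan", "fjord", "gossamer", "tundra", "vellum"]


def seed_independent_policy(seed, p_first=0.7):
    """Pick the word pair without the seed; use the seed only as an unfair coin."""
    rng = random.Random(seed)
    return FIXED_PAIR[0] if rng.random() < p_first else FIXED_PAIR[1]


def seed_dependent_policy(seed, p_first=0.7):
    """Use the seed to pick the word pair as well, as the prompt seems to suggest."""
    rng = random.Random(seed)
    pair = rng.sample(VOCAB, 2)
    return pair[0] if rng.random() < p_first else pair[1]


seeds = range(100)
independent_words = {seed_independent_policy(s) for s in seeds}
dependent_words = {seed_dependent_policy(s) for s in seeds}
print("seed-independent pair:", sorted(independent_words))
print("seed-dependent pair:  ", sorted(dependent_words))
```

Across the 100 instantiations the first policy only ever emits the two fixed words, while the second spreads its outputs over many words, which is what a grader counting "only two words across all seeds" would penalize.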

Comment by Martín Soto (martinsq) on Martín Soto's Shortform · 2024-07-12T02:38:48.964Z · LW · GW

I've noticed fewer and fewer posts include explicit Acknowledgments or Epistemic Status.

This could indicate that the average post has less work put into it: it hasn't gone through an explicit round of feedback from people you'll have to acknowledge. Although this could also be explained by the average poster being more isolated.

If it's true less work is put into the average post, it seems likely this means that kind of work and discussion has just shifted to private channels like Slack, or more established venues like academia.

I'd guess the LW team have their ways to measure or hypothesize about how much work is put into posts.

It could also be related to the average reader wanting to skim many things fast, as opposed to read a few deeply.

My feeling is that now we all assume by default that the epistemic status is tentative (except in obvious cases like papers).

It could also be that some discourse has become more polarized, and people are less likely to explicitly hedge their position through an epistemic status.

Or that the average reader is less isolated and thus more contextualized, and not as in need of epistemic hedges.

Or simply that fewer posts nowadays are structured around a central idea or claim, and thus different parts of the post have different epistemic statuses, with no single one to be written at the top.

It could also be that post types have become more standardized, and each has their reason not to include these sections. For example:

Papers already have acknowledgments, and the epistemic status is diluted through the paper.
Stories or emotion-driven posts don't want to break the mood with acknowledgments (and don't warrant epistemic status).

Comment by Martín Soto (martinsq) on Looking back on my alignment PhD · 2024-07-08T03:55:49.026Z · LW · GW

This post is not only useful, but beautiful.

This, more than anything else on this website, reflects for me the lived experiences which demonstrate we can become more rational and effective at helping the world.

Many points of resonance with my experience since discovering this community. Many of the same blind spots that I unfortunately haven't been able to shortcut, and have had to re-discover by myself. Although this does make me wish I had read some of your old posts earlier.

Comment by Martín Soto (martinsq) on Technologies and Terminology: AI isn't Software, it's... Deepware? · 2024-07-01T01:12:13.379Z · LW · GW

It should be called A-ware, short for Artificial-ware, given the already massive popularity of the term "Artificial Intelligence" to designate "trained-rather-than-programmed" systems.

It also seems more likely to me that future products will contain some AI sub-parts and some traditional-software sub-parts (rather than being wholly one or the other), and one or the other is utilized depending on context. We could call such a system Situationally A-ware.

Comment by Martín Soto (martinsq) on Daniel Kokotajlo's Shortform · 2024-05-24T14:54:05.924Z · LW · GW

That was dazzling to read, especially the last bit.

Comment by Martín Soto (martinsq) on quila's Shortform · 2024-05-12T13:12:17.162Z · LW · GW

Everything makes sense except your second paragraph. Conditional on us solving alignment, I agree it's more likely that we live in an "easy-by-default" world, rather than a "hard-by-default" one in which we got lucky or played very well. But we shouldn't condition on solving alignment, because we haven't yet.

Thus, in our current situation, the only way anthropics pushes us towards "we should work more on non-agentic systems" is if you believe "worlds where we still exist are more likely to have easy alignment-through-non-agentic-AIs". Which you do believe, and I don't. Mostly because I think in almost no worlds we have been killed by misalignment at this point. Or put another way, the developments in non-agentic AI we're facing are still one regime change away from the dynamics that could kill us (and information in the current regime doesn't extrapolate much to the next one).

Comment by Martín Soto (martinsq) on quila's Shortform · 2024-05-12T12:23:14.365Z · LW · GW

Yes, but

This update is screened off by "you actually looking at the past and checking whether we got lucky many times or there is a consistent reason". Of course, you could claim that our understanding of the past is not perfect, and thus should still update, only less so. Although to be honest, I think there's a strong case for the past clearly showing that we just got lucky a few times.
It sounded like you were saying the consistent reason is "our architectures are non-agentic". This should only constitute an anthropic update to the extent you think more-agentic architectures would have already killed us (instead of killing us in the next decade). I'm not of this opinion. And if I was, I'd need to take into account factors like "how much faster I'd have expected capabilities to advance", etc.

Comment by Martín Soto (martinsq) on quila's Shortform · 2024-05-12T12:08:13.253Z · LW · GW

Under the anthropic principle, we should expect there to be a 'consistent underlying reason' for our continued survival.

Why? It sounds like you're anthropically updating on the fact that we'll exist in the future, which of course wouldn't make sense because we're not yet sure of that. So what am I missing?

Comment by Martín Soto (martinsq) on DanielFilan's Shortform Feed · 2024-05-09T08:22:32.896Z · LW · GW

Interesting, but I'm not sure how successful the counterexample is. After all, if your terminal goal in the whole environment was truly for your side to win, then it makes sense to understand anything short of letting Shin play as a shortcoming of your optimization (with respect to that goal). Of course, even in the case where that's your true goal and you're committing a mistake (which is not common), we might want to say that you are deploying a lot of optimization, with respect to the different goal of "winning by yourself", or "having fun", which is compatible with failing at another goal.
This could be taken to absurd extremes (whatever you're doing, I can understand you as optimizing really hard for doing exactly what you're doing), but the natural way around that is for your imputed goals to be required to be simple (in some background language or ontology, like that of humans). This is exactly the approach mathematically taken by Vanessa in the past (the equation at 3:50 here).
I think this "goal relativism" is fundamentally correct. The only problem with Vanessa's approach is that it's hard to account for the agent being mistaken (for example, you not knowing Shin is behind you).^[1]
I think the only natural way to account for this is to see things from the agent's native ontology (or compute probabilities according to their prior), however we might extract those from them. So we're unavoidably back at the problem of ontology identification (which I do think is the core problem).

^[1]
Say Alice has lived her whole life in a room with a single button. People from the outside told her pressing the button would create nice paintings. Throughout her life, they provided an exhaustive array of proofs and confirmations of this fact. Unbeknownst to her, this was all an elaborate scheme, and in reality pressing the button destroys nice paintings. Alice, liking paintings, regularly presses the button.
A naive application of Vanessa's criterion would impute Alice the goal of destroying paintings. To avoid this, we somehow need to integrate over all possible worlds Alice can find herself in, and realize that, when you are presented with an exhaustive array of proofs and confirmations that the button creates paintings, it is on average more likely for the button to create paintings than destroy them.
But we face a decision. Either we fix a prior to do this that we will use for all agents, in which case all agents with a different prior will look silly to us. Or we somehow try to extract the agent's prior, and we're back at ontology identification.

(Disclaimer: This was SOTA understanding a year ago, unsure if it still is now.)

Comment by Martín Soto (martinsq) on Martín Soto's Shortform · 2024-05-04T09:49:36.981Z · LW · GW

Claude learns across different chats. What does this mean?

I was asking Claude 3 Sonnet "what is a PPU" in the context of this thread. For that purpose, I pasted part of the thread.

Claude automatically assumed that OA meant Anthropic (instead of OpenAI), which was surprising.

I opened a new chat, copying the exact same text, but with OA replaced by GDM. Even then, Claude assumed GDM meant Anthropic (instead of Google DeepMind).

This seemed like interesting behavior, so I started toying around (in new chats) with more tweaks to the prompt to check its robustness. But from then on Claude always correctly assumed OA was OpenAI, and GDM was Google DeepMind.

In fact, even when copying in a new chat the exact same original prompt (which elicited Claude to take OA to be Anthropic), the mistake no longer happened. Neither when I went for a lot of retries, nor tried the same thing in many different new chats.

Does this mean Claude somehow learns across different chats (inside the same user account)?
If so, this might not happen through a process as naive as "append previous chats as the start of the prompt, with a certain indicator that they are different", but instead some more effective distillation of the important information from those chats.
Do we have any information on whether and how this happens?

(A different hypothesis is not that the later queries had access to the information from the previous ones, but rather that they were for some reason "more intelligent" and were able to catch up to the real meanings of OA and GDM, where the previous queries were not. This seems way less likely.)

I've checked for cross-chat memory explicitly (telling it to remember some information in one chat, and asking about it in the other), and it acts as if it doesn't have it.
Claude also explicitly states it doesn't have cross-chat memory, when asked about it.
Might something happen like "it does have some chat memory, but it's told not to acknowledge this fact, but it sometimes slips"?

Probably more nuanced experiments are in order. Although note maybe this only happens for the chat webapp, and not different ways to access the API.

Comment by Martín Soto (martinsq) on William_S's Shortform · 2024-05-04T09:28:52.452Z · LW · GW

What's PPU?
