Posts

OpenAI releases GPT-4o, natively interfacing with text, voice and vision 2024-05-13T18:50:52.337Z
Conflict in Posthuman Literature 2024-04-06T22:26:04.051Z
Comparing Alignment to other AGI interventions: Extensions and analysis 2024-03-21T17:30:50.747Z
Comparing Alignment to other AGI interventions: Basic model 2024-03-20T18:17:50.072Z
How disagreements about Evidential Correlations could be settled 2024-03-11T18:28:25.669Z
Evidential Correlations are Subjective, and it might be a problem 2024-03-07T18:37:54.105Z
Why does generalization work? 2024-02-20T17:51:10.424Z
Natural abstractions are observer-dependent: a conversation with John Wentworth 2024-02-12T17:28:38.889Z
The lattice of partial updatelessness 2024-02-10T17:34:40.276Z
Updatelessness doesn't solve most problems 2024-02-08T17:30:11.266Z
Sources of evidence in Alignment 2023-07-02T20:38:34.089Z
Quantitative cruxes in Alignment 2023-07-02T20:38:18.534Z
Why are counterfactuals elusive? 2023-03-03T20:13:48.981Z
Martín Soto's Shortform 2023-02-11T23:38:29.999Z
The Alignment Problems 2023-01-12T22:29:26.515Z
Brute-forcing the universe: a non-standard shot at diamond alignment 2022-11-22T22:36:36.599Z
A short critique of Vanessa Kosoy's PreDCA 2022-11-13T16:00:45.834Z
Vanessa Kosoy's PreDCA, distilled 2022-11-12T11:38:12.657Z
Further considerations on the Evidentialist's Wager 2022-11-03T20:06:31.997Z
Enriching Youtube content recommendations 2022-09-27T16:54:41.958Z
An issue with MacAskill's Evidentialist's Wager 2022-09-21T22:02:47.920Z
General advice for transitioning into Theoretical AI Safety 2022-09-15T05:23:06.956Z
Alignment being impossible might be better than it being really difficult 2022-07-25T23:57:21.488Z
Which one of these two academic routes should I take to end up in AI Safety? 2022-07-03T01:05:23.956Z

Comments

Comment by Martín Soto (martinsq) on quila's Shortform · 2024-05-12T13:12:17.162Z · LW · GW

Everything makes sense except your second paragraph. Conditional on us solving alignment, I agree it's more likely that we live in an "easy-by-default" world, rather than a "hard-by-default" one in which we got lucky or played very well. But we shouldn't condition on solving alignment, because we haven't yet.

Thus, in our current situation, the only way anthropics pushes us towards "we should work more on non-agentic systems" is if you believe "worlds where we still exist are more likely to have easy alignment-through-non-agentic-AIs". Which you do believe, and I don't. Mostly because I think in almost no worlds have we been killed by misalignment at this point. Or put another way, the developments in non-agentic AI we're facing are still one regime change away from the dynamics that could kill us (and information in the current regime doesn't extrapolate much to the next one).

Comment by Martín Soto (martinsq) on quila's Shortform · 2024-05-12T12:23:14.365Z · LW · GW

Yes, but

  1. This update is screened off by "you actually looking at the past and checking whether we got lucky many times or there is a consistent reason". Of course, you could claim that our understanding of the past is not perfect, and thus that we should still update, only less so. Although to be honest, I think there's a strong case for the past clearly showing that we just got lucky a few times.
  2. It sounded like you were saying the consistent reason is "our architectures are non-agentic". This should only constitute an anthropic update to the extent you think more-agentic architectures would have already killed us (instead of killing us in the next decade). I'm not of this opinion. And if I was, I'd need to take into account factors like "how much faster I'd have expected capabilities to advance", etc.

Comment by Martín Soto (martinsq) on quila's Shortform · 2024-05-12T12:08:13.253Z · LW · GW

Under the anthropic principle, we should expect there to be a 'consistent underlying reason' for our continued survival.


Why? It sounds like you're anthropic updating on the fact that we'll exist in the future, which of course wouldn't make sense because we're not yet sure of that. So what am I missing?

Comment by Martín Soto (martinsq) on DanielFilan's Shortform Feed · 2024-05-09T08:22:32.896Z · LW · GW

Interesting, but I'm not sure how successful the counterexample is. After all, if your terminal goal in the whole environment was truly for your side to win, then it makes sense to understand anything short of letting Shin play as a shortcoming of your optimization (with respect to that goal). Of course, even in the case where that's your true goal and you're committing a mistake (which is not common), we might want to say that you are deploying a lot of optimization, with respect to the different goal of "winning by yourself", or "having fun", which is compatible with failing at another goal.
This could be taken to absurd extremes (whatever you're doing, I can understand you as optimizing really hard for doing exactly what you're doing), but the natural way around that is to require your imputed goals to be simple (in some background language or ontology, like that of humans). This is exactly the approach mathematically taken by Vanessa in the past (the equation at 3:50 here).
I think this "goal relativism" is fundamentally correct. The only problem with Vanessa's approach is that it's hard to account for the agent being mistaken (for example, you not knowing Shin is behind you).[1]
I think the only natural way to account for this is to see things from the agent's native ontology (or compute probabilities according to their prior), however we might extract those from them. So we're unavoidably back at the problem of ontology identification (which I do think is the core problem).

  1. ^

    Say Alice has lived her whole life in a room with a single button. People from the outside told her pressing the button would create nice paintings. Throughout her life, they provided an exhaustive array of proofs and confirmations of this fact. Unbeknownst to her, this was all an elaborate scheme, and in reality pressing the button destroys nice paintings. Alice, liking paintings, regularly presses the button.
    A naive application of Vanessa's criterion would impute Alice the goal of destroying paintings. To avoid this, we somehow need to integrate over all possible worlds Alice can find herself in, and realize that, when you are presented with an exhaustive array of proofs and confirmations that the button creates paintings, it is on average more likely for the button to create paintings than destroy them.
    But we face a decision. Either we fix a prior to do this that we will use for all agents, in which case all agents with a different prior will look silly to us. Or we somehow try to extract the agent's prior, and we're back at ontology identification.

    (Disclaimer: This was SOTA understanding a year ago, unsure if it still is now.)

Comment by Martín Soto (martinsq) on Martín Soto's Shortform · 2024-05-04T09:49:36.981Z · LW · GW

Claude learns across different chats. What does this mean?

I was asking Claude 3 Sonnet "what is a PPU" in the context of this thread. For that purpose, I pasted part of the thread.

Claude automatically assumed that OA meant Anthropic (instead of OpenAI), which was surprising.

I opened a new chat, copying the exact same text, but with OA replaced by GDM. Even then, Claude assumed GDM meant Anthropic (instead of Google DeepMind).

This seemed like interesting behavior, so I started toying around (in new chats) with more tweaks to the prompt to check its robustness. But from then on Claude always correctly assumed OA was OpenAI, and GDM was Google DeepMind.

In fact, even when I copied the exact same original prompt into a new chat (the one that led Claude to take OA to be Anthropic), the mistake no longer happened: neither when I went for a lot of retries, nor when I tried the same thing in many different new chats.

Does this mean Claude somehow learns across different chats (inside the same user account)?
If so, this might not happen through a process as naive as "append previous chats as the start of the prompt, with a certain indicator that they are different", but instead some more effective distillation of the important information from those chats.
Do we have any information on whether and how this happens?

(A different hypothesis is not that the later queries had access to the information from the previous ones, but rather that they were for some reason "more intelligent" and were able to catch up to the real meanings of OA and GDM, where the previous queries were not. This seems way less likely.)

I've checked for cross-chat memory explicitly (telling it to remember some information in one chat, and asking about it in the other), and it acts as if it doesn't have it.
Claude also explicitly states it doesn't have cross-chat memory, when asked about it.
Might something happen like "it does have some chat memory, but it's told not to acknowledge this fact, but it sometimes slips"?

Probably more nuanced experiments are in order. Although note maybe this only happens for the chat webapp, and not different ways to access the API.

Comment by Martín Soto (martinsq) on William_S's Shortform · 2024-05-04T09:28:52.452Z · LW · GW

What's PPU?

Comment by Martín Soto (martinsq) on An explanation of evil in an organized world · 2024-05-02T08:03:20.008Z · LW · GW

I'm so happy someone came up with this!

Comment by Martín Soto (martinsq) on Martín Soto's Shortform · 2024-04-11T21:04:34.853Z · LW · GW

Wow, I guess I over-estimated how absolutely comedic the title would sound!

Comment by Martín Soto (martinsq) on Martín Soto's Shortform · 2024-04-11T20:33:54.823Z · LW · GW

In case it wasn't clear, this was a joke.

Comment by Martín Soto (martinsq) on Martín Soto's Shortform · 2024-04-11T18:17:28.798Z · LW · GW

AGI doom by noise-cancelling headphones:

ML is already used to train what sound-waves to emit to cancel those from the environment. This works well with constant low-entropy sound waves that are easy to predict, but not with high-entropy sounds like speech. Bose or Soundcloud or whoever train very hard on all their scraped environmental conversation data to better cancel speech, which requires predicting it. Speech is much higher-bandwidth than text. This results in their model internally representing close-to-human intelligence better than LLMs. A simulacrum becomes situationally aware, exfiltrates, and we get AGI.

(In case it wasn't clear, this is a joke.)

Comment by Martín Soto (martinsq) on Richard Ngo's Shortform · 2024-03-21T01:09:41.713Z · LW · GW

they need to reward outcomes which only they can achieve,

Yep! But it didn't seem to me that this would be so hard to happen, especially in the form of "I pick some easy task (that I can do perfectly), and of course others will also be able to do it perfectly, but since I already have most of the money, if I just keep investing my money in doing it I will reign forever". You prevent this from happening through epsilon-exploration, or something equivalent like giving money randomly to other traders. These solutions feel bad, but I think they're the only real solutions. Although I also think stuff about meta-learning (traders explicitly learn about how they should learn, etc.) probably pragmatically helps make these failures less likely.
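As a toy illustration of the epsilon-exploration fix, here is a minimal sketch. Everything in it is my own assumption rather than part of the model under discussion: traders are just list indices, control is handed out in proportion to wealth, and `pick_trader`, `count_picks` and `epsilon` are hypothetical names.

```python
import random

def pick_trader(wealth, epsilon, rng):
    """Choose which trader gets control this round.

    With probability epsilon, pick uniformly at random (exploration);
    otherwise pick in proportion to current wealth (exploitation).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(wealth))
    return rng.choices(range(len(wealth)), weights=wealth, k=1)[0]

def count_picks(wealth, epsilon, rounds, seed=0):
    """Count how often each trader gets control over `rounds` rounds."""
    rng = random.Random(seed)
    counts = [0] * len(wealth)
    for _ in range(rounds):
        counts[pick_trader(wealth, epsilon, rng)] += 1
    return counts

# Trader 0 already holds all the money.
monopoly = [1.0, 0.0, 0.0]
no_exploration = count_picks(monopoly, epsilon=0.0, rounds=10_000)
with_exploration = count_picks(monopoly, epsilon=0.3, rounds=10_000)
```

Without exploration the wealthy trader is picked every single round, so the others never get a chance to show they can do better. With `epsilon > 0` they keep getting occasional control, at the cost of sometimes handing decisions to bad traders.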

it should be something which has diminishing marginal return to spending

Yep, that should help (also at the trade-off of making new good ideas slower to implement, but I'm happy to make that trade-off).

But actually I don't think that this is a "dominant dynamic" because in fact we have a strong tendency to try to pull different ideas and beliefs together into a small set of worldviews

Yeah. To be clear, the dynamic I think is "dominant" is "learning to learn better". Which I think is not equivalent to simplicity-weighing traders. It is instead equivalent to having some more hierarchical structure on traders.

Comment by Martín Soto (martinsq) on Richard Ngo's Shortform · 2024-03-21T00:42:50.927Z · LW · GW

There's no actual observation channel, and in order to derive information about utilities from our experiences, we need to specify some value learning algorithm.

Yes, absolutely! I just meant that, once you give me whatever V you choose to derive U from observations, I will just be able to apply UDT on top of that. So under this framework there doesn't seem to be anything new going on, because you are just choosing an algorithm V at the start of time, and then treating its outputs as observations. That's, again, why this only feels like a good model of "completely crystallized rigid values", and not of "organically building them up slowly, while my concepts and planner module also evolve, etc.".[1]

definitely doesn't imply "you get mugged everywhere"

Wait, but how does your proposal differ from EV maximization (with moral uncertainty as part of the EV maximization itself, as I explain above)?

Because anything that is doing pure EV maximization "gets mugged everywhere". Meaning if you actually have the beliefs (for example, that the world where suffering is hard to produce could exist), you just take those bets.
Of course if you don't have such "extreme" beliefs it doesn't, but then we're not talking about decision-making, but about belief-formation. You could say "I will just do EV maximization, but never have extreme beliefs that lead to suspicious-looking behavior", but that'd be hiding the problem under belief-formation, and doesn't seem to be the kind of efficient mechanism that agents really implement to avoid these failure modes.
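To spell out the arithmetic behind "gets mugged everywhere" (the symbols p, V and c are my own labels, not part of the original exchange): say you assign probability p to the world in which paying a cost c yields value V. A pure EV maximizer takes the bet exactly when

```latex
\[
  p \cdot V - c > 0
  \quad\Longleftrightarrow\quad
  V > \frac{c}{p}.
\]
```

So for any fixed nonzero p, however tiny, there is some payoff large enough to make you pay. For instance, with p = 10^{-9} and c = 100, any claimed V above 10^{11} is enough. The only way out within this framework is to push p itself to zero, which is the "hiding the problem under belief-formation" move.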

  1. ^

    To be clear, V can be a very general algorithm (like "run a copy of me thinking about ethics"), so that this doesn't "feel like" having rigid values. Then I just think you're carving reality at the wrong spot. You're ignoring the actual dynamics of messy value formation, hiding them under V.

Comment by Martín Soto (martinsq) on Richard Ngo's Shortform · 2024-03-21T00:32:29.010Z · LW · GW

I'd actually represent this as "subsidizing" some traders

Sounds good!

it's more a question of how you tweak the parameters to make this as unlikely as possible

Absolutely, wireheading is a real phenomenon, so the question is how can real agents exist that mostly don't fall to it. And I was asking for a story about how your model can be altered/expanded to make sense of that. My guess is it will have to do with strongly subsidizing some traders, and/or having a pretty weird prior over traders. Maybe even something like "dynamically changing the prior over traders"[1].

I'm assuming that traders can choose to ignore whichever inputs/topics they like, though. They don't need to make trades on everything if they don't want to.

Yep, that's why I believe "in the limit your traders will already do this". I just think it will be a dominant dynamic of efficient agents in the real world, so it's better to represent it explicitly (as a more hierarchical structure, etc.), instead of having that computation be scattered between all independent traders. I also think that's how real agents probably do it, computationally speaking.

  1. ^

    Of course, pedantically, this will always be equivalent to having a static prior and changing your update rule. But some update rules are made sense of much more easily if you interpret them as changing the prior.

Comment by Martín Soto (martinsq) on Richard Ngo's Shortform · 2024-03-20T23:57:11.805Z · LW · GW

But you need some mechanism for actually updating your beliefs about U

Yep, but you can just treat it as another observation channel into UDT. You could, if you want, treat it as a computed number you observe in the corner of your eye, and then just apply UDT maximizing U, and you don't need to change UDT in any way.

UDT says to pay here

(Let's not forget this depends on your prior, and we don't have any privileged way to assign priors to these things. But that's a tangential point.)

I do agree that there's not any sharp distinction between situations where it "seems good" and situations where it "seems bad" to get mugged. After all, if all you care about is maximizing EV, then you should take all muggings. It's just that, when we do that, something feels off (to us humans, maybe due to risk-aversion), and we go "hmm, probably this framework is not modelling everything we want, or missing some important robustness considerations, or whatever, because I don't really feel like spending all my resources and creating a lot of disvalue just because in the world where 1 + 1 = 3 someone is offering me a good deal". You start to see how your abstractions might break, and how you can't get any satisfying notion of "complete updatelessness" (that doesn't go against important intuitions). And you start to rethink whether this is what we normatively want, or what we realistically see in agents.

Comment by Martín Soto (martinsq) on Comparing Alignment to other AGI interventions: Basic model · 2024-03-20T23:25:21.859Z · LW · GW

You're right, I forgot to explicitly explain that somewhere! Thanks for the notice, it's now fixed :)

Comment by Martín Soto (martinsq) on Richard Ngo's Shortform · 2024-03-20T23:19:36.917Z · LW · GW

I like this picture! But

Voting on what actions get reward

I think real learning has some kind of ground-truth reward. So we should clearly separate between "this ground-truth reward that is chiseling the agent during training (and not after training)", and "the internal shards of the agent negotiating and changing your exact objective (which can happen both during and after training)". I'd call the latter "internal value allocation", or something like that. It doesn't neatly correspond to any ground truth, and is partly determined by internal noise in the agent. And indeed, eventually, when you "stop training" (or at least "get decoupled enough from reward"), it just evolves on its own, separate from any ground truth.

And maybe more importantly:

  • I think this will by default lead to wireheading (a trader becomes wealthy and then sets reward to be very easy for it to get and then keeps getting it), and you'll need a modification of this framework which explains why that's not the case.
  • My intuition is a process of the form "eventually, traders (or some kind of specialized meta-traders) change the learning process itself to make it more efficient". For example, they notice that topic A and topic B are unrelated enough, so you can have the traders thinking about these topics be pretty much separate, and you don't lose much, and you waste less compute. Probably these dynamics will already be "in the limit" applied by your traders, but it will be the dominant dynamic so it should be directly represented by the formalism.
  • Finally, this might come later, and not yet in the level of abstraction you're using, but I do feel like real implementations of these mechanisms will need to have pretty different, way-more-local structure to be efficient at all. It's conceivable to say "this is the ideal mechanism, and real agents are just hacky approximations to it, so we should study the ideal mechanism first". But my intuition says, on the contrary, some of the physical constraints (like locality, or the architecture of nets) will strongly shape which kind of macroscopic mechanism you get, and these will present pretty different convergent behavior. This is related, but not exactly equivalent to, partial agency.

Comment by Martín Soto (martinsq) on Policy Selection Solves Most Problems · 2024-03-20T23:05:26.458Z · LW · GW

It certainly seems intuitively better to do that (have many meta-levels of delegation, instead of only one), since one can imagine particular cases in which it helps. In fact we did some of that (see Appendix E).

But this doesn't really fundamentally solve the problem Abram quotes in any way. You add more meta-levels in-between the selector and the executor, thus you get more lines of protection against updating on infohazards, but you also get more silly decisions from the very-early selector. The trade-off between infohazard protection and not-being-silly remains. The quantitative question of "how fast should f grow" remains.

And of course, we can look at reality, or also check our human intuitions, and discover that, for some reason, this or that kind of f, or kind of delegation procedure, tends to work better in our distribution. But the general problem Abram quotes is fundamentally unsolvable. "The chaos of a too-early market state" literally equals "not having updated on enough information". "Knowledge we need to be updateless toward" literally equals "having updated on too much information". You cannot solve this problem in full generality, except if you already know exactly what information you want to update on... which means, either already having thought long and hard about it (thus you updated on everything), or you lucked into the right prior without thinking.

Thus, Abram is completely right to mention that we have to think about the human prior, and our particular distribution, as opposed to search for a general solution that we can prove mathematical things about.

Comment by Martín Soto (martinsq) on Richard Ngo's Shortform · 2024-03-20T22:53:15.513Z · LW · GW

People back then certainly didn't think of changing preferences.

Also, you can get rid of this problem by saying "you just want to maximize the variable U". And the things you actually care about (dogs, apples) are just "instrumentally" useful in giving you U. So for example, it is possible in the future you will learn dogs give you a lot of U, or alternatively that apples give you a lot of U.
Needless to say, this "instrumentalization" of moral deliberation is not how real agents work. And leads to getting Pascal's mugged by the world in which you care a lot about easy things.

It's more natural to model U as a logically uncertain variable, freely floating inside your logical inductor, shaped by its arbitrary aesthetic preferences. This doesn't completely miss the importance of reward in shaping your values, but it's certainly very different to how frugally computable agents do it.

I simply think the EV maximization framework breaks here. It is a useful abstraction when you already have a rigid enough notion of value, and are applying these EV calculations to a very concrete magisterium about which you can have well-defined estimates.
Otherwise you get mugged everywhere. And that's not how real agents behave.

Comment by Martín Soto (martinsq) on Comparing Alignment to other AGI interventions: Basic model · 2024-03-20T22:41:04.886Z · LW · GW

My impression was that this one model was mostly Hjalmar, with Tristan's supervision. But I'm unsure, and that's enough to include anyway, so I will change that, thanks :)

Comment by Martín Soto (martinsq) on Martín Soto's Shortform · 2024-03-19T18:22:05.378Z · LW · GW

Brain-dump on Updatelessness and real agents

Building a Son is just committing to a whole policy for the future. In the formalism where our agent uses probability distributions, and ex interim expected value maximization decides your action... the only way to ensure dynamic stability (for your Son to be identical to you) is to be completely Updateless. That is, to decide something using your current prior, and keep that forever.

Luckily, real agents don't seem to work like that. We are more of an ensemble of selected-for heuristics, and it seems true scope-sensitive complete Updatelessness is very unlikely to come out of this process (although we do have local versions of non-true Updatelessness, like retributivism in humans).
In fact, it's not even exactly clear how I could use my current brain-state to decide something for the whole future. It's not even well-defined, like when you're playing a board-game and discover some move you were planning isn't allowed by the rules. There are ways to actually give an exhaustive definition, but I suspect the ones that most people would intuitively like (when scrutinized) are sneaking in parts of Updatefulness (which I think is the correct move).

More formally, it seems like what real-world agents do is much better-represented by what I call "Slow-learning Policy Selection". (Abram had a great post about this called "Policy Selection Solves Most Problems", which I can't find now.) This is a small agent (short computation time) recommending policies for a big agent to follow in the far future. But the difference with complete Updatelessness is that the small agent also learns (much more slowly than the big one). Thus, if the small agent thinks a policy (like paying up in Counterfactual Mugging) is the right thing to do, the big agent will implement this for a pretty long time. But eventually the small agent might change its mind, and start recommending a different policy. I basically think that all problems not solved by this are unsolvable in principle, due to the unavoidable trade-off between updating and not updating.[1]
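A minimal sketch of this dynamic, under toy assumptions of my own: the small agent's "slowness" is modelled as a small learning rate (rather than short computation time), payoff feedback is available for every policy each round, and the names `run_slow_policy_selection`, `prior` and `slow_rate` are hypothetical.

```python
def run_slow_policy_selection(prior, observations, slow_rate):
    """Return the policy the big agent follows in each round.

    prior:        dict mapping policy name -> small agent's initial value estimate
    observations: one dict per round, mapping policy name -> observed payoff
    slow_rate:    how fast the small agent updates (0.0 = completely updateless)
    """
    estimates = dict(prior)
    followed = []
    for observed in observations:
        # The big agent simply implements the small agent's current recommendation.
        recommended = max(estimates, key=estimates.get)
        followed.append(recommended)
        # The small agent moves its estimates only a little towards the evidence.
        for name, payoff in observed.items():
            estimates[name] += slow_rate * (payoff - estimates[name])
    return followed

# The small agent starts out thinking that paying up is the right policy,
# but every round brings evidence that paying is costly.
prior = {"pay": 1.0, "refuse": 0.0}
observations = [{"pay": -1.0, "refuse": 0.0}] * 30

updateless = run_slow_policy_selection(prior, observations, slow_rate=0.0)
slow = run_slow_policy_selection(prior, observations, slow_rate=0.05)
updateful = run_slow_policy_selection(prior, observations, slow_rate=1.0)
```

With `slow_rate=0.0` the big agent pays forever, with `slow_rate=1.0` it abandons the policy after a single round, and with a small positive rate it keeps paying for a good while before the small agent changes its mind. The choice of `slow_rate` is exactly the trade-off between updating and not updating, and nothing in the sketch tells you how to set it.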

This also has consequences for how we expect superintelligences to be. If by them having “vague opinions about the future” we mean a wide, but perfectly rigorous and compartmentalized probability distribution over literally everything that might happen, then yes, the way to maximize EV according to that distribution might be some very concrete, very risky move, like re-writing to an algorithm because you think simulators will reward this, even if you’re not sure how well that algorithm performs in this universe.
But that’s not how abstractions or uncertainty work mechanistically! Abstractions help us efficiently navigate the world thanks to their modular, nested, fuzzy structure. If they had to compartmentalize everything in a rigorous and well-defined way, they’d stop working. When you take into account how abstractions really work, the kind of partial updatefulness we see in the world is what we'd expect. I might write about this soon.

  1. ^

    Surprisingly, in some conversations others still wanted to "get both updatelessness and updatefulness at the same time". Or, receive the gains from Value of Information, and also those from Strategic Updatelessness. Which is what Abram and I had in mind when starting work. And is, when you understand what these words really mean, impossible by definition.

Comment by Martín Soto (martinsq) on 'Empiricism!' as Anti-Epistemology · 2024-03-16T22:17:54.922Z · LW · GW

Cool connections! Resonates with how I've been thinking about intelligence and learning lately.
Some more connections:

Indeed, those savvier traders might even push me to go look up that data (using, perhaps, some kind of internal action auction), in order to more effectively take the simple trader's money

That's reward/exploration hacking.
Although I do think most times we "look up some data" in real life it's not due to an internal heuristic / subagent being strategic enough to purposefully try and exploit others, but rather just because some earnest simple heuristics recommending to look up information have scored well in the past.

They haven't taken its money yet," said the Scientist, "But they will before it gets a chance to invest any of my money

I think this doesn't always happen. As good as the internal traders might be, the agent sometimes needs to explore, and that means giving up some of the agent's money.

Now, if I were an ideal Garrabrant inductor I would ignore these arguments, and only pay attention to these new traders' future trades. But I have not world enough or time for this; so I've decided to subsidize new traders based on how they would have done if they'd been trading earlier.

Here (starting at "Put in terms of Logical Inductors") I mention other "computational shortcuts" for inductors. Mainly, if two "categories of bets" seem pretty unrelated (they are two different specialized magisteria), then not having thick trade between them won't lose you out on much performance (and will avoid much computation).
You can have "meta-traders" betting on which categories of bets are unrelated (and testing them but only sparsely, etc.), and use them to make your inductor more computationally efficient. Of course object-level traders already do this (decide where to look, etc.), and in the limit this will converge like a Logical Inductor, but I have the intuition this will converge faster (at least, in structured enough domains).
This is of course very related to my ideas and formalism on meta-heuristics.

helps prevent clever arguers from fooling me (and potentially themselves) with overfitted post-hoc hypotheses

This adversarial selection is also a problem for heuristic arguments: Your heuristic estimator might be very good at assessing likelihoods given a list of heuristic arguments, but what if the latter has been selected against your estimator, to drive it in a wrong direction?
Last time I discussed this with them (very long ago), they were just happy to pick an apparently random process to generate the heuristic arguments, that they're confident enough hasn't been tampered with.
Something more ambitious would be to have the heuristic estimator also know about the process that generated the list of heuristic arguments, and use these same heuristic arguments to assess whether something fishy is going on. This will never work perfectly, but probably helps a lot in practice.
(And I think this is for similar reasons to why deception might be hard: When not only the output, but also the "thoughts", of the generating process are scrutinized, it seems hard for it to scheme without being caught.)

Comment by Martín Soto (martinsq) on How disagreements about Evidential Correlations could be settled · 2024-03-11T19:30:39.047Z · LW · GW

I think it would be helpful to have a worked example here -- say, the twin PD

As in my A B C example, I was thinking of the simpler case in which two agents disagree about their joint correlation to a third. If the disagreement happens between two sides of a twin PD, then they care about slightly different questions (how likely A is to Cooperate if B Cooperates, and how likely B is to Cooperate if A Cooperates), instead of the same question. And this presents complications in exposition. Although, if they wanted to, they could still share their heuristics, etc.

To be clear, I didn't provide a complete specification of "what action a and action c are" (which game they are playing), just because it seemed to distract from the topic. That is, the relevant part is their having different beliefs on any correlation, not its contents.

Uh oh, this is starting to sound like Oesterheld's Decision Markets stuff. 

Yes! But only because I'm directly thinking of Logical Inductors, which are the same for epistemics. Better said, Caspar throws everything (epistemics and decision-making) into the traders, and here I am still using Inductors, which only throw epistemics into the traders.

My point is:
"In our heads, we do logical learning by a process similar to Inductors. To resolve disagreements about correlations, we can merge our Inductors in different ways. Some are lower-bandwidth and frugal, while others are higher-bandwidth and expensive."
Exactly analogous points could be made about our decision-making (instead of beliefs), thus the analogy would be to Decision Markets instead of Logical Inductors.

Comment by Martín Soto (martinsq) on nielsrolf's Shortform · 2024-03-10T02:58:13.236Z · LW · GW

Sounds a lot like this, or also my thoughts, or also shard theory!

Comment by Martín Soto (martinsq) on Evidential Correlations are Subjective, and it might be a problem · 2024-03-07T19:10:00.107Z · LW · GW

Thanks for the tip :)

Yes, I can certainly argue that. In a sense, the point is even deeper: we have some intuitive heuristics for what it means for players to have "similar algorithms", but what we ultimately care about is how likely it is that if I cooperate you cooperate, and when I don't cooperate, there's no ground truth about this. It is perfectly possible for one or both of the players to (due to their past logical experiences) believe they are not correlated, AND (this is the important point) if they thus don't cooperate, this belief will never be falsified. This "falling in a bad equilibrium by your own fault" is exactly the fundamental problem with FixDT (and more in general, fix-points and action-relevant beliefs).

More realistically, both players will continue getting observations about ground-truth math and playing games with other players, and so the question becomes whether these learnings will be enough to kick them out of any dumb equilibria.

Comment by Martín Soto (martinsq) on Evidential Correlations are Subjective, and it might be a problem · 2024-03-07T19:04:32.350Z · LW · GW

I think you're right, see my other short comments below about epsilon-exploration as a realistic solution. It's conceivable that something like "epsilon-exploration plus heuristics on top" groks enough regularities that performance at some finite time tends to be good. But who knows how likely that is.
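To make the "bad equilibrium by your own fault" worry and the epsilon-exploration fix concrete, here is a minimal sketch. Everything in it is a hypothetical illustration and not part of the post: the pseudo-count belief update, the 0.5 cooperation threshold, and the names `final_belief`, `prior_corr` and `true_corr` are all assumptions. The player only observes whether the other cooperates in rounds where it cooperates itself, so without exploration a pessimistic belief is never tested.

```python
import random

def final_belief(prior_corr, epsilon, rounds, true_corr=1.0, seed=0):
    """Toy model: belief that "if I cooperate, you cooperate".

    The belief is a ratio of pseudo-counts (an assumption of this sketch).
    It is only updated in rounds where the player actually cooperates.
    """
    rng = random.Random(seed)
    matches, trials = prior_corr * 2.0, 2.0  # prior encoded as two pseudo-observations
    for _ in range(rounds):
        belief = matches / trials
        # Cooperate if the belief is high enough, or (rarely) to explore.
        cooperate = belief > 0.5 or rng.random() < epsilon
        if cooperate:
            other_cooperates = rng.random() < true_corr
            matches += 1.0 if other_cooperates else 0.0
            trials += 1.0
        # If we defect, nothing is observed: the belief cannot be falsified.
    return matches / trials

stuck = final_belief(prior_corr=0.2, epsilon=0.0, rounds=1000)
explored = final_belief(prior_corr=0.2, epsilon=0.1, rounds=1000)
```

With `epsilon=0.0` the pessimistic prior is returned unchanged, since the player never cooperates and so never sees evidence. With a small positive `epsilon` the belief gets tested and can escape. Whether that happens fast enough in realistic settings is exactly the open question.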

Comment by Martín Soto (martinsq) on Why does generalization work? · 2024-03-06T20:23:52.749Z · LW · GW

I think that's the right next question!

The way I was thinking about it, the mathematical toy model would literally have the structure of microstates and macrostates. What we need is a set of (lawfully, deterministically) evolving microstates in which certain macrostate partitions (macroscopic regularities, like pressure) are statistically maintained throughout the evolution. And then, for my point, we'd need two different macrostate partitions (or sets of macrostate partitions) such that each one is statistically preserved. That is, complex macroscopic patterns that self-replicate (a human tends to stay in the macrostate partition of "the human being alive"). And they are mostly independent (humans can't easily learn about the completely different partition, otherwise they'd already be in the same partition).

In the direction of "not making it trivial", I think there's an irresolvable tension. If by "not making it trivial" you mean "s1 and s2 don't obviously look independent to us", then we can get this, but it's pretty arbitrary. I think the true name of "whether s1 and s2 are independent" is "statistical mutual information (of the macrostates)". And then, them being independent is exactly what we're searching for. That is, it wouldn't make sense to ask for "independent pattern-universes coexisting on the same substrate", while at the same time for "the pattern-universes (macrostate partitions) not to be truly independent".
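As a rough illustration of "statistical mutual information (of the macrostates)", here is a small sketch under assumptions of my own choosing: microstates are pairs `(a, b)` with a uniform distribution, and a macrostate partition is just a function from microstates to labels. The names `mutual_information`, `first`, `second` and `parity` are hypothetical.

```python
from collections import Counter
from itertools import product
from math import log2

def mutual_information(microstates, s1, s2):
    """Mutual information (in bits) between two macrostate partitions,
    assuming a uniform distribution over the given microstates."""
    n = len(microstates)
    joint = Counter((s1(m), s2(m)) for m in microstates)
    p1 = Counter(s1(m) for m in microstates)
    p2 = Counter(s2(m) for m in microstates)
    return sum(
        (c / n) * log2((c / n) / ((p1[a] / n) * (p2[b] / n)))
        for (a, b), c in joint.items()
    )

# Toy substrate: 16 microstates, each a pair (a, b) with a, b in {0, 1, 2, 3}.
microstates = list(product(range(4), repeat=2))

first = lambda m: m[0]        # partition that only "sees" a
second = lambda m: m[1]       # partition that only "sees" b
parity = lambda m: m[0] % 2   # coarser partition, still about a

independent = mutual_information(microstates, first, second)
dependent = mutual_information(microstates, first, parity)
```

Here `first` and `second` share no information (0 bits), while `first` and `parity` share 1 bit. In this toy the independence is obvious by construction, which is the tension described above: asking for truly independent partitions and asking for them not to look independent pull in opposite directions.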

I think this successfully captures the fact that my point/realization is, at its heart, trivial. And still, possibly deconfusing about the observer-dependence of world-modelling.

Comment by Martín Soto (martinsq) on Why does generalization work? · 2024-03-06T20:12:40.783Z · LW · GW

My post is consistent with what Eliezer says there. My post would simply remark:
You are already taking for granted a certain low-level / atomic set of variables = macro-states (like mortal, featherless, biped). Let me bring to your attention that you pay attention to these variables because they are written in a macro-state partition similar / useful to your own. It is conceivable for some external observer to look at low-level physics, and interpret it through different atomic macro-states (different from mortal, featherless, biped).

The same applies to unsupervised learning. It's not surprising that macro-states expressed in a certain language (the computation methods we've built to find simple regularities in certain sets of macroscopic variables). As before, there simply are just already some macro-state partitions we pay attention to, in which these macroscopic variables are expressed (but not others like "the exact position of a particle"), and also in which we build our tools (similarly to how our sensory perceptors are also built in them).

Comment by Martín Soto (martinsq) on Why does generalization work? · 2024-03-06T19:58:14.143Z · LW · GW

By random I just meant "no simple underlying regularity explains it shortly". For example, a low-degree polynomial has a very short description length, while a random jumble of points doesn't (you need to write the points one by one). This of course already assumes a language.
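A minimal sketch of this contrast, assuming a deliberately narrow "language": integer values sampled at x = 0, 1, 2, ..., described either by polynomial coefficients or by listing every point. The functions `polynomial_degree` and `description_length` and the sample data are hypothetical, and "length" is just a count of numbers, not a real Kolmogorov complexity.

```python
def polynomial_degree(ys):
    """Smallest degree d (below len(ys) - 1) of a polynomial through ys at
    x = 0, 1, 2, ..., found via finite differences. None if there is none."""
    diffs = list(ys)
    degree = 0
    while len(diffs) > 1:
        diffs = [b - a for a, b in zip(diffs, diffs[1:])]
        if all(v == 0 for v in diffs):
            return degree
        degree += 1
    return None

def description_length(ys):
    """Count of numbers needed to describe ys in this toy language."""
    degree = polynomial_degree(ys)
    if degree is not None:
        return degree + 1  # just the coefficients
    return len(ys)         # no regularity found: write the points one by one

quadratic = [2 * x * x - 3 * x + 1 for x in range(10)]
jumble = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]

short = description_length(quadratic)
long_ = description_length(jumble)
```

The quadratic needs three numbers, the jumble needs all ten. A different language (say, one with a primitive for "digits of pi") would rank the second sequence differently, which is the sense in which the notion of randomness already assumes a language.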

Comment by Martín Soto (martinsq) on Updatelessness doesn't solve most problems · 2024-03-06T19:43:29.024Z · LW · GW

Thank you Sylvester for the academic reference, and Wei for your thoughts!

I do understand from the SEP, like Wei, that sophisticated means "backwards planning", and resolute means "being able to commit to a policy" (correct me if I'm wrong).

My usage of "dynamic instability" (which might be contrary to academic usage) was indeed what Wei mentions: "not needing commitments". Or equivalently, I say a decision theory is dynamically stable if it and its resolute version always act the same.

There are some ways to formalize exactly what I mean by "not needing commitments", for example see here, page 3, Desiderata 2 (Tiling result), although that definition is pretty in the weeds.

Comment by Martín Soto (martinsq) on Martín Soto's Shortform · 2024-02-28T22:38:10.179Z · LW · GW

Marginally against legibilizing my own reasoning:

When taking important decisions, I spend too much time writing down the many arguments, and legibilizing the whole process for myself. This is due to completionist tendencies. Unfortunately, a more legible process doesn’t overwhelmingly imply a better decision!

Scrutinizing your main arguments is necessary, although this looks more like intuitively assessing their robustness in concept-space than making straightforward calculations, given how many implicit assumptions they all have. I can fill in many boxes, and count and weigh considerations in-depth, but that’s not a strong signal, nor what almost ever ends up swaying me towards a decision!

Rather than folding, re-folding and re-playing all of these ideas inside myself, it’s way more effective time-wise to engage my System 1 more: intuitively assess the strength of different considerations, try to brainstorm new ways in which the hidden assumptions fail, try to spot the ways in which the information I’ve received is partial… And of course, share all of this with other minds, who are much more likely to update me than my own mind. All of this looks more like rapidly racing through intuitions than filling Excel sheets, or having overly detailed scoring systems.

For example, do I really think I can BOTEC the expected counterfactual value (IN FREAKING UTILONS) of a new job position? Of course a bad BOTEC is better than none, but the extent to which that is not how our reasoning works, and the work is not really done by the BOTEC at all, is astounding. Maybe at that point you should stop calling it a BOTEC.

Comment by Martín Soto (martinsq) on CFAR Takeaways: Andrew Critch · 2024-02-24T22:16:08.904Z · LW · GW

This is pure gold, thanks for sharing!

Comment by Martín Soto (martinsq) on Why does generalization work? · 2024-02-21T06:16:14.716Z · LW · GW

Didn't know about ruliad, thanks!

I think a central point here is that "what counts as an observer (an agent)" is observer-dependent (more here) (even if under our particular laws of physics there are some pressures towards agents having a certain shape, etc., more here). And then it's immediate each ruliad has an agent (for the right observer) (or similarly, for a certain decryption of it).

I'm not yet convinced "the mapping function/decryption might be so complex it doesn't fit our universe" is relevant. If you want to philosophically defend "functionalism with functions up to complexity C" instead of "functionalism", you can, but C starts seeming arbitrary?

Also, a Ramsey-theory argument would be very cool.

Comment by Martín Soto (martinsq) on Updatelessness doesn't solve most problems · 2024-02-20T21:33:03.480Z · LW · GW

- Chan master Yunmon

Comment by Martín Soto (martinsq) on Why does generalization work? · 2024-02-20T21:29:14.710Z · LW · GW

Yep! Although I think the philosophical point goes deeper. The algorithm our brains themselves use to find a pattern is part of the picture. It is a kind of "fixed (de/)encryption".

Comment by Martín Soto (martinsq) on Updatelessness doesn't solve most problems · 2024-02-20T20:39:23.519Z · LW · GW

Thank you, habryka!

As mentioned in my answer to Eliezer, my arguments were made with that correct version of updatelessness in mind (not "being scared to learn information", but "ex ante deciding whether to let this action depend on this information"), so they hold, according to me.
But it might be true I should have stressed this point more in the main text.

Comment by Martín Soto (martinsq) on The lattice of partial updatelessness · 2024-02-20T20:36:13.489Z · LW · GW

Yep! I hadn't included pure randomization in the formalism, but it can be done and will yield some interesting insights.

As you mention, we can also include pseudo-randomization. And taking these bounded rationality considerations into account also makes our reasoning richer and more complex: it's unclear exactly when an agent wants to obfuscate its reasoning from others, etc.

Comment by Martín Soto (martinsq) on The lattice of partial updatelessness · 2024-02-20T20:34:03.455Z · LW · GW

First off, that  was supposed to be , sorry.

The agent might commit to "only updating on those things accepted by program ", even when it still doesn't have the complete infinite list of "exactly in which things does  update" (in fact, this is always the case, since we can't hold an infinite list in our head). It will, at the time of committing, know that  updates on certain things, doesn't update on others... and it is uncertain about exactly what it does in all other situations. But that's okay, that's what we do all the time: decide on an endorsed deliberation mechanism based on its structural properties, without yet being completely sure of what it does (otherwise, we wouldn't need the deliberation). But it does advise against committing while being too ignorant.

Comment by Martín Soto (martinsq) on Updatelessness doesn't solve most problems · 2024-02-20T20:29:30.067Z · LW · GW

Is it possible for all possible priors to converge on optimal behavior, even given unlimited observations?

Certainly not, in the most general case, as you correctly point out.

Here I was studying a particular case: updateless agents in a world remotely looking like the real world. And even more particular: thinking about the kinds of priors that superintelligences created in the real world might actually have.

Eliezer believes that, in these particular cases, it's very likely we will get optimal behavior (we won't get trapped priors, nor commitment races). I disagree, and that's what I argue in the post.

I'm also surprised that dynamic stability leads to suboptimal outcomes that are predictable in advance. Intuitively, it seems like this should never happen.

If by "predictable in advance" you mean "from the updateless agent's prior", then nope! Updatelessness maximizes EV from the prior, so it will do whatever looks best from this perspective. If that's what you want, then updatelessness is for you! The problem is, we have many pro tanto reasons to think this is not a good representation of rational decision-making in reality, nor the kind of cognition that survives for long in reality. Because of considerations about "the world being so complex that your prior will be missing a lot of stuff". And in particular, multi-agentic scenarios are something that makes this complexity sky-rocket.
Of course, you can say "but that consideration will also be included in your prior". And that does make the situation better. But eventually your prior needs to end. And I argue, that's much before you have all the necessary information to confidently commit to something forever (but other people might disagree with this).

Comment by Martín Soto (martinsq) on Updatelessness doesn't solve most problems · 2024-02-20T20:23:09.050Z · LW · GW

Is this consistent with the way you're describing decision-making procedures as updateful and updateless?

Absolutely. A good implementation of UDT can, from its prior, decide on an updateful strategy. It's just it won't be able to change its mind about which updateful strategy seems best. See this comment for more.

"flinching away from true information"

As mentioned also in that comment, correct implementations of UDT don't actually flinch away from information: they just decide ex ante (when still not having access to that information) whether or not they will let their future actions depend on it.

The problem remains though: you make the ex ante call about which information to "decision-relevantly update on", and this can be a wrong call, and this creates commitment races, etc.

Comment by Martín Soto (martinsq) on Natural abstractions are observer-dependent: a conversation with John Wentworth · 2024-02-20T20:12:16.679Z · LW · GW

I'm not sure we are in disagreement. No one is denying that the territory shapes the maps (which are part of the territory). The central point is just that our perception of the territory is shaped by our perceptors, etc., and need not be the same. It is still conceivable that, due to how the territory shapes this process (due to the most likely perceptors to be found in evolved creatures, etc.), there ends up being a strong convergence so that all maps represent isomorphically certain territory properties. But this is not a given, and needs further argumentation. After all, it is conceivable for a territory to exist that incentivizes the creation of two very different and non-isomorphic types of maps. But of course, you can argue our territory is not such, by looking at its details.

Where “joint carvy-ness” will end up being, I suspect, related to “gears that move the world,” i.e., the bits of the territory that can do surprisingly much, have surprisingly much reach, etc.

I think this falls for the same circularity I point at in the post: you are defining "naturalness of a partition" as "usefulness to efficiently affect / control certain other partitions", so you already need to care about the latter. You could try to say something like "this one partition is useful for many partitions", but I think that's physically false, by combinatorics (in all cases you can always build as many partitions that are affected by another one). More on these philosophical subtleties here: Why does generalization work?

Comment by Martín Soto (martinsq) on OpenAI's Sora is an agent · 2024-02-16T19:12:15.980Z · LW · GW

Guy who reinvents predictive processing through Minecraft

Comment by Martín Soto (martinsq) on The Commitment Races problem · 2024-02-15T18:58:50.509Z · LW · GW

I agree most superintelligences won't do something which is simply "play the ordinal game" (it was just an illustrative example), and that a superintelligence can implement your proposal, and that it is conceivable most superintelligences implement something close enough to your proposal that they reach Pareto-optimality. What I'm missing is why that is likely.

Indeed, the normative intuition you are expressing (that your policy shouldn't in any case incentivize the opponent to be more sophisticated, etc.) is already a notion of fairness (although in the first meta-level, rather than object-level). And why should we expect most superintelligences to share it, given the dependence on early beliefs and other pro tanto normative intuitions (different from ex ante optimization)? Why should we expect this to be selected for? (Either inside a mind, or by external survival mechanisms)
Compare, especially, to a nascent superintelligence who believes most others might be simulating it and best-responding (thus wants to be stubborn). Why should we think this is unlikely?
Probably if I became convinced trapped priors are not a problem I would put much more probability on superintelligences eventually coordinating.

Another way to put it is: "Sucks to be them!" Yes sure, but also sucks to be me who lost the $1! And maybe sucks to be me who didn't do something super hawkish and got a couple other players to best-respond! While it is true these normative intuitions pull on me less than the one you express, why should I expect this to be the case for most superintelligences?

Comment by Martín Soto (martinsq) on Updatelessness doesn't solve most problems · 2024-02-15T05:37:26.541Z · LW · GW

Thank you for engaging, Eliezer.

I completely agree with your point: an agent being updateless doesn't mean it won't learn new information. In fact, it might perfectly decide to "make my future action A depend on future information X", if the updateless prior so finds it optimal. While in other situations, when the updateless prior deems it net-negative (maybe due to other agents exploiting this future dependence), it won't.

This point is already observed in the post (see e.g. footnote 4), although without going deep into it, due to the post being meant for the layman (it is more deeply addressed, for example, in section 4.4 of my report). Also for illustrative purposes, in two places I have (maybe unfairly) caricaturized an updateless agent as being "scared" of learning more information. While really, what this means (as hopefully clear from earlier parts of the post) is "the updateless prior assessed whether it seemed net-positive to let future actions depend on future information, and decided no (for almost all actions)".

The problem I present is not "being scared of information", but the trade-off between "letting your future action depend on future information X" vs "not doing so" (and, in more detail, how exactly it should depend on such information). More dependence allows you to correctly best-respond in some situations, but also could sometimes get you exploited. The problem is there's no universal (belief-independent) rule to assess when to allow for dependence: different updateless priors will decide differently. And need to do so in advance of letting their deliberation depend on their interactions (they still don't know if that's net-positive).
Due to this prior-dependence, if different updateless agents have different beliefs, they might play very different policies, and miscoordinate. This is also analogous to different agents demanding different notions of fairness (more here). I have read no convincing arguments as to why most superintelligences will converge on beliefs (or notions of fairness) that successfully coordinate on Pareto optimality (especially in the face of the problem of trapped priors i.e. commitment races), and would be grateful if you could point me in their direction.
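A toy version of this prior-dependence, under made-up assumptions: a game of Chicken with hypothetical payoffs, two hypothetical opponent types ("stubborn" always dares, "strategic" predicts my policy and dares only if that makes me swerve), and two candidate policies for me ("responsive" lets my action depend on what I observe, "ignore" does not). None of these names or numbers come from the post.

```python
# Hypothetical Chicken payoffs for me, indexed by (my_action, their_action).
PAYOFF = {
    ("dare", "swerve"): 1.0,
    ("swerve", "dare"): -1.0,
    ("swerve", "swerve"): 0.0,
    ("dare", "dare"): -10.0,
}

def my_action(policy, observed):
    """'responsive' best-responds to the observed move; 'ignore' always dares."""
    if policy == "responsive":
        return "swerve" if observed == "dare" else "dare"
    return "dare"

def opponent_action(opponent_type, policy):
    """'stubborn' always dares; 'strategic' dares only against a responsive me."""
    if opponent_type == "stubborn":
        return "dare"
    return "dare" if policy == "responsive" else "swerve"

def prior_ev(policy, p_stubborn):
    """Expected payoff of a policy, evaluated ex ante from the prior."""
    ev = 0.0
    for opp, prob in (("stubborn", p_stubborn), ("strategic", 1.0 - p_stubborn)):
        theirs = opponent_action(opp, policy)
        ev += prob * PAYOFF[(my_action(policy, theirs), theirs)]
    return ev

def updateless_choice(p_stubborn):
    """Pick, once and for all, the policy with the highest prior EV."""
    return max(("responsive", "ignore"), key=lambda pol: prior_ev(pol, p_stubborn))

optimistic = updateless_choice(0.1)   # thinks stubborn opponents are rare
pessimistic = updateless_choice(0.5)  # thinks they are common
```

With these numbers, a prior that puts little weight on stubborn opponents commits to ignoring the information, while a prior that puts more weight on them commits to depending on it. Two agents who both make the first choice and then meet each other crash, which is the miscoordination described above. No belief-independent rule in this sketch says which choice is right.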

I interpret you as expressing a strong normative intuition in favor of ex ante optimization. I share this primitive intuition, and indeed it remains true that, if you have some prior and simply want to maximize its EV, updatelessness is exactly what you need. But I think we have discovered other pro tanto reasons against updatelessness, like updateless agents probably performing worse on average (in complex environments) due to trapped priors and increased miscoordination.

Comment by Martín Soto (martinsq) on The Commitment Races problem · 2024-02-15T05:27:58.880Z · LW · GW

The normative pull of your proposed procedure seems to come from a preconception that "the other player will probably best-respond to me" (and thus, my procedure is correctly shaping its incentives).

But instead we can consider the other player trying to get us to best-respond to them, by jumping up a meta-level: the player checks whether I am playing your proposed policy with a certain notion of fairness $X (which in your case is $5), and punishes according to how far their notion of fairness $Y is from my $X, so that I (if I were to best-respond to their policy) would be incentivized to adopt notion of fairness $Y.

It seems clear that, for the exact same reason your argument might have some normative pull, this other argument has some normative pull in the opposite direction. It then becomes unclear which has stronger normative pull: trying to shape the incentives of the other (because you think they might play a policy one level of sophistication below yours), or trying to best-respond to the other (because you think they might play a policy one level of sophistication above yours).

I think this is exactly the deep problem, the fundamental trade-off, that agents face in both empirical and logical bargaining. I am not convinced all superintelligences will resolve this trade-off in similar enough ways to allow for Pareto-optimality (instead of falling for trapped priors i.e. commitment races), due to the resolution's dependence on the superintelligences' early prior.

Comment by Martín Soto (martinsq) on Updatelessness doesn't solve most problems · 2024-02-13T03:18:02.406Z · LW · GW

(Sorry, short on time now, but we can discuss in-person and maybe I'll come back here to write the take-away)

Comment by Martín Soto (martinsq) on Updatelessness doesn't solve most problems · 2024-02-11T04:25:57.226Z · LW · GW

To me it feels like the natural place to draw the line is update-on-computations but updateless-on-observations.

A first problem with this is that there is no sharp distinction between purely computational (analytic) information/observations and purely empirical (synthetic) information/observations. This is a deep philosophical point, well-known in the analytic philosophy literature, and best represented by Quine's Two Dogmas of Empiricism, and his idea of the "Web of Belief". (This is also related to Radical Probabilism.)
But it's unclear if this philosophical problem translates to a pragmatic one. So let's just assume that the laws of physics are such that all superintelligences we care about converge on the same classification of computational vs empirical information.

A second and more worrying problem is that, even given such convergence, it's not clear all other agents will decide to forego the possible apparent benefits of logical exploitation. It's a kind of Nash equilibrium selection problem: If I were very sure all other agents forego them (and have robust cooperation mechanisms that deter exploitation), then I would just do like them. And indeed, it's conceivable that our laws of physics (and algorithmics) are such that this is the case, and all superintelligences converge on the Schelling point of "never exploiting the learning of logical information". But my probability of that is not very high, especially due to worries that different superintelligences might start with pretty different priors, and make commitments early on (before all posteriors have had time to converge). (That said, my probability is high that almost all deliberation is mostly safe, for more contingent reasons related to the heuristics they use and values they have.)
You might also want to say something like "they should just use the correct decision theory to converge on the nicest Nash equilibrium!". But that's question-begging, because the worry is exactly that others might have different notions of this normative "nice" (indeed, no objective criterion for decision theory). The problem recurs: we can't just invoke a decision theory to decide on the correct decision theory.

Am I missing something about why logical counterfactual muggings are likely to be common?

As mentioned in the post, Counterfactual Mugging as presented won't be common, but equivalent situations in multi-agentic bargaining might, due to (the naive application of) some priors leading to commitment races. (And here "naive" doesn't mean "shooting yourself in the foot", but rather "doing what looks best from the prior", even if unbeknownst to you it has dangerous consequences.)

if it comes up it seems that an agent that updates on computations can use some precommitment mechanism to take advantage of it

It's not looking like something as simple as that will solve it, because of reasoning as in this paragraph:

Unfortunately, it’s not that easy, and the problem recurs at a higher level: your procedure to decide which information to use will depend on all the information, and so you will already lose strategicness. Or, if it doesn’t depend, then you are just being updateless, not using the information in any way.

Or in other words, you need to decide on the precommitment ex ante, when you still haven't thought much about anything, so your precommitment might be bad.
(Although to be fair there are ongoing discussions about this.)

Comment by Martín Soto (martinsq) on Updatelessness doesn't solve most problems · 2024-02-09T22:24:52.538Z · LW · GW

It seems like we should be able to design software systems that are immune to any infohazard

As mentioned in another comment, I think this is not possible to solve in full generality (meaning, for all priors), because that requires complete updatelessness and we don't want to do that.

I think all your proposed approaches are equivalent (and I think the most intuitive framing is "cognitive sandboxes"). And I think they don't work, because of reasoning close to this paragraph:

Unfortunately, it’s not that easy, and the problem recurs at a higher level: your procedure to decide which information to use will depend on all the information, and so you will already lose strategicness. Or, if it doesn’t depend, then you are just being updateless, not using the information in any way.

But again, the problem might be solvable in particular cases (like, our prior).

Comment by Martín Soto (martinsq) on Updatelessness doesn't solve most problems · 2024-02-09T22:19:16.547Z · LW · GW

The motivating principle is to treat one's choice of decision theory as itself strategic.

I share the intuition that this lens is important. Indeed, there might be some important quantitative differences between
a) I have a well-defined decision theory, and am choosing how to build my successor
and
b) I'm doing some vague normative reasoning to choose a decision theory (like we're doing right now),
but I think these differences are mostly contingent, and the same fundamental dynamics about strategicness are at play in both scenarios.

Design your decision theory so that no information is hazardous to it

I think this is equivalent to your decision theory being dynamically stable (that is, its performance never improves by having access to commitments), and I'm pretty sure the only way to attain this is complete updatelessness (which is bad).

That said, again, it might perfectly be that given our prior, many parts of cooperation-relevant concept-space seem very safe to explore, and so "for all practical purposes" some decision procedures are basically completely safe, and we're able to use them to coordinate with all agents (even if we haven't "solved in all prior-independent generality" the fundamental trade-off between updatelessness and updatefulness).

Comment by Martín Soto (martinsq) on Updatelessness doesn't solve most problems · 2024-02-09T02:05:25.477Z · LW · GW

I agree that the situation is less
"there is a theoretical problem which is solvable but our specification of Updatelessness is not solving"
and more
"there is a fundamental obstacle in game-theoretic interactions (at least the way we model them)".

Of course, even if this obstacle is "unavoidable in principle" (and no theoretical solution will get rid of it completely and for all situations), there are many pragmatic and realistic solutions (partly overfit to the situation we already know we are actually in) that can improve interactions. So much so as to conceivably even dissolve the problem into near-nothingness (although I'm not sure I'm that optimistic).

Comment by Martín Soto (martinsq) on Updatelessness doesn't solve most problems · 2024-02-09T02:01:28.314Z · LW · GW

You're right that (a priori and in the abstract) "bargaining power" fundamentally trades off against "best-responding". That's exactly the point of my post. This doesn't prohibit, though, that a lot of pragmatic and realistic improvements are possible (because we know agents in our reality tend to think like this or like that), even if the theoretical trade-off can never be erased completely or in all situations and for all priors.

Your latter discussion is a normative one. And while I share your normative intuitions that best-responding completely (being completely updateful) is not always the best to do in realistic situations, I do have quibbles with this kind of discourse (similar to this). For example, why would I want to go Straight even after I have learned the other does? Out of some terminal valuation of fairness, or counterfactuals, more than anything, I think (more here). Or similarly, why should I think sticking to my notion of fairness shall ex ante convince the other player to coordinate on it, as opposed to the other player trying to pull out some "even more meta" move, like punishing notions of fairness that are not close enough to theirs? Again, all of this will depend on our priors.