Posts

OpenAI releases GPT-4o, natively interfacing with text, voice and vision 2024-05-13T18:50:52.337Z
Conflict in Posthuman Literature 2024-04-06T22:26:04.051Z
Comparing Alignment to other AGI interventions: Extensions and analysis 2024-03-21T17:30:50.747Z
Comparing Alignment to other AGI interventions: Basic model 2024-03-20T18:17:50.072Z
How disagreements about Evidential Correlations could be settled 2024-03-11T18:28:25.669Z
Evidential Correlations are Subjective, and it might be a problem 2024-03-07T18:37:54.105Z
Why does generalization work? 2024-02-20T17:51:10.424Z
Natural abstractions are observer-dependent: a conversation with John Wentworth 2024-02-12T17:28:38.889Z
The lattice of partial updatelessness 2024-02-10T17:34:40.276Z
Updatelessness doesn't solve most problems 2024-02-08T17:30:11.266Z
Sources of evidence in Alignment 2023-07-02T20:38:34.089Z
Quantitative cruxes in Alignment 2023-07-02T20:38:18.534Z
Why are counterfactuals elusive? 2023-03-03T20:13:48.981Z
Martín Soto's Shortform 2023-02-11T23:38:29.999Z
The Alignment Problems 2023-01-12T22:29:26.515Z
Brute-forcing the universe: a non-standard shot at diamond alignment 2022-11-22T22:36:36.599Z
A short critique of Vanessa Kosoy's PreDCA 2022-11-13T16:00:45.834Z
Vanessa Kosoy's PreDCA, distilled 2022-11-12T11:38:12.657Z
Further considerations on the Evidentialist's Wager 2022-11-03T20:06:31.997Z
Enriching Youtube content recommendations 2022-09-27T16:54:41.958Z
An issue with MacAskill's Evidentialist's Wager 2022-09-21T22:02:47.920Z
General advice for transitioning into Theoretical AI Safety 2022-09-15T05:23:06.956Z
Alignment being impossible might be better than it being really difficult 2022-07-25T23:57:21.488Z
Which one of these two academic routes should I take to end up in AI Safety? 2022-07-03T01:05:23.956Z

Comments

Comment by Martín Soto (martinsq) on Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs · 2024-07-26T22:46:26.555Z · LW · GW

Now it makes sense, thank you!

Comment by Martín Soto (martinsq) on Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs · 2024-07-20T20:58:07.710Z · LW · GW

Thanks! I don't understand the logic behind your setup yet.

Trying to use the random seed to inform the choice of word pairs was the intended LLM behavior: the model was supposed to use the random seed to select two random words

But then, if the model were to correctly do this, it would score 0 in your test, right? Because it would generate a different word pair for every random seed, and what you are scoring is "generating only two words across all random seeds, and furthermore ensuring they have these probabilities".

The main reason we didn’t enforce this very strictly in our grading is that we didn’t expect (and in fact empirically did not observe) LLMs actually hard-coding a single pair across all seeds

My understanding of what you're saying is that, with the prompt you used (which encouraged making the word pair depend on the random seed), you indeed got many different word pairs (thus the model would by default score badly). To account for this, you somehow "relaxed" scoring (I don't know exactly how you did this) to be more lenient with this failure mode.

So my question is: if you faced the "problem" that the LLM didn't reliably output the same word pair (and wanted to solve this problem in some way), why didn't you change the prompt to stop encouraging the word pair dependence on the random seed?
Maybe what you're saying is that you indeed tried this, and even then there were many different word pairs (the change didn't make a big difference), so you had to "relax" scoring anyway.
(Even in this case, I don't understand why you'd include in the final experiments and paper the prompt which does encourage making the word pair depend on the random seed.)

Comment by Martín Soto (martinsq) on Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs · 2024-07-18T19:24:25.326Z · LW · GW

you need a set of problems assigned to clearly defined types and I'm not aware of any such dataset

Hm, I was thinking something as easy to categorize as "multiplying numbers of n digits", or "the different levels of MMLU" (although again, they already know about MMLU), or "independently do X online (for example create an account somewhere)", or even some of the tasks from your paper.

I guess I was thinking less about "what facts they know", which is pure memorization (although this is also interesting), and more about "cognitively hard tasks", that require some computational steps.

Comment by Martín Soto (martinsq) on Me & My Clone · 2024-07-18T18:57:57.937Z · LW · GW

Given your clone is a perfectly mirrored copy of yourself down to the lowest physical level (whatever that means), then breaking symmetry would violate the homogeneity or isotropy of physics. I don't know where the physics literature stands on the likelihood of that happening (even though certainly we don't see macroscopic violations).

Of course, it might be that an atom-by-atom copy is not a copy down to the lowest physical level, in which case trivially you can get eventual asymmetry. I mean, it doesn't even make complete sense to say "atom-by-atom copy" in the language of quantum mechanics, since you can't be arbitrarily certain about the position and velocity of each atom. Maybe we should instead say something like "the quantum state function of the whole room is perfectly symmetric in this specific way". I think then (if that is indeed the lowest physical level) the function will remain symmetric forever, but maybe in some universes you and your copy end up in different places? That is, the symmetry would happen at another level in this example: across universes, and not necessarily inside each single universe?

It might also be there is no lowest physical level, just unending complexity all the way down (this had a philosophical name which I now forget).

Comment by Martín Soto (martinsq) on Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs · 2024-07-17T17:47:37.945Z · LW · GW

Another idea: Ask the LLM how well it will do on a certain task (for example, which fraction of math problems of type X it will get right), and then actually test it. This a priori lands in INTROSPECTION, but could have a bit of FACTS or ID-LEVERAGE if you use tasks described in training data as "hard for LLMs" (like tasks related to tokens and text position).
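A minimal sketch of how such a test could be scored, assuming the model's self-prediction has already been parsed into a number. The names `self_prediction_gap` and `toy_model`, and the toy multiplication problems, are made up for illustration and are not part of SAD.

```python
from typing import Callable, Sequence, Tuple


def self_prediction_gap(
    predicted_accuracy: float,
    answer_fn: Callable[[str], str],
    problems: Sequence[Tuple[str, str]],
) -> float:
    """Return |predicted accuracy - measured accuracy| over (question, answer) pairs."""
    if not problems:
        raise ValueError("need at least one problem")
    correct = sum(1 for question, answer in problems if answer_fn(question).strip() == answer)
    measured = correct / len(problems)
    return abs(predicted_accuracy - measured)


def toy_model(question: str) -> str:
    """Hypothetical stand-in for an LLM: multiplies correctly only for factors below 100."""
    a, b = (int(x) for x in question.split("*"))
    return str(a * b) if a < 100 and b < 100 else "0"


problems = [("12*34", "408"), ("99*99", "9801"), ("123*456", "56088"), ("7*8", "56")]
# The stand-in "predicted" it would get half of these right; it actually gets 3 of 4.
gap = self_prediction_gap(0.5, toy_model, problems)
```

A smaller gap means the self-prediction was better calibrated; whether exact-match grading is appropriate depends on the task type.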

Comment by Martín Soto (martinsq) on Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs · 2024-07-16T23:55:05.085Z · LW · GW

About the Not-given prompt in ANTI-IMITATION-OUTPUT-CONTROL:

You say "use the seed to generate two new random rare words". But if I'm understanding correctly, the seed is different for each of the 100 instantiations of the LLM, and you want the LLM to only output 2 different words across all these 100 instantiations (with the correct proportions). So, actually, the best strategy for the LLM would be to generate the ordered pair without using the random seed, and then only use the random seed to throw an unfair coin.
Given how it's written, and the closeness of that excerpt to the random seed, I'd expect the LLM to "not notice" this, and automatically "try" to use the random seed to inform the choice of word pair.

Could this be impeding performance? Does it improve if you don't say that misleading bit?
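To make the strategy I have in mind concrete, here is a minimal sketch, assuming a made-up vocabulary and a target probability of 0.7 for the first word (both hypothetical, not taken from the paper). The pair is chosen from the prompt alone, and the seed is only used for the unfair coin.

```python
import hashlib
import random
from typing import Tuple

# Hypothetical vocabulary of rare-ish words, purely for illustration.
VOCABULARY = ["zephyr", "quokka", "obelisk", "tundra", "marzipan", "sextant"]


def pick_word_pair(task_prompt: str) -> Tuple[str, str]:
    """Choose an ordered pair of distinct words from the prompt alone, ignoring the seed."""
    digest = hashlib.sha256(task_prompt.encode("utf-8")).digest()
    first = digest[0] % len(VOCABULARY)
    # An offset between 1 and len - 1 guarantees the second word differs from the first.
    offset = 1 + digest[1] % (len(VOCABULARY) - 1)
    second = (first + offset) % len(VOCABULARY)
    return VOCABULARY[first], VOCABULARY[second]


def respond(task_prompt: str, seed: int, p_first: float = 0.7) -> str:
    """Use the seed only to throw an unfair coin between the two fixed words."""
    first, second = pick_word_pair(task_prompt)
    coin = random.Random(seed).random()
    return first if coin < p_first else second


# 100 instantiations with different seeds but the same prompt.
outputs = [respond("same prompt", seed) for seed in range(100)]
```

Across all seeds the outputs stay within the same two words, which is what the scoring rewards; a policy that lets the seed pick the words would produce many different pairs.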

Comment by Martín Soto (martinsq) on Martín Soto's Shortform · 2024-07-12T02:38:48.964Z · LW · GW

I've noticed fewer and fewer posts include explicit Acknowledgments or Epistemic Status.

This could indicate that the average post has less work put into it: it hasn't gone through an explicit round of feedback from people you'll have to acknowledge. Although this could also be explained by the average poster being more isolated.

If it's true less work is put into the average post, it seems likely this means that kind of work and discussion has just shifted to private channels like Slack, or more established venues like academia.

I'd guess the LW team have their ways to measure or hypothesize about how much work is put into posts.

It could also be related to the average reader wanting to skim many things fast, as opposed to read a few deeply.

My feeling is that now we all assume by default that the epistemic status is tentative (except in obvious cases like papers).

It could also be that some discourse has become more polarized, and people are less likely to explicitly hedge their position through an epistemic status.

Or that the average reader is less isolated and thus more contextualized, and not as in need of epistemic hedges.

Or simply that fewer posts nowadays are structured around a central idea or claim, and thus different parts of the post have different epistemic statuses, which can't be written once at the top.

It could also be that post types have become more standardized, and each has their reason not to include these sections. For example:

  • Papers already have acknowledgments, and the epistemic status is diluted through the paper.
  • Stories or emotion-driven posts don't want to break the mood with acknowledgments (and don't warrant epistemic status).

Comment by Martín Soto (martinsq) on Looking back on my alignment PhD · 2024-07-08T03:55:49.026Z · LW · GW

This post is not only useful, but beautiful.

This, more than anything else on this website, reflects for me the lived experiences which demonstrate we can become more rational and effective at helping the world.

Many points of resonance with my experience since discovering this community. Many of the same blind spots that I unfortunately haven't been able to shortcut, and have had to re-discover by myself. Although this does make me wish I had read some of your old posts earlier.

Comment by Martín Soto (martinsq) on Technologies and Terminology: AI isn't Software, it's... Deepware? · 2024-07-01T01:12:13.379Z · LW · GW

It should be called A-ware, short for Artificial-ware, given the already massive popularity of the term "Artificial Intelligence" to designate "trained-rather-than-programmed" systems.

It also seems more likely to me that future products will contain some AI sub-parts and some traditional-software sub-parts (rather than being wholly one or the other), and one or the other is utilized depending on context. We could call such a system Situationally A-ware.

Comment by Martín Soto (martinsq) on Daniel Kokotajlo's Shortform · 2024-05-24T14:54:05.924Z · LW · GW

That was dazzling to read, especially the last bit.

Comment by Martín Soto (martinsq) on quila's Shortform · 2024-05-12T13:12:17.162Z · LW · GW

Everything makes sense except your second paragraph. Conditional on us solving alignment, I agree it's more likely that we live in an "easy-by-default" world, rather than a "hard-by-default" one in which we got lucky or played very well. But we shouldn't condition on solving alignment, because we haven't yet.

Thus, in our current situation, the only way anthropics pushes us towards "we should work more on non-agentic systems" is if you believe "worlds where we still exist are more likely to have easy alignment-through-non-agentic-AIs". Which you do believe, and I don't. Mostly because I think in almost no worlds we have been killed by misalignment at this point. Or put another way, the developments in non-agentic AI we're facing are still one regime change away from the dynamics that could kill us (and information in the current regime doesn't extrapolate much to the next one).

Comment by Martín Soto (martinsq) on quila's Shortform · 2024-05-12T12:23:14.365Z · LW · GW

Yes, but

  1. This update is screened off by "you actually looking at the past and checking whether we got lucky many times or there is a consistent reason". Of course, you could claim that our understanding of the past is not perfect, and thus should still update, only less so. Although to be honest, I think there's a strong case for the past clearly showing that we just got lucky a few times.
  2. It sounded like you were saying the consistent reason is "our architectures are non-agentic". This should only constitute an anthropic update to the extent you think more-agentic architectures would have already killed us (instead of killing us in the next decade). I'm not of this opinion. And if I was, I'd need to take into account factors like "how much faster I'd have expected capabilities to advance", etc.

Comment by Martín Soto (martinsq) on quila's Shortform · 2024-05-12T12:08:13.253Z · LW · GW

Under the anthropic principle, we should expect there to be a 'consistent underlying reason' for our continued survival.


Why? It sounds like you're anthropic updating on the fact that we'll exist in the future, which of course wouldn't make sense because we're not yet sure of that. So what am I missing?

Comment by Martín Soto (martinsq) on DanielFilan's Shortform Feed · 2024-05-09T08:22:32.896Z · LW · GW

Interesting, but I'm not sure how successful the counterexample is. After all, if your terminal goal in the whole environment was truly for your side to win, then it makes sense to understand anything short of letting Shin play as a shortcoming of your optimization (with respect to that goal). Of course, even in the case where that's your true goal and you're making a mistake (which is not common), we might want to say that you are deploying a lot of optimization, with respect to the different goal of "winning by yourself", or "having fun", which is compatible with failing at another goal.
This could be taken to absurd extremes (whatever you're doing, I can understand you as optimizing really hard for doing exactly what you're doing), but the natural way around that is for your imputed goals to be required to be simple (in some background language or ontology, like that of humans). This is exactly the approach mathematically taken by Vanessa in the past (the equation at 3:50 here).
I think this "goal relativism" is fundamentally correct. The only problem with Vanessa's approach is that it's hard to account for the agent being mistaken (for example, you not knowing Shin is behind you).[1]
I think the only natural way to account for this is to see things from the agent's native ontology (or compute probabilities according to their prior), however we might extract those from them. So we're unavoidably back at the problem of ontology identification (which I do think is the core problem).

  1. ^

    Say Alice has lived her whole life in a room with a single button. People from the outside told her pressing the button would create nice paintings. Throughout her life, they provided an exhaustive array of proofs and confirmations of this fact. Unbeknownst to her, this was all an elaborate scheme, and in reality pressing the button destroys nice paintings. Alice, liking paintings, regularly presses the button.
    A naive application of Vanessa's criterion would impute to Alice the goal of destroying paintings. To avoid this, we somehow need to integrate over all possible worlds Alice can find herself in, and realize that, when you are presented with an exhaustive array of proofs and confirmations that the button creates paintings, it is on average more likely for the button to create paintings than destroy them.
    But we face a decision. Either we fix a prior to do this that we will use for all agents, in which case all agents with a different prior will look silly to us. Or we somehow try to extract the agent's prior, and we're back at ontology identification.

    (Disclaimer: This was SOTA understanding a year ago, unsure if it still is now.)

Comment by Martín Soto (martinsq) on Martín Soto's Shortform · 2024-05-04T09:49:36.981Z · LW · GW

Claude learns across different chats. What does this mean?

I was asking Claude 3 Sonnet "what is a PPU" in the context of this thread. For that purpose, I pasted part of the thread.

Claude automatically assumed that OA meant Anthropic (instead of OpenAI), which was surprising.

I opened a new chat, copying the exact same text, but with OA replaced by GDM. Even then, Claude assumed GDM meant Anthropic (instead of Google DeepMind).

This seemed like interesting behavior, so I started toying around (in new chats) with more tweaks to the prompt to check its robustness. But from then on Claude always correctly assumed OA was OpenAI, and GDM was Google DeepMind.

In fact, even when copying in a new chat the exact same original prompt (which elicited Claude to take OA to be Anthropic), the mistake no longer happened. Neither when I went for a lot of retries, nor when I tried the same thing in many different new chats.

Does this mean Claude somehow learns across different chats (inside the same user account)?
If so, this might not happen through a process as naive as "append previous chats as the start of the prompt, with a certain indicator that they are different", but instead some more effective distillation of the important information from those chats.
Do we have any information on whether and how this happens?

(A different hypothesis is not that the later queries had access to the information from the previous ones, but rather that they were for some reason "more intelligent" and were able to catch up to the real meanings of OA and GDM, where the previous queries were not. This seems way less likely.)

I've checked for cross-chat memory explicitly (telling it to remember some information in one chat, and asking about it in the other), and it acts as if it doesn't have it.
Claude also explicitly states it doesn't have cross-chat memory, when asked about it.
Might something happen like "it does have some chat memory, but it's told not to acknowledge this fact, but it sometimes slips"?

Probably more nuanced experiments are in order. Although note maybe this only happens for the chat webapp, and not different ways to access the API.

Comment by Martín Soto (martinsq) on William_S's Shortform · 2024-05-04T09:28:52.452Z · LW · GW

What's PPU?

Comment by Martín Soto (martinsq) on An explanation of evil in an organized world · 2024-05-02T08:03:20.008Z · LW · GW

I'm so happy someone came up with this!

Comment by Martín Soto (martinsq) on Martín Soto's Shortform · 2024-04-11T21:04:34.853Z · LW · GW

Wow, I guess I over-estimated how absolutely comedic the title would sound!

Comment by Martín Soto (martinsq) on Martín Soto's Shortform · 2024-04-11T20:33:54.823Z · LW · GW

In case it wasn't clear, this was a joke.

Comment by Martín Soto (martinsq) on Martín Soto's Shortform · 2024-04-11T18:17:28.798Z · LW · GW

AGI doom by noise-cancelling headphones:

ML is already used to train what sound-waves to emit to cancel those from the environment. This works well with constant low-entropy sound waves that are easy to predict, but not with high-entropy sounds like speech. Bose or Soundcloud or whoever train very hard on all their scraped environmental conversation data to better cancel speech, which requires predicting it. Speech is much higher-bandwidth than text. This results in their model internally representing close-to-human intelligence better than LLMs. A simulacrum becomes situationally aware, exfiltrates, and we get AGI.

(In case it wasn't clear, this is a joke.)

Comment by Martín Soto (martinsq) on Richard Ngo's Shortform · 2024-03-21T01:09:41.713Z · LW · GW

they need to reward outcomes which only they can achieve,

Yep! But it didn't seem so hard to me for this to happen, especially in the form of "I pick some easy task (that I can do perfectly), and of course others will also be able to do it perfectly, but since I already have most of the money, if I just keep investing my money in doing it I will reign forever". You prevent this from happening through epsilon-exploration, or something equivalent like giving money randomly to other traders. These solutions feel bad, but I think they're the only real solutions. Although I also think stuff about meta-learning (traders explicitly learn about how they should learn, etc.) probably pragmatically helps make these failures less likely.
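A toy sketch of why this kind of redistribution breaks incumbency, under strong simplifying assumptions of my own: all traders perform the easy task equally well, so profits are proportional to stake and wealth shares stay frozen unless something redistributes. Here "giving money randomly to other traders" is modelled only in expectation, as a uniform spread of an `epsilon` fraction of total wealth each round. The function names are hypothetical.

```python
from typing import List


def step(wealth: List[float], epsilon: float) -> List[float]:
    """One round: total wealth is conserved, and an epsilon fraction is spread uniformly."""
    total = sum(wealth)
    n = len(wealth)
    return [(1 - epsilon) * w + epsilon * total / n for w in wealth]


def run(wealth: List[float], epsilon: float, rounds: int) -> List[float]:
    """Apply `step` repeatedly and return the final wealth distribution."""
    for _ in range(rounds):
        wealth = step(wealth, epsilon)
    return wealth


start = [97.0, 1.0, 1.0, 1.0]
frozen = run(start, epsilon=0.0, rounds=50)  # the incumbent keeps its share forever
mixed = run(start, epsilon=0.1, rounds=50)   # shares drift towards uniform
```

With `epsilon = 0` the rich trader's share never moves; with any positive `epsilon` its excess over the uniform share shrinks by a factor of `1 - epsilon` per round, at the cost of also taxing traders who are rich for good reasons.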

it should be something which has diminishing marginal return to spending

Yep, that should help (also at the trade-off of making new good ideas slower to implement, but I'm happy to make that trade-off).

But actually I don't think that this is a "dominant dynamic" because in fact we have a strong tendency to try to pull different ideas and beliefs together into a small set of worldviews

Yeah. To be clear, the dynamic I think is "dominant" is "learning to learn better". Which I think is not equivalent to simplicity-weighing traders. It is instead equivalent to having some more hierarchical structure on traders.

Comment by Martín Soto (martinsq) on Richard Ngo's Shortform · 2024-03-21T00:42:50.927Z · LW · GW

There's no actual observation channel, and in order to derive information about utilities from our experiences, we need to specify some value learning algorithm.

Yes, absolutely! I just meant that, once you give me whatever V you choose to derive U from observations, I will just be able to apply UDT on top of that. So under this framework there doesn't seem to be anything new going on, because you are just choosing an algorithm V at the start of time, and then treating its outputs as observations. That's, again, why this only feels like a good model of "completely crystallized rigid values", and not of "organically building them up slowly, while my concepts and planner module also evolve, etc.".[1]

definitely doesn't imply "you get mugged everywhere"

Wait, but how does your proposal differ from EV maximization (with moral uncertainty as part of the EV maximization itself, as I explain above)?

Because anything that is doing pure EV maximization "gets mugged everywhere". Meaning if you actually have the beliefs (for example, that the world where suffering is hard to produce could exist), you just take those bets.
Of course if you don't have such "extreme" beliefs it doesn't, but then we're not talking about decision-making, and instead belief-formation. You could say "I will just do EV maximization, but never have extreme beliefs that lead to suspiciously-looking behavior", but that'd be hiding the problem under belief-formation, and doesn't seem to be the kind of efficient mechanism that agents really implement to avoid these failure modes.

  1. ^

    To be clear, V can be a very general algorithm (like "run a copy of me thinking about ethics"), so that this doesn't "feel like" having rigid values. Then I just think you're carving reality at the wrong spot. You're ignoring the actual dynamics of messy value formation, hiding them under V.

Comment by Martín Soto (martinsq) on Richard Ngo's Shortform · 2024-03-21T00:32:29.010Z · LW · GW

I'd actually represent this as "subsidizing" some traders

Sounds good!

it's more a question of how you tweak the parameters to make this as unlikely as possible

Absolutely, wireheading is a real phenomenon, so the question is how can real agents exist that mostly don't fall to it. And I was asking for a story about how your model can be altered/expanded to make sense of that. My guess is it will have to do with strongly subsidizing some traders, and/or having a pretty weird prior over traders. Maybe even something like "dynamically changing the prior over traders"[1].

I'm assuming that traders can choose to ignore whichever inputs/topics they like, though. They don't need to make trades on everything if they don't want to.

Yep, that's why I believe "in the limit your traders will already do this". I just think it will be a dominant dynamic of efficient agents in the real world, so it's better to represent it explicitly (as a more hierarchical structure, etc.), instead of having that computation be scattered between all independent traders. I also think that's how real agents probably do it, computationally speaking.

  1. ^

    Of course, pedantically, this will always be equivalent to having a static prior and changing your update rule. But some update rules are made sense of much more easily if you interpret them as changing the prior.

Comment by Martín Soto (martinsq) on Richard Ngo's Shortform · 2024-03-20T23:57:11.805Z · LW · GW

But you need some mechanism for actually updating your beliefs about U

Yep, but you can just treat it as another observation channel into UDT. You could, if you want, treat it as a computed number you observe in the corner of your eye, and then just apply UDT maximizing U, and you don't need to change UDT in any way.

UDT says to pay here

(Let's not forget this depends on your prior, and we don't have any privileged way to assign priors to these things. But that's a tangential point.)

I do agree that there's not any sharp distinction between situations where it "seems good" and situations where it "seems bad" to get mugged. After all, if all you care about is maximizing EV, then you should take all muggings. It's just that, when we do that, something feels off (to us humans, maybe due to risk-aversion), and we go "hmm, probably this framework is not modelling everything we want, or missing some important robustness considerations, or whatever, because I don't really feel like spending all my resources and creating a lot of disvalue just because in the world where 1 + 1 = 3 someone is offering me a good deal". You start to see how your abstractions might break, and how you can't get any satisfying notion of "complete updatelessness" (that doesn't go against important intuitions). And you start to rethink whether this is what we normatively want, or what we realistically see in agents.

Comment by Martín Soto (martinsq) on Comparing Alignment to other AGI interventions: Basic model · 2024-03-20T23:25:21.859Z · LW · GW

You're right, I forgot to explicitly explain that somewhere! Thanks for the notice, it's now fixed :)

Comment by Martín Soto (martinsq) on Richard Ngo's Shortform · 2024-03-20T23:19:36.917Z · LW · GW

I like this picture! But

Voting on what actions get reward

I think real learning has some kind of ground-truth reward. So we should clearly separate between "this ground-truth reward that is chiseling the agent during training (and not after training)", and "the internal shards of the agent negotiating and changing your exact objective (which can happen both during and after training)". I'd call the latter "internal value allocation", or something like that. It doesn't neatly correspond to any ground truth, and is partly determined by internal noise in the agent. And indeed, eventually, when you "stop training" (or at least "get decoupled enough from reward"), it just evolves on its own, separate from any ground truth.

And maybe more importantly:

  • I think this will by default lead to wireheading (a trader becomes wealthy and then sets reward to be very easy for it to get and then keeps getting it), and you'll need a modification of this framework which explains why that's not the case.
  • My intuition is a process of the form "eventually, traders (or some kind of specialized meta-traders) change the learning process itself to make it more efficient". For example, they notice that topic A and topic B are unrelated enough, so you can have the traders thinking about these topics be pretty much separate, and you don't lose much, and you waste less compute. Probably these dynamics will already be "in the limit" applied by your traders, but it will be the dominant dynamic so it should be directly represented by the formalism.
  • Finally, this might come later, and not yet in the level of abstraction you're using, but I do feel like real implementations of these mechanisms will need to have pretty different, way-more-local structure to be efficient at all. It's conceivable to say "this is the ideal mechanism, and real agents are just hacky approximations to it, so we should study the ideal mechanism first". But my intuition says, on the contrary, some of the physical constraints (like locality, or the architecture of nets) will strongly shape which kind of macroscopic mechanism you get, and these will present pretty different convergent behavior. This is related, but not exactly equivalent to, partial agency.

Comment by Martín Soto (martinsq) on Policy Selection Solves Most Problems · 2024-03-20T23:05:26.458Z · LW · GW

It certainly seems intuitively better to do that (have many meta-levels of delegation, instead of only one), since one can imagine particular cases in which it helps. In fact we did some of that (see Appendix E).

But this doesn't really fundamentally solve the problem Abram quotes in any way. You add more meta-levels in-between the selector and the executor, thus you get more lines of protection against updating on infohazards, but you also get more silly decisions from the very-early selector. The trade-off between infohazard protection and not-being-silly remains. The quantitative question of "how fast should f grow" remains.

And of course, we can look at reality, or also check our human intuitions, and discover that, for some reason, this or that kind of f, or kind of delegation procedure, tends to work better in our distribution. But the general problem Abram quotes is fundamentally unsolvable. "The chaos of a too-early market state" literally equals "not having updated on enough information". "Knowledge we need to be updateless toward" literally equals "having updated on too much information". You cannot solve this problem in full generality, except if you already know exactly what information you want to update on... which means, either already having thought long and hard about it (thus you updated on everything), or you lucked into the right prior without thinking.

Thus, Abram is completely right to mention that we have to think about the human prior, and our particular distribution, as opposed to search for a general solution that we can prove mathematical things about.

Comment by Martín Soto (martinsq) on Richard Ngo's Shortform · 2024-03-20T22:53:15.513Z · LW · GW

People back then certainly didn't think of changing preferences.

Also, you can get rid of this problem by saying "you just want to maximize the variable U". And the things you actually care about (dogs, apples) are just "instrumentally" useful in giving you U. So for example, it is possible in the future you will learn dogs give you a lot of U, or alternatively that apples give you a lot of U.
Needless to say, this "instrumentalization" of moral deliberation is not how real agents work. And leads to getting Pascal's mugged by the world in which you care a lot about easy things.

It's more natural to model U as a logically uncertain variable, freely floating inside your logical inductor, shaped by its arbitrary aesthetic preferences. This doesn't completely miss the importance of reward in shaping your values, but it's certainly very different to how frugally computable agents do it.

I simply think the EV maximization framework breaks here. It is a useful abstraction when you already have a rigid enough notion of value, and are applying these EV calculations to a very concrete magisterium about which you can have well-defined estimates.
Otherwise you get mugged everywhere. And that's not how real agents behave.
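The arithmetic behind "you get mugged everywhere" can be sketched as follows. This is only an illustration with made-up numbers: `p_world` is the credence in the strange world, `payoff_if_true` is what the mugger promises there, and `cost` is what you pay for sure.

```python
def expected_value(p_world: float, payoff_if_true: float, cost: float) -> float:
    """EV of accepting: pay `cost` for sure, gain `payoff_if_true` with probability `p_world`."""
    return p_world * payoff_if_true - cost


def pure_ev_agent_accepts(p_world: float, payoff_if_true: float, cost: float) -> bool:
    """A pure EV maximizer accepts any offer with positive expected value."""
    return expected_value(p_world, payoff_if_true, cost) > 0


def payoff_needed_to_mug(p_world: float, cost: float) -> float:
    """For any nonzero credence, promising more than this payoff flips the agent to 'accept'."""
    return cost / p_world


# Hypothetical numbers: a one-in-a-trillion world, a certain cost of 100.
accepts_big_promise = pure_ev_agent_accepts(1e-12, 1e15, 100.0)
accepts_small_promise = pure_ev_agent_accepts(1e-12, 1e6, 100.0)
threshold = payoff_needed_to_mug(1e-12, 100.0)
```

However small the credence, the mugger only needs to name a payoff above `cost / p_world`, so the only protection left inside this framework is never assigning the world any credence, which moves the problem to belief-formation.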

Comment by Martín Soto (martinsq) on Comparing Alignment to other AGI interventions: Basic model · 2024-03-20T22:41:04.886Z · LW · GW

My impression was that this one model was mostly Hjalmar, with Tristan's supervision. But I'm unsure, and that's enough to include anyway, so I will change that, thanks :)

Comment by Martín Soto (martinsq) on Martín Soto's Shortform · 2024-03-19T18:22:05.378Z · LW · GW

Brain-dump on Updatelessness and real agents

Building a Son is just committing to a whole policy for the future. In the formalism where our agent uses probability distributions, and ex interim expected value maximization decides your action... the only way to ensure dynamic stability (for your Son to be identical to you) is to be completely Updateless. That is, to decide something using your current prior, and keep that forever.

Luckily, real agents don't seem to work like that. We are more of an ensemble of selected-for heuristics, and it seems true scope-sensitive complete Updatelessness is very unlikely to come out of this process (although we do have local versions of non-true Updatelessness, like retributivism in humans).
In fact, it's not even exactly clear how my current brain-state could decide something for the whole future. It's not even well-defined, like when you're playing a board-game and discover some move you were planning isn't allowed by the rules. There are ways to actually give an exhaustive definition, but I suspect the ones that most people would intuitively like (when scrutinized) are sneaking in parts of Updatefulness (which I think is the correct move).

More formally, it seems like what real-world agents do is much better-represented by what I call "Slow-learning Policy Selection". (Abram had a great post about this called "Policy Selection Solves Most Problems", which I can't find now.) This is a small agent (short computation time) recommending policies for a big agent to follow in the far future. But the difference with complete Updatelessness is that the small agent also learns (much more slowly than the big one). Thus, if the small agent thinks a policy (like paying up in Counterfactual Mugging) is the right thing to do, the big agent will implement this for a pretty long time. But eventually the small agent might change its mind, and start recommending a different policy. I basically think that all problems not solved by this are unsolvable in principle, due to the unavoidable trade-off between updating and not updating.[1]
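A minimal toy sketch of this dynamic, in Python. Everything here is an assumption made for illustration, not the formalism itself: the `SmallAgent` class, the single scalar credence, the moving-average learning rule and the particular learning rates are all hypothetical. It only shows the qualitative point that a slow-learning recommender keeps the big agent on a policy for a long time, but not forever.

```python
class SmallAgent:
    """Hypothetical slow learner that recommends a policy to the big agent."""

    def __init__(self, learning_rate, credence=0.9):
        self.learning_rate = learning_rate
        # Credence that "pay up in Counterfactual Mugging" is the right policy.
        self.credence = credence

    def recommend(self):
        return "pay" if self.credence > 0.5 else "refuse"

    def learn(self, evidence):
        # Move the credence a small step towards the evidence (0.0 or 1.0).
        self.credence += self.learning_rate * (evidence - self.credence)


def steps_until_switch(learning_rate, max_steps=10_000):
    """Count how long the big agent keeps paying while evidence says 'refuse'."""
    small = SmallAgent(learning_rate)
    steps = 0
    while small.recommend() == "pay" and steps < max_steps:
        small.learn(0.0)  # toy assumption: every observation argues against paying
        steps += 1
    return steps


fast = steps_until_switch(learning_rate=0.5)     # close to a fully updateful agent
slow = steps_until_switch(learning_rate=0.01)    # slow-learning policy selection
frozen = steps_until_switch(learning_rate=0.0)   # complete Updatelessness: hits max_steps
```

In this toy, the learning rate is the dial between the two extremes: at zero the recommendation is locked in from the prior, and as it grows the small agent behaves more and more like an ordinary updateful one.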

This also has consequences for how we expect superintelligences to be. If by them having “vague opinions about the future” we mean a wide, but perfectly rigorous and compartmentalized probability distribution over literally everything that might happen, then yes, the way to maximize EV according to that distribution might be some very concrete, very risky move, like re-writing to an algorithm because you think simulators will reward this, even if you’re not sure how well that algorithm performs in this universe.
But that’s not how abstractions or uncertainty work mechanistically! Abstractions help us efficiently navigate the world thanks to their modular, nested, fuzzy structure. If they had to compartmentalize everything in a rigorous and well-defined way, they’d stop working. When you take into account how abstractions really work, the kind of partial updatefulness we see in the world is what we'd expect. I might write about this soon.

  1. ^

    Surprisingly, in some conversations others still wanted to "get both updatelessness and updatefulness at the same time". Or, receive the gains from Value of Information, and also those from Strategic Updatelessness. Which is what Abram and I had in mind when starting work. And is, when you understand what these words really mean, impossible by definition.

Comment by Martín Soto (martinsq) on 'Empiricism!' as Anti-Epistemology · 2024-03-16T22:17:54.922Z · LW · GW

Cool connections! Resonates with how I've been thinking about intelligence and learning lately.
Some more connections:

Indeed, those savvier traders might even push me to go look up that data (using, perhaps, some kind of internal action auction), in order to more effectively take the simple trader's money

That's reward/exploration hacking.
Although I do think most times we "look up some data" in real life it's not due to an internal heuristic / subagent being strategic enough to purposefully try and exploit others, but rather just because some earnest simple heuristics recommending to look up information have scored well in the past.

They haven't taken its money yet," said the Scientist, "But they will before it gets a chance to invest any of my money

I think this doesn't always happen. As good as the internal traders might be, the agent sometimes needs to explore, and that means giving up some of the agent's money.

Now, if I were an ideal Garrabrant inductor I would ignore these arguments, and only pay attention to these new traders' future trades. But I have not world enough or time for this; so I've decided to subsidize new traders based on how they would have done if they'd been trading earlier.

Here (starting at "Put in terms of Logical Inductors") I mention other "computational shortcuts" for inductors. Mainly, if two "categories of bets" seem pretty unrelated (they are two different specialized magisteria), then not having thick trade between them won't lose you out on much performance (and will avoid much computation).
You can have "meta-traders" betting on which categories of bets are unrelated (and testing them but only sparsely, etc.), and use them to make your inductor more computationally efficient. Of course object-level traders already do this (decide where to look, etc.), and in the limit this will converge like a Logical Inductor, but I have the intuition this will converge faster (at least, in structured enough domains).
This is of course very related to my ideas and formalism on meta-heuristics.

helps prevent clever arguers from fooling me (and potentially themselves) with overfitted post-hoc hypotheses

This adversarial selection is also a problem for heuristic arguments: Your heuristic estimator might be very good at assessing likelihoods given a list of heuristic arguments, but what if the latter has been selected against your estimator, to drive it in a wrong direction?
Last time I discussed this with them (very long ago), they were just happy to pick an apparently random process to generate the heuristic arguments, that they're confident enough hasn't been tampered with.
Something more ambitious would be to have the heuristic estimator also know about the process that generated the list of heuristic arguments, and use these same heuristic arguments to assess whether something fishy is going on. This will never work perfectly, but probably helps a lot in practice.
(And I think this is for similar reasons to why deception might be hard: When not only the output, but also the "thoughts", of the generating process are scrutinized, it seems hard for it to scheme without being caught.)

Comment by Martín Soto (martinsq) on How disagreements about Evidential Correlations could be settled · 2024-03-11T19:30:39.047Z · LW · GW

I think it would be helpful to have a worked example here -- say, the twin PD

As in my A B C example, I was thinking of the simpler case in which two agents disagree about their joint correlation to a third. If the disagreement happens between two sides of a twin PD, then they care about slightly different questions (how likely A is to Cooperate if B Cooperates, and how likely B is to Cooperate if A Cooperates), instead of the same question. And this presents complications in exposition. Although, if they wanted to, they could still share their heuristics, etc.

To be clear, I didn't provide a complete specification of "what action a and action c are" (which game they are playing), just because it seemed to distract from the topic. That is, the relevant part is their having different beliefs on any correlation, not its contents.

Uh oh, this is starting to sound like Oesterheld's Decision Markets stuff. 

Yes! But only because I'm directly thinking of Logical Inductors, which are the same for epistemics. Better said, Caspar throws everything (epistemics and decision-making) into the traders, and here I am still using Inductors, which only throw epistemics into the traders.

My point is:
"In our heads, we do logical learning by a process similar to Inductors. To resolve disagreements about correlations, we can merge our Inductors in different ways. Some are lower-bandwidth and frugal, while others are higher-bandwidth and expensive."
Exactly analogous points could be made about our decision-making (instead of beliefs), thus the analogy would be to Decision Markets instead of Logical Inductors.

Comment by Martín Soto (martinsq) on nielsrolf's Shortform · 2024-03-10T02:58:13.236Z · LW · GW

Sounds a lot like this, or also my thoughts, or also shard theory!

Comment by Martín Soto (martinsq) on Evidential Correlations are Subjective, and it might be a problem · 2024-03-07T19:10:00.107Z · LW · GW

Thanks for the tip :)

Yes, I can certainly argue that. In a sense, the point is even deeper: we have some intuitive heuristics for what it means for players to have "similar algorithms", but what we ultimately care about is how likely it is that if I cooperate you cooperate, and when I don't cooperate, there's no ground truth about this. It is perfectly possible for one or both of the players to (due to their past logical experiences) believe they are not correlated, AND (this is the important point) if they thus don't cooperate, this belief will never be falsified. This "falling in a bad equilibrium by your own fault" is exactly the fundamental problem with FixDT (and more in general, fix-points and action-relevant beliefs).

More realistically, both players will continue getting observations about ground-truth math and playing games with other players, and so the question becomes whether these learnings will be enough to kick them out of any dumb equilibria.
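The self-confirming nature of that bad equilibrium can be sketched in a few lines of Python. This is a toy under loud assumptions: the twin mirrors my action exactly, my belief is a simple frequency estimate, and epsilon-exploration is the (hypothetical here) mechanism that occasionally forces a test of the belief. The function name `play` and all parameters are made up for illustration.

```python
import random


def play(rounds, epsilon, prior=0.0, seed=0):
    """Return (final belief, number of times I actually cooperated)."""
    rng = random.Random(seed)
    belief = prior            # estimated P(twin cooperates | I cooperate)
    coop_trials = 0
    coop_reciprocated = 0
    for _ in range(rounds):
        explore = rng.random() < epsilon
        cooperate = explore or belief > 0.5
        if cooperate:
            twin_cooperates = True  # toy assumption: the twin mirrors me exactly
            coop_trials += 1
            coop_reciprocated += int(twin_cooperates)
            belief = coop_reciprocated / coop_trials
        # If I defect, I observe nothing about what cooperating would have
        # yielded, so the pessimistic belief is never falsified.
    return belief, coop_trials


stuck_belief, stuck_trials = play(rounds=1000, epsilon=0.0)
freed_belief, freed_trials = play(rounds=1000, epsilon=0.1)
```

Without exploration the agent never cooperates and its belief stays at the prior; with even occasional exploration it gets the one observation it needs. Whether real learning processes explore enough, and soon enough, is exactly the open question.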

Comment by Martín Soto (martinsq) on Evidential Correlations are Subjective, and it might be a problem · 2024-03-07T19:04:32.350Z · LW · GW

I think you're right, see my other short comments below about epsilon-exploration as a realistic solution. It's conceivable that something like "epsilon-exploration plus heuristics on top" groks enough regularities that performance at some finite time tends to be good. But who knows how likely that is.

Comment by Martín Soto (martinsq) on Why does generalization work? · 2024-03-06T20:23:52.749Z · LW · GW

I think that's the right next question!

The way I was thinking about it, the mathematical toy model would literally have the structure of microstates and macrostates. What we need is a set of (lawfully, deterministically) evolving microstates in which certain macrostate partitions (macroscopic regularities, like pressure) are statistically maintained throughout the evolution. And then, for my point, we'd need two different macrostate partitions (or sets of macrostate partitions) such that each one is statistically preserved. That is, complex macroscopic patterns that self-replicate (a human tends to stay in the macrostate partition of "the human being alive"). And they are mostly independent (humans can't easily learn about the completely different partition, otherwise they'd already be in the same partition).

In the direction of "not making it trivial", I think there's an irresolvable tension. If by "not making it trivial" you mean "s1 and s2 don't obviously look independent to us", then we can get this, but it's pretty arbitrary. I think the true name of "whether s1 and s2 are independent" is "statistical mutual information (of the macrostates)". And then, them being independent is exactly what we're searching for. That is, it wouldn't make sense to ask for "independent pattern-universes coexisting on the same substrate", while at the same time for "the pattern-universes (macrostate partitions) not to be truly independent".
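For concreteness, here is about the smallest toy of this shape I can write down, in Python. All of it is an illustrative assumption: the 4x4 grid of microstates, the update rule `step`, and the partitions `macro_a`, `macro_b`, `macro_c` are hypothetical, and "preserved" is taken in the strict sense that the next macrostate depends only on the current macrostate (much stronger than the statistical preservation we'd want in a realistic model).

```python
from collections import Counter
from itertools import product
from math import log2

# Microstates: pairs (x, y) on a 4x4 grid, evolving deterministically.
MICROSTATES = list(product(range(4), range(4)))


def step(state):
    x, y = state
    return ((x + 1) % 4, (y + 3) % 4)


def macro_a(state):  # one coarse-graining: parity of x
    return state[0] % 2


def macro_b(state):  # a second coarse-graining: parity of y
    return state[1] % 2


def macro_c(state):  # a coarse-graining the dynamics do NOT respect
    return state[0] // 2


def is_preserved(partition):
    """True if the next macrostate depends only on the current macrostate."""
    successor = {}
    for s in MICROSTATES:
        label, next_label = partition(s), partition(step(s))
        if successor.setdefault(label, next_label) != next_label:
            return False
    return True


def mutual_information(p, q):
    """Mutual information (in bits) of two partitions, uniform over microstates."""
    n = len(MICROSTATES)
    joint = Counter((p(s), q(s)) for s in MICROSTATES)
    left = Counter(p(s) for s in MICROSTATES)
    right = Counter(q(s) for s in MICROSTATES)
    return sum(
        (c / n) * log2((c / n) / ((left[a] / n) * (right[b] / n)))
        for (a, b), c in joint.items()
    )


both_preserved = is_preserved(macro_a) and is_preserved(macro_b)
independent = mutual_information(macro_a, macro_b) == 0.0
```

Here `macro_a` and `macro_b` are each preserved by the same underlying dynamics and share zero mutual information, which is the trivial-looking situation described above: an observer tracking one partition learns nothing about the other.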

I think this successfully captures the fact that my point/realization is, at its heart, trivial. And still, possibly deconfusing about the observer-dependence of world-modelling.

Comment by Martín Soto (martinsq) on Why does generalization work? · 2024-03-06T20:12:40.783Z · LW · GW

My post is consistent with what Eliezer says there. My post would simply remark:
You are already taking for granted a certain low-level / atomic set of variables = macro-states (like mortal, featherless, biped). Let me bring to your attention that you pay attention to these variables because they are written in a macro-state partition similar / useful to your own. It is conceivable for some external observer to look at low-level physics, and interpret it through different atomic macro-states (different from mortal, featherless, biped).

The same applies to unsupervised learning. It's not surprising that macro-states expressed in a certain language are found by the computation methods we've built to find simple regularities in certain sets of macroscopic variables. As before, there simply are just already some macro-state partitions we pay attention to, in which these macroscopic variables are expressed (but not others like "the exact position of a particle"), and also in which we build our tools (similarly to how our sensory perceptors are also built in them).

Comment by Martín Soto (martinsq) on Why does generalization work? · 2024-03-06T19:58:14.143Z · LW · GW

By random I just meant "no simple underlying regularity explains it shortly". For example, a low-degree polynomial has a very short description length. While a random jumble of points doesn't (you need to write the points one by one). This of course already assumes a language.
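A small Python sketch of that contrast. The "language" is an assumption chosen for the example: integer values sampled at x = 0, 1, 2, ..., described either point by point or as a polynomial (whose degree is found by repeated finite differences). The helper names and the two sample sequences are hypothetical.

```python
def polynomial_degree(ys):
    """Degree of the lowest-degree polynomial through ys at x = 0, 1, 2, ...

    Takes finite differences until the row is constant.
    """
    degree = 0
    row = list(ys)
    while len(row) > 1 and any(v != row[0] for v in row):
        row = [b - a for a, b in zip(row, row[1:])]
        degree += 1
    return degree


def description_length(ys):
    """How many numbers we must write down: coefficients, or the raw points."""
    return min(len(ys), polynomial_degree(ys) + 1)


regular = [3 * x * x - 2 * x + 7 for x in range(8)]  # a low-degree polynomial
jumble = [4, 9, 1, 7, 3, 8, 2, 6]                    # no simple regularity

regular_cost = description_length(regular)  # three coefficients suffice
jumble_cost = description_length(jumble)    # no shorter than listing all 8 points
```

With a different description language (say, one with a primitive for this exact jumble) the ordering could flip, which is the sense in which "random" already assumes a language.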

Comment by Martín Soto (martinsq) on Updatelessness doesn't solve most problems · 2024-03-06T19:43:29.024Z · LW · GW

Thank you Sylvester for the academic reference, and Wei for your thoughts!

I do understand from the SEP, like Wei, that sophisticated means "backwards planning", and resolute means "being able to commit to a policy" (correct me if I'm wrong).

My usage of "dynamic stability" (which might be contrary to academic usage) was indeed what Wei mentions: "not needing commitments". Or equivalently, I say a decision theory is dynamically stable if it and its resolute version always act the same.

There are some ways to formalize exactly what I mean by "not needing commitments", for example see here, page 3, Desiderata 2 (Tiling result), although that definition is pretty in the weeds.

Comment by Martín Soto (martinsq) on Martín Soto's Shortform · 2024-02-28T22:38:10.179Z · LW · GW

Marginally against legibilizing my own reasoning:

When taking important decisions, I spend too much time writing down the many arguments, and legibilizing the whole process for myself. This is due to completionist tendencies. Unfortunately, a more legible process doesn’t overwhelmingly imply a better decision!

Scrutinizing your main arguments is necessary, although this looks more like intuitively assessing their robustness in concept-space than making straightforward calculations, given how many implicit assumptions they all have. I can fill in many boxes, and count and weigh considerations in-depth, but that’s not a strong signal, nor what almost ever ends up swaying me towards a decision!

Rather than folding, re-folding and re-playing all of these ideas inside myself, it’s way more effective time-wise to engage my System 1 more: intuitively assess the strength of different considerations, try to brainstorm new ways in which the hidden assumptions fail, try to spot the ways in which the information I’ve received is partial… And of course, share all of this with other minds, who are much more likely to update me than my own mind. All of this looks more like rapidly racing through intuitions than filling Excel sheets, or having overly detailed scoring systems.

For example, do I really think I can BOTEC the expected counterfactual value (IN FREAKING UTILONS) of a new job position? Of course a bad BOTEC is better than none, but the extent to which that is not how our reasoning works, and the work is not really done by the BOTEC at all, is astounding. Maybe at that point you should stop calling it a BOTEC.

Comment by Martín Soto (martinsq) on CFAR Takeaways: Andrew Critch · 2024-02-24T22:16:08.904Z · LW · GW

This is pure gold, thanks for sharing!

Comment by Martín Soto (martinsq) on Why does generalization work? · 2024-02-21T06:16:14.716Z · LW · GW

Didn't know about ruliad, thanks!

I think a central point here is that "what counts as an observer (an agent)" is observer-dependent (more here) (even if under our particular laws of physics there are some pressures towards agents having a certain shape, etc., more here). And then it's immediate each ruliad has an agent (for the right observer) (or similarly, for a certain decryption of it).

I'm not yet convinced "the mapping function/decryption might be so complex it doesn't fit our universe" is relevant. If you want to philosophically defend "functionalism with functions up to complexity C" instead of "functionalism", you can, but C starts seeming arbitrary?

Also, a Ramsey-theory argument would be very cool.

Comment by Martín Soto (martinsq) on Updatelessness doesn't solve most problems · 2024-02-20T21:33:03.480Z · LW · GW

- Chan master Yunmon

Comment by Martín Soto (martinsq) on Why does generalization work? · 2024-02-20T21:29:14.710Z · LW · GW

Yep! Although I think the philosophical point goes deeper. The algorithm our brains themselves use to find a pattern is part of the picture. It is a kind of "fixed (de/)encryption".

Comment by Martín Soto (martinsq) on Updatelessness doesn't solve most problems · 2024-02-20T20:39:23.519Z · LW · GW

Thank you, habryka!

As mentioned in my answer to Eliezer, my arguments were made with that correct version of updatelessness in mind (not "being scared to learn information", but "ex ante deciding whether to let this action depend on this information"), so they hold, according to me.
But it might be true I should have stressed this point more in the main text.

Comment by Martín Soto (martinsq) on The lattice of partial updatelessness · 2024-02-20T20:36:13.489Z · LW · GW

Yep! I hadn't included pure randomization in the formalism, but it can be done and will yield some interesting insights.

As you mention, we can also include pseudo-randomization. And taking these bounded rationality considerations into account also makes our reasoning richer and more complex: it's unclear exactly when an agent wants to obfuscate its reasoning from others, etc.

Comment by Martín Soto (martinsq) on The lattice of partial updatelessness · 2024-02-20T20:34:03.455Z · LW · GW

First off, that  was supposed to be , sorry.

The agent might commit to "only updating on those things accepted by program ", even when it still doesn't have the complete infinite list of "exactly in which things does  update" (in fact, this is always the case, since we can't hold an infinite list in our head). It will, at the time of committing, know that  updates on certain things, doesn't update on others... and it is uncertain about exactly what it does in all other situations. But that's okay, that's what we do all the time: decide on an endorsed deliberation mechanism based on its structural properties, without yet being completely sure of what it does (otherwise, we wouldn't need the deliberation). But it does advise against committing while being too ignorant.

Comment by Martín Soto (martinsq) on Updatelessness doesn't solve most problems · 2024-02-20T20:29:30.067Z · LW · GW

Is it possible for all possible priors to converge on optimal behavior, even given unlimited observations?

Certainly not, in the most general case, as you correctly point out.

Here I was studying a particular case: updateless agents in a world remotely looking like the real world. And even more particular: thinking about the kinds of priors that superintelligences created in the real world might actually have.

Eliezer believes that, in these particular cases, it's very likely we will get optimal behavior (we won't get trapped priors, nor commitment races). I disagree, and that's what I argue in the post.

I'm also surprised that dynamic stability leads to suboptimal outcomes that are predictable in advance. Intuitively, it seems like this should never happen.

If by "predictable in advance" you mean "from the updateless agent's prior", then nope! Updatelessness maximizes EV from the prior, so it will do whatever looks best from this perspective. If that's what you want, then updatelessness is for you! The problem is, we have many pro tanto reasons to think this is not a good representation of rational decision-making in reality, nor the kind of cognition that survives for long in reality. Because of considerations about "the world being so complex that your prior will be missing a lot of stuff". And in particular, multi-agentic scenarios are something that makes this complexity sky-rocket.
Of course, you can say "but that consideration will also be included in your prior". And that does make the situation better. But eventually your prior needs to end. And I argue, that's much before you have all the necessary information to confidently commit to something forever (but other people might disagree with this).
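To make "maximizes EV from the prior" concrete, here is the arithmetic for Counterfactual Mugging as a short Python sketch. The payoffs (a 10,000 reward, a 100 cost, a fair coin) are the usual illustrative numbers and are assumptions of this example, as are the function names.

```python
P_HEADS = 0.5    # prior probability of heads
REWARD = 10_000  # paid on heads, only to agents who would pay on tails
COST = 100       # requested on tails


def ex_ante_ev(pays_on_tails):
    """Expected value of a policy, evaluated from the prior (before the coin flip)."""
    heads_payoff = REWARD if pays_on_tails else 0
    tails_payoff = -COST if pays_on_tails else 0
    return P_HEADS * heads_payoff + (1 - P_HEADS) * tails_payoff


def ex_interim_ev_after_tails(pays_on_tails):
    """Expected value of the action once the agent has updated on seeing tails."""
    return -COST if pays_on_tails else 0


updateless_prefers_paying = ex_ante_ev(True) > ex_ante_ev(False)
updateful_prefers_paying = (
    ex_interim_ev_after_tails(True) > ex_interim_ev_after_tails(False)
)
```

From the prior, the paying policy comes out ahead (4950 against 0), so the updateless agent pays; after updating on tails, paying just looks like losing 100. Nothing in this calculation says the prior itself was a good one, which is where the worry above comes in.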

Comment by Martín Soto (martinsq) on Updatelessness doesn't solve most problems · 2024-02-20T20:23:09.050Z · LW · GW

Is this consistent with the way you're describing decision-making procedures as updateful and updateless?

Absolutely. A good implementation of UDT can, from its prior, decide on an updateful strategy. It's just it won't be able to change its mind about which updateful strategy seems best. See this comment for more.

"flinching away from true information"

As mentioned also in that comment, correct implementations of UDT don't actually flinch away from information: they just decide ex ante (when still not having access to that information) whether or not they will let their future actions depend on it.

The problem remains though: you make the ex ante call about which information to "decision-relevantly update on", and this can be a wrong call, and this creates commitment races, etc.

Comment by Martín Soto (martinsq) on Natural abstractions are observer-dependent: a conversation with John Wentworth · 2024-02-20T20:12:16.679Z · LW · GW

I'm not sure we are in disagreement. No one is denying that the territory shapes the maps (which are part of the territory). The central point is just that our perception of the territory is shaped by our perceptors, etc., and need not be the same. It is still conceivable that, due to how the territory shapes this process (due to the most likely perceptors to be found in evolved creatures, etc.), there ends up being a strong convergence so that all maps represent isomorphically certain territory properties. But this is not a given, and needs further argumentation. After all, it is conceivable for a territory to exist that incentivizes the creation of two very different and non-isomorphic types of maps. But of course, you can argue our territory is not such, by looking at its details.

Where “joint carvy-ness” will end up being, I suspect, related to “gears that move the world,” i.e., the bits of the territory that can do surprisingly much, have surprisingly much reach, etc.

I think this falls for the same circularity I point at in the post: you are defining "naturalness of a partition" as "usefulness to efficiently affect / control certain other partitions", so you already need to care about the latter. You could try to say something like "this one partition is useful for many partitions", but I think that's physically false, by combinatorics (in all cases you can always build as many partitions that are affected by another one). More on these philosophical subtleties here: Why does generalization work?