Posts

Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs 2024-07-08T22:24:38.441Z
Self-shutdown AI 2023-08-21T16:48:03.821Z
Localizing goal misgeneralization in a maze-solving policy network 2023-07-06T16:21:03.813Z
Reverse engineering of the simulation 2022-02-07T21:36:35.874Z
What do we *really* expect from a well-aligned AI? 2021-01-04T20:57:52.389Z

Comments

Comment by jan betley (jan-betley) on Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs · 2025-02-13T10:57:35.719Z · LW · GW

OK, I'll try to make this more explicit:

  • There's an important distinction between "stated preferences" and "revealed preferences"
  • In humans, these preferences are often very different. See e.g. here
  • What they measure in the paper is only stated preferences
  • What people think of when talking about utility maximization is revealed preferences
  • Also, when people worry about utility maximization in AIs, it's about revealed preferences
  • I see no reason to believe that in LLMs stated preferences should correspond to revealed preferences

The only way I know to make a language model "agentic" is to ask it questions about which actions to take

Sure! But taking actions reveals preferences rather than stating them. That's the key difference here.

Comment by jan betley (jan-betley) on Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs · 2025-02-12T17:59:39.367Z · LW · GW

I just think what you're measuring is very different from what people usually mean by "utility maximization". I like how this X comment puts it:

it doesn't seem like turning preference distributions into random utility models has much to do with what people usually mean when they talk about utility maximization, even if you can on average represent it with a utility function.

So, in other words: I don't think claims about utility maximization based on MC questions can be justified. See also Olli's comment.

Anyway, here's what would be needed beyond your Section 5.3 results: show that an AI, in very different agentic environments where its actions have at least slightly "real" consequences, behaves in a way consistent with some utility function (ideally the one derived from your MC questions). That is what utility maximization means for most people.

Comment by jan betley (jan-betley) on Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs · 2025-02-12T16:44:02.978Z · LW · GW

My question is: why do you say "AI outputs are shaped by utility maximization" instead of "AI outputs to simple MC questions are self-consistent"? Do you believe these two things mean the same thing, or that they are different and you've shown the former, not only the latter?

Comment by jan betley (jan-betley) on Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs · 2025-02-12T15:36:21.627Z · LW · GW

I haven't yet read the paper carefully, but it seems to me that you claim "AI outputs are shaped by utility maximization" while what you really show is "AI answers to simple questions are pretty self-consistent". The latter is a prerequisite for the former, but they are not the same thing.

Comment by jan betley (jan-betley) on Daniel Tan's Shortform · 2025-02-10T14:46:43.830Z · LW · GW

This is pretty interesting. It would be nice to have a systematic, large-scale evaluation, for two main reasons:

  • Just knowing which model is best could be useful for future steganography evaluations
  • I'm curious whether being in the same family helps (e.g. is it easier for LLaMA 70b to play against LLaMA 8b or against GPT-4o?).

Comment by jan betley (jan-betley) on Gary Marcus now saying AI can't do things it can already do · 2025-02-10T12:00:52.982Z · LW · GW

  1. GM: AI has so far solved only 5 out of 6 Millennium Prize Problems. As I've been saying since 2022, we need a new approach for the last one because deep learning has hit a wall.

Comment by jan betley (jan-betley) on Numberwang: LLMs Doing Autonomous Research, and a Call for Input · 2025-01-21T22:18:51.678Z · LW · GW

Yes, thank you! (LW post should appear relatively soon)

Comment by jan betley (jan-betley) on Tips and Code for Empirical Research Workflows · 2025-01-21T08:03:03.860Z · LW · GW

I have one question:

asyncio is very important to learn for empirical LLM research since it usually involves many concurrent API calls

I have lots of asyncio experience, but I've never seen a reason to use it for concurrent API calls: concurrent.futures, especially ThreadPoolExecutor, works just as well for concurrent API calls and is more convenient than asyncio (you don't need await, you don't need the event loop, etc.).
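For example, a minimal sketch of what I mean (the model name and prompts are just placeholders, assuming the standard openai client):

    from concurrent.futures import ThreadPoolExecutor
    from openai import OpenAI

    client = OpenAI()

    def call_model(prompt: str) -> str:
        # A single blocking API call; the thread pool takes care of the concurrency.
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    prompts = [f"What is {i} + {i}?" for i in range(20)]

    # No async/await and no event loop - just map the prompts over a pool of threads.
    with ThreadPoolExecutor(max_workers=10) as pool:
        answers = list(pool.map(call_model, prompts))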

Am I missing something? Or is this just a matter of taste?

Comment by jan betley (jan-betley) on Numberwang: LLMs Doing Autonomous Research, and a Call for Input · 2025-01-17T18:33:40.775Z · LW · GW

If you do anything further along these lines, I'd love to know about it!

Unfortunately not, sorry (though I do think this is very interesting!). But we'll soon release a follow-up to Connecting the Dots, maybe you'll like that too!

Comment by jan betley (jan-betley) on Numberwang: LLMs Doing Autonomous Research, and a Call for Input · 2025-01-17T12:44:51.684Z · LW · GW

Definitely similar, and nice design! I hadn't seen that before, unfortunately. How did the models do on it?

Unfortunately I don't remember many details :/

My vague memories:

  • Nothing impressive, but with CoT you sometimes see examples that clearly show some useful skills
  • You often see reasoning like "I now know f(1) = 2, f(2) = 4. So maybe it multiplies by two. I could make a guess now, but let's try again with a very different number. What's f(57)?"
  • Or "I know f(1) = 1, f(2) = 1, f(50) = 2, f(70) = 2, so maybe the function assigns 1 below some threshold and 2 above it. Let's look for the threshold using binary search" (and it uses the search correctly)
  • There's very little strategy in terms of "is this a good moment to make a guess?" - sometimes models gather like 15 cases that confirm their initial hypothesis. Maybe that's bad prompting though.

But note that I think my eval is significantly easier than yours.

Also have you seen 'Connecting the Dots'? That tests a few things, but one of them is whether, after fine-tuning on (x, f(x)) pairs, the model can articulate what f is, and compute the inverse. Really interesting paper.

I even wrote (a small part of) it : ) I'm glad you find it interesting!

Comment by jan betley (jan-betley) on Numberwang: LLMs Doing Autonomous Research, and a Call for Input · 2025-01-16T21:49:32.902Z · LW · GW

I once implemented something a bit similar.

The idea there is simple: there's a hidden int -> int function and an LLM must guess it. It can execute the function, i.e. provide an input and observe the output. To guess the function in a reasonable number of steps it needs to generate and test hypotheses that narrow down the range of possible functions.
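A toy version of the loop, just to show the structure (the hidden function and the scripted "model" policy below are made-up stand-ins; the real eval used an actual LLM deciding when to query and when to guess):

    def hidden_function(x: int) -> int:
        # The function the model has to identify; in the real eval this varied per task.
        return 2 * x + 1

    def run_eval(max_steps: int = 10) -> bool:
        observations = {}
        for _ in range(max_steps):
            # Here an LLM would look at `observations` and decide whether to query
            # another input or commit to a guess; this scripted stand-in just
            # queries 1, 2, 3, ... and guesses a linear fit once it has two points.
            x = len(observations) + 1
            observations[x] = hidden_function(x)
            if len(observations) >= 2:
                xs = sorted(observations)
                a = observations[xs[1]] - observations[xs[0]]
                b = observations[xs[0]] - a * xs[0]
                if all(a * x + b == y for x, y in observations.items()):
                    # Score the guess on held-out inputs.
                    return all(a * x + b == hidden_function(x) for x in range(100, 110))
        return False

    print(run_eval())  # True for this simple linear hidden function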

Comment by jan betley (jan-betley) on How do you deal w/ Super Stimuli? · 2025-01-14T17:08:53.366Z · LW · GW

My solution:

  1. Make starting inconvenient. I have no FB app/tiktok/YT on my phone. I can log in to FB in a browser, but I intentionally set some random password I don't remember, so each time I need to go through the password recovery process.
  2. What to do once you've started. Whenever I find a moment when I'm strong enough to stop watching, I also log out/uninstall the app, i.e. revert to the state where starting is inconvenient.
  3. When starting is inconvenient, I have those 30 seconds or so to reflect on "where will this lead?", and that is usually enough not to start.

This works well for me. I watch short videos on days when I'm too tired to do anything productive - it's quite pleasant and relaxing then - but I've learned to force myself to log out afterwards, and that is enough.

Comment by jan betley (jan-betley) on Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs · 2025-01-13T13:34:45.619Z · LW · GW

txt in GetTextResponse should be just a string, but here you have a list of strings. I'm not saying this is the only problem : ) See also https://github.com/LRudL/evalugator/blob/1787ab88cf2e4cdf79d054087b2814cc55654ec2/evalugator/api/providers/openai.py#L207-L222

Comment by jan betley (jan-betley) on The Intelligence Curse · 2025-01-04T15:21:36.079Z · LW · GW

Maybe I'm missing something important, but I think AGI won't be much like a resource, and I also don't think we'll see rentier entities. I'm not saying it will be better, though.

The key thing about oil or coal is that it's already there: you roughly know how much it's worth, and this value won't change much whatever you do (or don't do). With AI this is different, because you'll constantly have many competitors trying to create a new AI that is either stronger or cheaper than yours. It's not as if the deeper you dig, the more & better oil you get.

So you can't really become a rentier - you must spend all your resources on charging forward, because if you don't, you'll be left behind forever. If we assume there's no single entity that takes the lead and restricts competition forever, this might lead to some version of an ascended economy. That's probably even worse for the average human: an AGI rentier won't care about you, but it also won't care much about your small field full of potatoes that will hopefully let you survive one more winter.

Comment by jan betley (jan-betley) on Which things were you surprised to learn are not metaphors? · 2024-11-24T03:34:27.101Z · LW · GW

Oh yeah. How do I know I'm angry? My back is stiff and starts to hurt.

Comment by jan betley (jan-betley) on Seven lessons I didn't learn from election day · 2024-11-15T08:05:41.054Z · LW · GW
Comment by jan betley (jan-betley) on Seven lessons I didn't learn from election day · 2024-11-15T07:50:07.754Z · LW · GW

The second reason that I don't trust the neighbor method is that people just... aren't good at knowing who a majority of their neighbors are voting for. In many cases it's obvious (if over 70% of your neighbors support one candidate or the other, you'll probably know). But if it's 55-45, you probably don't know which direction it's 55-45 in.

My guess is that there's some postprocessing here. E.g. if you assume that the "neighbor" estimate is biased but doesn't suffer from the refusal problem, and you have the same data from the previous election, then you could estimate the shift in opinion and apply that to other polls that ask about your own vote. Or you could ask some additional question like "who did your neighbours vote for in the previous election" and compare that to the real data (ideally per county or so). I would be very surprised if they based their bets just on the raw results.
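To illustrate the kind of postprocessing I mean (all numbers made up):

    # Previous election: how far the "neighbor" question was off from the actual result.
    neighbor_share_prev = 0.48   # "most of my neighbors vote for X" share, previous election
    actual_share_prev = 0.46     # X's actual vote share, previous election
    bias = neighbor_share_prev - actual_share_prev

    # Current election: debias the raw "neighbor" number with the historical offset.
    neighbor_share_now = 0.51
    estimate_now = neighbor_share_now - bias
    print(round(estimate_now, 2))  # 0.49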

Comment by jan betley (jan-betley) on The lying p value · 2024-11-12T14:49:46.929Z · LW · GW

My initial answer to your starting question was "I disagree with this statement because they likely used 20 different ways of calculating the p-value and selected the one that was statistically significant". Also https://en.m.wikipedia.org/wiki/Replication_crisis

Comment by jan betley (jan-betley) on 2024 Unofficial LW Community Census, Request for Comments · 2024-11-02T00:15:42.168Z · LW · GW

I don't think there's a lot of value in distinguishing between 3000 and 1,000,000, and for any aggregate you'll want to show, this will probably just be "later than 2200" or something like that. But yes, this way they can't state that it will be 1,000,000, which is some downside.

I'm not a big fan of looking at the neighbors to decide whether this is a missing answer or high estimate (it's OK to not want to answer this one question). So some N/A or -1 should be ok.

(Just to make it clear, I'm not saying this is an important problem)

Comment by jan betley (jan-betley) on 2024 Unofficial LW Community Census, Request for Comments · 2024-11-01T23:50:43.655Z · LW · GW

I think you can also ask them to put some arbitrarily large number (3000 or 10000 or so) and then just filter out all the numbers above some threshold.

Comment by jan betley (jan-betley) on 2024 Unofficial LW Community Census, Request for Comments · 2024-11-01T22:49:17.205Z · LW · GW

By what year do you think the Singularity will occur? Answer such that you think, conditional on the Singularity occurring, there is an even chance of the Singularity falling before or after this year. If you think a singularity is so unlikely you don't even want to condition on it, leave this question blank

You won't be able to distinguish people who think the singularity is super unlikely from people who just didn't want to answer this question for whatever reason (maybe they forgot, or thought it was boring, or whatever).

Comment by jan betley (jan-betley) on Music in the AI World · 2024-08-16T14:03:23.296Z · LW · GW

Anyway, in such a world some people would probably evolve music that is much more interesting to the public

I wouldn't be so sure.

I think the current diversity of music is largely caused by artists' different lived experiences. You feel something, it's important to you, and you try to express it via music. As long as AIs don't have anything like "unique experiences" on a human scale, I'm not sure they'll be able to create music that is that diverse (and thus that interesting).

I'm assuming the scenario you described, not a personal AI trained on your whole life. With the latter, it could work.

(Note that I mostly think about small bands, not popular-music-optimised-for-wide-publicity).

Comment by jan betley (jan-betley) on Value fragility and AI takeover · 2024-08-12T19:00:51.667Z · LW · GW

Are there other arguments for active skepticism about Multipolar value fragility? I don’t have a ton of great stories

To me, the story looks roughly like this:

  1. The agents won't have random values - they will be scattered somewhere around the current status quo.
  2. Therefore the future we'll end up in should not be that far from the status quo.
  3. The current world is pretty good.

(I'm not saying I'm sure this is correct, it's just roughly how I think about that)

Comment by jan betley (jan-betley) on Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs · 2024-07-18T12:00:52.317Z · LW · GW

This is interesting! Although I think it's pretty hard to use that in a benchmark (because you need a set of problems assigned to clearly defined types and I'm not aware of any such dataset).

There are some papers on "do models know what they know", e.g. https://arxiv.org/abs/2401.13275 or https://arxiv.org/pdf/2401.17882.

Comment by jan betley (jan-betley) on Book Review: Going Infinite · 2023-10-30T07:49:48.832Z · LW · GW

A shame Sam didn't read this:

But if you are running on corrupted hardware, then the reflective observation that it seems like a righteous and altruistic act to seize power for yourself—this seeming may not be much evidence for the proposition that seizing power is in fact the action that will most benefit the tribe.

Comment by jan betley (jan-betley) on Localizing goal misgeneralization in a maze-solving policy network · 2023-10-16T18:52:40.155Z · LW · GW

Thanks! Indeed, shard theory fits here pretty well. I didn't think about that while writing the post.

Comment by jan betley (jan-betley) on Against Almost Every Theory of Impact of Interpretability · 2023-08-19T20:08:28.554Z · LW · GW

Very good post! I agree with most of what you have written, but I'm not sure about the conclusions. Two main reasons:

  1. I'm not sure mech interp should be compared to astronomy; I'd say it is more like mechanical engineering. We have the JWST because, a long, long time ago, there were watchmakers, gunsmiths, opticians etc. who didn't care at all about astronomy, yet their advances in unrelated fields made astronomy possible. I think something similar might happen with mech interp - we'll keep creating better and better tools to achieve some goals, those goals will in the end turn out to be useless from the alignment point of view, but the tools will not.

  2. Many people think mech interp is cool and fun. I'm personally not a big fan, but I think it is much more interesting than e.g. governance. If our only perspective is AI safety, this shouldn't matter - but people have many perspectives. There might not really be a choice between "this bunch of junior researchers doing mech interp vs this bunch of junior researchers doing something more useful" - they would just go do something unrelated to alignment instead. My guess is that the attractiveness of mech interp is the strongest factor behind its popularity.

Comment by jan betley (jan-betley) on Reverse engineering of the simulation · 2022-02-08T20:50:25.371Z · LW · GW

I don't think this answer is in any way related to my question.

This is my fault, because I didn't explain what exactly I mean by the "simulation", and the meaning is different from the most popular one. Details are in the EDIT in the main post.

Comment by jan betley (jan-betley) on Covid 3/18: An Expected Quantity of Blood Clots · 2021-03-18T18:54:51.291Z · LW · GW

I think EU countries might be calculating something like this:

A) Go on with AZ --> people keep talking about killer vaccines, about how you should never trust the government, about how no sane person should vaccinate, and about "blood clots today, what tomorrow?"

B) Halt AZ, then say "we checked carefully, everything's fine, we care, we don't want to kill anyone with our vaccine" and start again --> people will trust the vaccines just a little more.

And in the long term, general trust in vaccines is much more important than a few weeks' delay.

I think you assume that scenario A is also better for vaccine trust - maybe, I don't know, but I wouldn't be surprised if European governments saw it the other way.

Also, obviously the best solution is "hey people, let's just stop talking about the goddamned blood clots", but The Virtue of Silence (https://www.lesswrong.com/posts/2brqzQWfmNx5Agdrx/the-virtue-of-silence) is not popular enough : )

Comment by jan betley (jan-betley) on What do we *really* expect from a well-aligned AI? · 2021-01-06T14:56:26.194Z · LW · GW

A simple way of rating the scenarios above is to describe them as you have and ask humans what they think.

Do you think this is worth doing?

I thought that

  • either this was done a billion times and I just missed it
  • or this is neither important nor interesting to anyone but me

Comment by jan betley (jan-betley) on What do we *really* expect from a well-aligned AI? · 2021-01-06T14:38:28.868Z · LW · GW

What's wrong with the AI making life into a RPG (or multiple thereof)? People like stories and they like levelling up, collecting stuff, crafting, competing, etc. A story doesn't have to be pure fun (and those sort of stories are boring anyway).

E.g. Eliezer seems to think it's not the perfect future: "The presence or absence of an external puppet master can affect my valuation of an otherwise fixed outcome. Even if people wouldn't know they were being manipulated, it would matter to my judgment of how well humanity had done with its future. This is an important ethical issue, if you're dealing with agents powerful enough to helpfully tweak people's futures without their knowledge".

Also, you write:

If we want to have a shot at creating a truly enduring culture (of the kind that is needed to get us off this planet and out into the galaxy)

If we really want this, we have to refrain from spending our whole lives playing the best RPG possible.

Never mind AI, they're contradictory when executed by us. We aren't robots following a prioritised script and an AI wouldn't be either.

Consider the human rules "you are allowed to lie to someone for the sake of their own utility" and "everyone should be able to take control of their own life". We know that lies about serious things never turn out well, so we lie only about things of little importance, and little lies like "yes grandma, that was very tasty" don't contradict the second rule. This looks different when you are an ultimate deceiver.

Comment by jan betley (jan-betley) on The Darwin Game · 2020-10-09T11:43:21.935Z · LW · GW

Your TitForTatBot:

  • never sets self.previous
  • even if it were set, it would stop cooperating once the opponent played 0 (rough sketch of the fix I mean below)
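Roughly the fix I have in mind (just a sketch - I'm assuming moves are ints 0-5 with the "sum at most 5" payoff, and a move(opponent_previous) interface; the exact signature is a placeholder):

    class TitForTatBot:
        def __init__(self):
            self.previous = None  # opponent's last move; must actually be updated each round

        def move(self, opponent_previous):
            # Remember the opponent's last move...
            if opponent_previous is not None:
                self.previous = opponent_previous
            # ...and compare against None, not truthiness, so that an opponent
            # playing 0 doesn't look like "no previous move" and kill cooperation.
            if self.previous is None:
                return 2  # opening move
            return 5 - self.previous  # aim for the cooperative total of 5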

Also, I agree with Zvi's comment: why 2.5 for free? This way one should really concentrate on maxing out in the early stage - is that intended?