tamsin-leake

Posts
Comments

Posts

Epistemic states as a potential benign prior 2024-08-31T18:26:14.093Z

How LDT helps reduce the AI arms race 2023-12-10T16:21:44.409Z

We're all in this together 2023-12-05T13:57:46.270Z

formalizing the QACI alignment formal-goal 2023-06-10T03:28:29.541Z

Orthogonal's Formal-Goal Alignment theory of change 2023-05-05T22:36:14.883Z

Orthogonal: A new agent foundations alignment organization 2023-04-19T20:17:14.174Z

Nantes, France – ACX Meetups Everywhere 2021 2021-08-23T08:46:08.899Z

Comments

Comment by Tamsin Leake (carado-1) on Orienting to 3 year AGI timelines · 2024-12-22T17:36:36.944Z · LW · GW

10M$ sounds like it'd be a lot for PauseAI-type orgs imo, though admittedly this is not a very informed take.

Anyways, I stand by my comment; I expect throwing money at PauseAI-type orgs is better utility per dollar than nvidia even after taking into account that investing in nvidia to donate to PauseAI later is a possibility.

Comment by Tamsin Leake (carado-1) on Orienting to 3 year AGI timelines · 2024-12-22T16:39:57.780Z · LW · GW

Thoroughly agree except for what to do with money. I expect that throwing money at orgs that are trying to slow down AI progress (eg PauseAI, or better if someone makes something better) gets you more utility per dollar than nvidia (and also it's more ethical).

Edit: to be clear, I mean actual utility in your utility function. Even if you're fully self-interested and not altruistic at all, I still think your interests are better served by donating to PauseAI-type orgs than investing in nvidia.

Comment by carado-1 on [deleted post] 2024-12-11T07:42:52.126Z

"why would they be doing that?"

same reason people make poor decisions all the time. if they had a clear head and hadn't already sunk some cost into AI, they could see that working on AI might make them wealthy in the short term but it'll increase {the risk that they die soon} enough that they go "not worth it", as they should. but once you're already working in AI stuff, it's tempting and easy to retroactively justify why doing that is safe. or to just not worry about it and enjoy the money, even though if you thought about the impact of your actions on your own survival in the next few years you'd decide to quit.

at least that's my vague best guess.

Comment by carado-1 on [deleted post] 2024-12-11T07:36:56.149Z

people usually think of corporations as either {advancing their own interests and also the public's interests} or {advancing their own interests at cost to the public} — ime mostly the latter. what's actually going on with AI frontier labs, i.e. {going against the interests of everyone including themselves}, is very un-memetic and very far from the overton window.

in fiction, the heads of big organizations are either good (making things good for everyone) or evil (worsening everyone else's outcomes, but improving their own). most of the time, just evil. very rarely are they suicidal fools semi-unknowningly trying to kill everyone including themselves.

and the AI existential risk thing just doesn't stick if you take it as a given that the organizations are acting in their own interest, because dying is not in their own interest.

the public systematically underestimates the foolishness of AI frontier labs.

Comment by Tamsin Leake (carado-1) on (draft) Cyborg software should be open (?) · 2024-11-01T19:59:17.637Z · LW · GW

^^ Why wouldn't people seeing a cool cyborg tool just lead to more cyborg tools? As opposed to the black boxes that big tech has been building?

You imply a cyborg tool is a "powerful unaligned AI", it's not, it's a tool to improve bandwidth and throughput between any existing AI (which remains untouched by cyborg research) and the human

I was making a more general argument that applies mainly to powerful AI but also to all other things that might help one build powerful AI (such as: insights about AI, cyborg tools, etc). These things-that-help have the downside that someone could use them to build powerful but unaligned AI, which is ultimately the thing we want to delay / reduce-the-probability-of. Whether the downside is bad enough that making them public/popular is net bad is the thing that's uncertain, but I lean towards yes, it is net bad.

I believe that:

It is bad for cyborg tools to be broadly available because that'll help {people trying to build the kind of AI that'd kill everyone} more than they'll {help people trying to save the world}.
It is bad for insights about AI to spread because of the same reason.
It is bad for LLM assistants to be broadly available for the same reason.

Only reasonable people who think hard about AI safety will understand the power of cyborgs

I don't think I'm particularly relying on that assumption?? I don't understand what sounded like I think this.

In any case, I'm not making strict "only X are Y" or "all X are Y" statements; I'm making quantitative "X are disproportionately more Y" statements.

That people won't eventually find out.

I believe that capabilities overhang is temporary, that inevitably "the dam will burst"

Well, yes. And at that point the world is much more doomed; the world has to be saved ahead of that. To increase the probability that we have time to save the world before people find out, we want to buy time. I agree it's inevitable, but it can be delayed. Making tools and insights broadly available hastens the bursting of the dam, which is bad; containing them delays the bursting of the dam, which is good.

Comment by Tamsin Leake (carado-1) on (draft) Cyborg software should be open (?) · 2024-11-01T14:23:05.429Z · LW · GW

I think (not sure!) the damage from people/orgs/states going "wow, AI is powerful, I will try to build some" is larger than the upside of people/orgs/states going "wow, AI is powerful, I should be scared of it". It only takes one strong enough one of the former to kill everyone, and the latter is gonna have a very hard time stopping all of them.

By not informing the public that AI is indeed powerful, awareness of that fact is disproportionately allocated to people who will choose to think hard about it on their own, and thus that knowledge is more likely to be in reasonabler hands (for example they'd also be more likely to think "hmm maybe I shouldn't build unaligned powerful AI").

The same goes for cyborg tools, as well as general insights about AI: we should want them to be differentially accessible to alignment people than the general public.

In fact, my biggest criticism of OpenAI is not that they built GPTs, but that they productized it, made it widely available, and created a giant public frenzy about LLMs. I think we'd have more time to solve alignment if they kept it internally and the public wasn't thinking about AI nearly as much.

Comment by Tamsin Leake (carado-1) on The Hopium Wars: the AGI Entente Delusion · 2024-10-13T22:29:53.610Z · LW · GW

Even if tool AI is controllable, tool AI can be used to assist in building non-tool AI. A benign superassistant is one query away from outputting world-ending code.

Comment by Tamsin Leake (carado-1) on If I have some money, whom should I donate it to in order to reduce expected P(doom) the most? · 2024-10-03T13:10:44.192Z · LW · GW

In my opinion the hard part would not be figuring out where to donate to {decrease P(doom) a lot} rather than {decrease P(doom) a little}, but figuring out where to donate to {decrease P(doom)} rather than {increase P(doom)}.

Comment by Tamsin Leake (carado-1) on Shortform · 2024-09-21T05:57:34.056Z · LW · GW

(oops, this ended up being fairly long-winded! hope you don't mind. feel free to ask for further clarifications.)

There's a bunch of things wrong with your description, so I'll first try to rewrite it in my own words, but still as close to the way you wrote it (so as to try to bridge the gap to your ontology) as possible. Note that I might post QACI 2 somewhat soon, which simplifies a bunch of QACI by locating the user as {whatever is interacting with the computer the AI is running on} rather than by using a beacon.

A first pass is to correct your description to the following:

We find a competent honourable human at a particular point in time , like Joe Carlsmith or Wei Dai, and give them a rock engraved with a 1GB secret key, large enough that in counterfactuals it could replace with an entire snapshot of . We also give them the ability to express a 1GB output, eg by writing a 1GB key somewhere which is somehow "signed" as the only . This is part of $H$ — $H$ is not just the human being queried at a particular point in time, it's also the human producing an answer in some way. So $H$ is a function from 1GB bitstring to 1GB bitstring. We define $H^{+}$ as $H$ , followed by whichever new process $H$ describes in its output — typically another instance of $H$ except with a different 1GB payload.
We want a model $M$ of the agent $H^{+}$ . In QACI, we get $M$ by asking a Solomonoff-like ideal reasoner for their best guess about $H^{+}$ after feeding them a bunch of data about the world and the secret key.
We then ask $M$ the question $q$ , "What's the best utility-function-over-policies to maximise?" to get a utility function $U$ $: (O \times A)^{*} \to R$ . We then **ask our solomonoff-like ideal reasoner for their best guess about which action $A$ maximizes $U$ .

Indeed, as you ask in question 3, in this description there's not really a reason to make step 3 an extra thing. The important thing to notice here is that model $M$ might get pretty good, but it'll still have uncertainty.

When you say "we get $M$ by asking a Solomonoff-like ideal reasoner for their best guess about $H^{+}$ ", you're implying that — positing U(M,A) to be the function that says how much utility the utility function returned by model M attributes to action A (in the current history-so-far) — we do something like:

  let M ← oracle(argmax { for model M } 𝔼 { over uncertainty } P(M))
  let A ← oracle(argmax { for action A } U(M, A))
  perform(A)

Indeed, in this scenario, the second line is fairly redundant.

The reason we ask for a utility function is because we want to get a utility function within the counterfactual — we don't want to collapse the uncertainty with an argmax before extracting a utility function, but after. That way, we can do expected-given-uncertainty utility maximization over the full distribution of model-hypotheses, rather than over our best guess about $M$ . We do:

  let A ← oracle(argmax { for A } 𝔼 { for M, over uncertainty } P(M) · U(M, A))
  perform(A)

That is, we ask our ideal reasoner (oracle) for the action with the best utility given uncertainty — not just logical uncertainty, but also uncertainty about which $M$ . This contrasts with what you describe, in which we first pick the most probable $M$ and then calculate the action with the best utility according only to that most-probable pick.

To answer the rest of your questions:

Is this basically IDA, where Step 1 is serial amplification, Step 2 is imitative distillation, and Step 3 is reward modelling?

Unclear! I'm not familiar enough with IDA, and I've bounced off explanations for it I've seen in the past. QACI doesn't feel to me like it particularly involves the concepts of distillation or amplification, but I guess it does involve the concept of iteration, sure. But I don't get the thing called IDA.

Why not replace Step 1 with Strong HCH or some other amplification scheme?

It's unclear to me how one would design an amplification scheme — see concerns of the general shape expressed here. The thing I like about my step 1 is that the QACI loop (well, really, graph (well, really, arbitrary computation, but most of the time the user will probably just call themself in sequence)) is that its setup doesn't involve any AI at all — you could go back in time before the industrial revolution and explain the core QACI idea and it would make sense assuming time-travelling-messages magic, and the magic wouldn't have to do any extrapolating. Just tell someone the idea is that they could send a message to {their past self at a particular fixed point in time}. If there's any amplification scheme, it'll be one designed by the user, inside QACI, with arbitrarily long to figure it out.

What does "bajillion" actually mean in Step 1?

As described above, we don't actually pre-determine the length of the sequence, or in fact the shape of the graph at all. Each iteration decides whether to spawn one or several next iteration, or indeed to spawn an arbitrarily different long-reflection process.

Why are we doing Step 3? Wouldn't it be better to just use M directly as our superintelligence? It seems sufficient to achieve radical abundance, life extension, existential security, etc.

Why not ask M for the policy π directly? Or some instruction for constructing π? The instruction could be "Build the policy using our super-duper RL algo with the following reward function..." but it could be anything.

Hopefully my correction above answers these.

What if there's no reward function that should be maximised? Presumably the reward function would need to be "small", i.e. less than a Exabyte, which imposes a maybe-unsatisfiable constraint.

(Again, untractable-to-naively-compute utility function*, not easily-trained-on reward function. If you have an ideal reasoner, why bother with reward functions when you can just straightforwardly do untractable-to-naively-compute utility functions?)

I guess this is kinda philosophical? I have some short thoughts on here. If an exabyte is enough to describe to describe {a communication channel with a human-on-earth} to an AI-on-earth, which I think seems likely, then it's enough to build "just have a nice corrigible assistant ask the humans what they want"-type channels.

Put another way: if there are actions which are preferable to other actions, then it seems to me like utility function are a fully lossless way for counterfactual QACI users to express which kinds of actions they want the AI to perform, which is all we need. If there's something wrong with utility function over worlds, then counterfactual QACI users can output a utility function which favors actions which lead to something other than utility maximization over worlds, for example actions which lead to the construction of a superintelligent corrigible assistant which will help the humans come up with a better scheme.

Why is there no iteration, like in IDA? For example, after Step 2, we could loop back to Step 1 but reassign $H$ as $H$ with oracle access to $M$ .

Again, I don't get IDA. Iteration doesn't seem particularly needed? Note that inside QACI, the user does have access to an oracle and to all relevant pieces of hypothesis about which hypothesis it is inhabiting in — this is what, in the QACI math, this line does:

${QACI}_{0}$ 's distribution over answers demands that the answer payload $π_{r}$ , when interpreted as math and with all required contextual variables passed as input ( $q, μ 1, μ 2, α, γ_{q}, ξ$ ).

Notably, $α$ is the hypothesis for which world the user is being considered in, and $γ_{q}, ξ$ for their location within that world. Those are sufficient to fully characterize the hypothesis-for- $H$ that describes them. And because the user doesn't really return just a string but a math function which takes $q, μ 1, μ 2, α, γ_{q}, ξ$ as input and returns a string, they can have that math function do arbitrary work — including rederive $H$ . In fact, rediriving $H$ is how they call a next iteration: they say (except in math) "call $H$ again (rederived using $q, μ 1, μ 2, α, γ_{q}, ξ$ ), but with this string, and return the result of that." See also this illustration, which is kinda wrong in places but gets the recursion call graph thing right.

Another reason to do "iteration" like this inside the counterfactual rather than in the actual factual world (if that's what IDA does, which I'm only guessing here) is that we don't have as many iteration steps as we want in the factual world — eventually OpenAI or someone else kills everyone, whereas in the counterfactual, the QACI users are the only ones who can make progress, so the QACI users essentially have as long as they want, so long as they don't take too long in each individual counterfactual step or other somewhat easily avoided actions like that.

Why isn't Step 3 recursive reward modelling? i.e. we could collect a bunch of trajectories from $π$ and ask $M$ to use those trajectories to improve the reward function.

Unclear if this still means anything given the rest of this post. Ask me again if it does.

Comment by Tamsin Leake (carado-1) on The case for more Alignment Target Analysis (ATA) · 2024-09-20T06:44:57.578Z · LW · GW

Hi !

ATA is extremely neglected. The field of ATA is at a very early stage, and currently there does not exist any research project dedicated to ATA. The present post argues that this lack of progress is dangerous and that this neglect is a serious mistake.

I agree it's neglected, but there is in fact at least one researh project dedicated to at least designing alignment targets: the part of the formal alignment agenda dedicated to formal outer alignment, which is the design of math problems to which solutions would be world-saving. Our notable attempts at this are QACI and ESP (there was also some work on a QACI2, but it predates (and in-my-opinion is superceded by) ESP).

Those try to implement CEV in math. They only work for doing CEV of a single person or small group, but that's fine: just do CEV of {a single person or small group} which values all of humanity/moral-patients/whatever getting their values satisfied instead of just that group's values. If you want humanity's values to be satisfied, then "satisfying humanity's values" is not opposite to "satisfy your own values", it's merely the outcome of "satisfy your own values".

Comment by Tamsin Leake (carado-1) on Being nicer than Clippy · 2024-09-20T06:38:21.343Z · LW · GW

I wonder how much of those seemingly idealistic people retained power when it was available because they were indeed only pretending to be idealistic. Assuming one is actually initially idealistic but then gets corrupted by having power in some way, one thing someone can do in CEV that you can't do in real life is reuse the CEV process to come up with even better CEV processes which will be even more likely to retain/recover their just-before-launching-CEV values. Yes, many people would mess this up or fail in some other way in CEV; but we only need one person or group who we'd be somewhat confident would do alright in CEV. Plausibly there are at least a few eg MIRIers who would satisfy this. Importantly, to me, this reduces outer alignment to "find someone smart and reasonable and likely to have good goal-content integrity", which is a matter of social & psychology that seems to be much smaller than the initial full problem of formal outer alignment / alignment target design.
One of the main reasons to do CEV is because we're gonna die of AI soon, and CEV is a way to have infinite time to solve the necessary problems. Another is that even if we don't die of AI, we get eaten by various moloch instead of being able to safely solve the necessary problems at whatever pace is necessary.

Comment by Tamsin Leake (carado-1) on Being nicer than Clippy · 2024-09-20T06:23:03.629Z · LW · GW

the main arguments for the programmers including all of [current?] humanity in the CEV "extrapolation base" […] apply symmetrically to AIs-we're-sharing-the-world-with at the time

I think timeless values might possibly help resolve this; if some {AIs that are around at the time} are moral patients, then sure, just like other moral patients around they should get a fair share of the future.

If an AI grabs more resources than is fair, you do the exact same thing as if a human grabs more resources than is fair: satisfy the values of moral patients (including ones who are no longer around) not weighed by how much leverage they current have over the future, but how much leverage they would have over the future if things had gone more fairly/if abuse/powergrab/etc wasn't the kind of thing that gets your more control of the future.

"Sorry clippy, we do want you to get some paperclips, we just don't want you to get as many paperclips as you could if you could murder/brainhack/etc all humans, because that doesn't seem to be a very fair way to allocate the future." — and in the same breath, "Sorry Putin, we do want you to get some of whatever-intrinsic-values-you're-trying-to-satisfy, we just don't want you to get as much as ruthlessly ruling Russia can get you, because that doesn't seem to be a very fair way to allocate the future."

And this can apply regardless of how much of clippy already exists by the time you're doing CEV.

Comment by Tamsin Leake (carado-1) on Being nicer than Clippy · 2024-09-20T06:07:10.703Z · LW · GW

trying to solve morality by themselves

It doesn't have to be by themselves; they can defer to others inside CEV, or come up with better schemes that their initial CEV inside CEV and then defer to that. Whatever other solutions than "solve everything on your own inside CEV" might exist, they can figure those out and defer to them from inside CEV. At least that's the case in my own attempts at implementing CEV in math (eg QACI).

Comment by Tamsin Leake (carado-1) on Lucius Bushnaq's Shortform · 2024-09-18T20:20:17.008Z · LW · GW

Seems really wonky and like there could be a lot of things that could go wrong in hard-to-predict ways, but I guess I sorta get the idea.

I guess one of the main things I'm worried about is that it seems to require that we either:

Be really good at timing when we pause it to look at its internals, such that we look at the internals after it's had long enough to think about things that there are indeed such representations, but not long enough that it started optimizing really hard such that we either {die before we get to look at the internals} or {the internals are deceptively engineered to brainhack whoever would look at them}. If such a time interval even occurs for any amount of time at all.
Have an AI that is powerful enough to have powerful internals-about-QACI to look at, but corrigible enough that this power is not being used to do instrumentally convergent stuff like eat the world in order to have more resources with which to reason.

Current AIs are not representative of what dealing with powerful optimizers is like; when we'll start getting powerful optimizers, they won't sit around long enough for us to look at them and ponder, they'll just quickly eat us.

Comment by Tamsin Leake (carado-1) on Lucius Bushnaq's Shortform · 2024-09-18T15:25:29.881Z · LW · GW

So the formalized concept is Get_Simplest_Concept_Which_Can_Be_Informally_Described_As("QACI is an outer alignment scheme consisting of…") ? Is an informal definition written in english?

It seems like "natural latent" here just means "simple (in some simplicity prior)". If I read the first line of your post as:

Has anyone thought about QACI could be located in some simplicity prior, by searching the prior for concepts matching(??in some way??) some informal description in english?

It sure sounds like I should read the two posts you linked (perhaps especially this one), despite how hard I keep bouncing off of the natural latents idea. I'll give that a try.

Comment by Tamsin Leake (carado-1) on Lucius Bushnaq's Shortform · 2024-09-18T10:22:41.357Z · LW · GW

To me kinda the whole point of QACI is that it tries to actually be fully formalized. Informal definitions seem very much not robust to when superintelligences think about them; fully formalized definitions are the only thing I know of that keep meaning the same thing regardless of what kind of AI looks at it or with what kind of ontology.

I don't really get the whole natural latents ontology at all, and mostly expect it to be too weak for us to be able to get reflectively stable goal-content integrity even as the AI becomes vastly superintelligent. If definitions are informal, that feels to me like degrees of freedom in which an ASI can just pick whichever values make its job easiest.

Perhaps something like this allows use to use current, non-vastly-superintelligent AIs to help design a formalized version of QACI or ESP which itself is robust enough to be passed to superintelligent optimizers; but my response to this is usually "have you tried first formalizing CEV/QACI/ESP by hand?" because it feels like we've barely tried and like reasonable progress can be made on it that way.

Perhaps there are some cleverer schemes where the superintelligent optimizer is pointed at the weaker current-tech-level AI, itself pointed informally at QACI, and we tell the superintelligent optimizer "do what this guy says"; but that seems like it either leaves too many degrees of freedom to the superintelligent optimizer again, or it requires solving corrigibility (the superintelligent optimizer is corrigibly assisting the weaker AI) at which point why not just point the corrigibility at the human directly and ignore QACI altogether, at least to begin with.

Comment by Tamsin Leake (carado-1) on Epistemic states as a potential benign prior · 2024-09-01T06:53:43.453Z · LW · GW

The knightian in IB is related to limits of what hypotheses you can possibly find/write down, not - if i understand so far - about an adversary. The adversary stuff is afaict mostly to make proofs work.

I don't think this makes a difference here? If you say "what's the best not-blacklisted-by-any-knightian-hypothesis action", then it doesn't really matter if you're thinking of your knightian hypotheses as adversaries trying to screw you over by blacklisting actions that are fine, or if you're thinking of your knightian hypotheses as a more abstract worst-case-scenario. In both cases, for any reasonable action, there's probly a knightian hypothesis which blacklists it.

Regardless of whether you think of it as "because adversaries" or just "because we're cautious", knightian uncertainty works the same way. The issue is fundamental to doing maximin over knightian hypotheses.

Comment by Tamsin Leake (carado-1) on Morphism's Shortform · 2024-08-26T23:13:34.468Z · LW · GW

This is indeed a meaningful distinction! I'd phrase it as:

Values about what the entire cosmos should be like
Values about what kind of places one wants one's (future) selves to inhabit (eg, in an internet-like upload-utopia, "what servers does one want to hang out on")

"Global" and "local" is not the worst nomenclature. Maybe "global" vs "personal" values? I dunno.

my best idea is to call the former "global preferences" and the latter "local preferences", but that clashes with the pre-existing notion of locality of preferences as the quality of terminally caring more about people/objects closer to you in spacetime

I mean, it's not unrelated! One can view a utility function with both kinds of values as a combination of two utility functions: the part that only cares about the state of the entire cosmos and the part that only cares about what's around them (see also "locally-caring agents").

(One might be tempted to say "consequentialist" vs "experiential", but I don't think that's right — one can still value contact with reality in their personal/local values.)

Comment by Tamsin Leake (carado-1) on Wei Dai's Shortform · 2024-08-26T22:48:58.656Z · LW · GW

That is, in fact, a helpful elaboration! When you said

Most people who "work on AI alignment" don't even think that thinking is a thing.

my leading hypotheses for what you could mean were:

Using thought, as a tool, has not occured to most such people
Most such people have no concept whatsoever of cognition as being a thing, the way people in the year 1000 had no concept whatsoever of javascript being a thing.

Now, instead, my leading hypothesis is that you mean:

Most such people are failing to notice that there's an important process, called "thinking", which humans do but LLMs "basically" don't do.

This is a bunch more precise! For one, it mentions AIs at all.

Comment by Tamsin Leake (carado-1) on quila's Shortform · 2024-07-27T11:09:36.451Z · LW · GW

To be more precise: extrapolated over time, for any undesired selection process or other problem of that kind, either the problem is large enough that it gets exarcerbated over time so much that it eats everything — and then that's just extinction, but slower — or it's not large enough to win out and aligned superintelligence(s) + coordinated human action is enough to stamp it out in the long run, which means they won't be an issue for almost all of the future.

It seems like for a problem to be just large enough that coordination doesn't stamp it away, but also it doesn't eat everything, would be a very fragile equilibrium, and I think that's pretty unlikely.

Comment by Tamsin Leake (carado-1) on quila's Shortform · 2024-06-01T11:16:01.434Z · LW · GW

single-use

Considering how loog it took me to get that by this you mean "not dual-use", I expect some others just won't get it.

Comment by carado-1 on [deleted post] 2024-06-01T09:00:07.205Z

Some people who are very concerned about suffering might be considering building an unaligned AI that kills everyone just to avoid the risk of an AI takeover by an AI aligned to values which want some people to suffer.

Let this be me being on the record saying: I believe the probability of {alignment to values that strongly diswant suffering for all moral patients} is high enough, and the probability of {alignment to values that want some moral patients to suffer} is low enough, that this action is not worth it.

I think this applies to approximately anyone who would read this post, including heads of major labs in case they happen to read this post and in case they're pursuing the startegy of killing everyone to reduce S-risk.

See also: how acausal trade helps in 1, 2, but I think I think this even without acausal trade.

Comment by Tamsin Leake (carado-1) on Non-Disparagement Canaries for OpenAI · 2024-05-30T19:31:37.913Z · LW · GW

sigh I wish people realized how useless it is to have money when the singularity happens. Either we die or we get a utopia in which it's pretty unlikely that pre-singularity wealth matters. What you want to maximize is not your wealth but your utility function, and you sure as hell are gonna get more from LDT handshakes with aligned superintelligences in saved worlds, if you don't help OpenAI reduce the amount of saved worlds.

Comment by carado-1 on [deleted post] 2024-05-18T17:31:53.297Z

I believe that ChatGPT was not released with the expectation that it would become as popular as it did.

Well, even if that's true, causing such an outcome by accident should still count as evidence of vast irresponsibility imo.

Comment by carado-1 on [deleted post] 2024-05-18T16:18:11.918Z

I'm surprised at people who seem to be updating only now about OpenAI being very irresponsible, rather than updating when they created a giant public competitive market for chatbots (which contains plenty of labs that don't care about alignment at all), thereby reducing how long everyone has to solve alignment. I still parse that move as devastating the commons in order to make a quick buck.

Comment by Tamsin Leake (carado-1) on What do you value ? · 2024-05-10T15:45:00.580Z · LW · GW

I made guesses about my values a while ago, here.

Comment by carado-1 on [deleted post] 2024-05-02T20:40:48.223Z

but that this would be bad if the users aren't one of "us"—you know, the good alignment researchers who want to use AI to take over the universe, totally unlike those evil capabilities researchers who want to use AI to produce economically valuable goods and services.

Rather, " us" — the good alignment researchers who will be careful at all about the long term effects of our actions, unlike capabilities researchers who are happy to accelerate race dynamics and increase p(doom) if they make a quick profit out of it in the short term.

Comment by carado-1 on [deleted post] 2024-05-02T20:30:44.183Z

I am a utilitarian and agree with your comment.

The intent of the post was

to make people weigh whether to publish or not, because I think some people are not weighing this enough
to give some arguments in favor of "you might be systematically overestimating the utility of publishing", because I think some people are doing that

I agree people should take the utilitalianly optimal action, I just think they're doing the utilitarian calculus wrong or not doing the calculus at all.

Comment by carado-1 on [deleted post] 2024-05-02T20:24:59.876Z

I think research that is mostly about outer alignment (what to point the AI to) rather than inner alignment (how to point the AI to it) tends to be good — quantilizers, corrigibility, QACI, decision theory, embedded agency, indirect normativity, infra bayesianism, things like that. Though I could see some of those backfiring the way RLHF did — in the hands of a very irresponsible org, even not very capabilities-related research can be used to accelerate timelines and increase race dynamics if the org doing it thinks it can get a quick buck out of it.

Comment by carado-1 on [deleted post] 2024-05-02T17:33:18.369Z

I don't buy the argument that safety researchers have unusually good ideas/research compared to capability researchers at top labs

I don't think this particularly needs to be true for my point to hold; they only need to have reasonably good ideas/research, not unusually good, for them to publish less to be a positive thing.

That said, if someone hasn't thought at all about concepts like "differentially advancing safety" or "capabilities externalities," then reading this post would probably be helpful, and I'd endorse thinking about those issues.

That's a lot of what I intend to do with this post, yes. I think a lot of people do not think about the impact of publishing very much and just blurt-out/publish things as a default action, and I would like them to think about their actions more.

Comment by carado-1 on [deleted post] 2024-05-02T16:08:44.709Z

One straightforward alternative is to just not do that; I agree it's not very satisfying but it should still be the action that's pursued if it's the one that has more utility.

I wish I had better alternatives, but I don't. But the null action is an alternative.

Comment by carado-1 on [deleted post] 2024-05-02T07:17:31.657Z

It certainly is possible! In more decision-theoritic terms, I'd describe this as "it sure would suck if agents in my reference class just optimized for their own happiness; it seems like the instrumental thing for agents in my reference class to do is maximize for everyone's happiness". Which is probly correct!

But as per my post, I'd describe this position as "not intrinsically altruistic" — you're optimizing for everyone's happiness because "it sure would sure if agents in my reference class didn't do that", not because you intrinsically value that everyone be happy, regardless of reasoning about agents and reference classes and veils of ignorance.

Comment by carado-1 on [deleted post] 2024-04-27T17:25:28.164Z

decision theory is no substitute for utility function

some people, upon learning about decision theories such as LDT and how it cooperates on problems such as the prisoner's dilemma, end up believing the following:

my utility function is about what i want for just me; but i'm altruistic (/egalitarian/cosmopolitan/pro-fairness/etc) because decision theory says i should cooperate with other agents. decision theoritic cooperation is the true name of altruism.

it's possible that this is true for some people, but in general i expect that to be a mistaken analysis of their values.

decision theory cooperates with agents relative to how much power they have, and only when it's instrumental.

in my opinion, real altruism (/egalitarianism/cosmopolitanism/fairness/etc) should be in the utility function which the decision theory is instrumental to. i actually intrinsically care about others; i don't just care about others instrumentally because it helps me somehow.

some important aspects that my utility-function-altruism differs from decision-theoritic-cooperation includes:

i care about people weighed by moral patienthood, decision theory only cares about agents weighed by negotiation power. if an alien superintelligence is very powerful but isn't a moral patient, then i will only cooperate with it instrumentally (for example because i care about the alien moral patients that it has been in contact with); if cooperating with it doesn't help my utility function (which, again, includes altruism towards aliens) then i won't cooperate with that alien superintelligence. corollarily, i will take actions that cause nice things to happen to people even if they've very impoverished (and thus don't have much LDT negotiation power) and it doesn't help any other aspect of my utility function than just the fact that i value that they're okay.
if i can switch to a better decision theory, or if fucking over some non-moral-patienty agents helps me somehow, then i'll happily do that; i don't have goal-content integrity about my decision theory. i do have goal-content integrity about my utility function: i don't want to become someone who wants moral patients to unconsentingly-die or suffer, for example.
there seems to be a sense in which some decision theories are better than others, because they're ultimately instrumental to one's utility function. utility functions, however, don't have an objective measure for how good they are. hence, moral anti-realism is true: there isn't a Single Correct Utility Function.

decision theory is instrumental; the utility function is where the actual intrinsic/axiomatic/terminal goals/values/preferences are stored. usually, i also interpret "morality" and "ethics" as "terminal values", since most of the stuff that those seem to care about looks like terminal values to me. for example, i will want fairness between moral patients intrinsically, not just because my decision theory says that that's instrumental to me somehow.

Comment by Tamsin Leake (carado-1) on LW Frontpage Experiments! (aka "Take the wheel, Shoggoth!") · 2024-04-26T18:52:45.280Z · LW · GW

I would feel better about this if there was something closer to (1) on which to discuss what is probably the most important topic in history (AI alignment). But noted.

Comment by Tamsin Leake (carado-1) on LW Frontpage Experiments! (aka "Take the wheel, Shoggoth!") · 2024-04-26T15:34:01.451Z · LW · GW

I'm generally not a fan of increasing the amount of illegible selection effects.

On the privacy side, can lesswrong guarantee that, if I never click on Recommended, then recombee will never see an (even anonymized) trace of what I browse on lesswrong?

Comment by carado-1 on [deleted post] 2024-04-22T20:07:17.629Z

Here the thing that I'm calling evil is pursuing short-term profits at the cost of non-negligeably higher risk that everyone dies.

Comment by carado-1 on [deleted post] 2024-04-21T17:01:11.669Z

Regardless of how good their alignment plans are, the thing that makes OpenAI unambiguously evil is that they created a strongly marketed public product and, as a result, caused a lot public excitement about AI, and thus lots of other AI capabilities organizations were created that are completely dismissive of safety.

There's just no good reason to do that, except short-term greed at the cost of higher probability that everyone (including people at OpenAI) dies.

(No, "you need huge profits to solve alignment" isn't a good excuse — we had nowhere near exhausted the alignment research that can be done without huge profits.)

Comment by Tamsin Leake (carado-1) on What convincing warning shot could help prevent extinction from AI? · 2024-04-14T06:20:25.187Z · LW · GW

There's also the case of harmful warning shots: for example, if it turns out that, upon seeing an AI do a scary but impressive thing, enough people/orgs/states go "woah, AI is powerful, I should make one!" or "I guess we're doomed anyways, might as well stop thinking about safety and just enjoy making profit with AI while we're still alive", to offset the positive effect. This is totally the kind of thing that could be the case in our civilization.

Comment by carado-1 on [deleted post] 2024-04-14T04:30:06.718Z

There could be a difference but only after a certain point in time, which you're trying to predict / plan for.

Comment by carado-1 on [deleted post] 2024-04-13T20:20:10.761Z

What you propose, ≈"weigh indices by kolmogorov complexity" is indeed a way to go about picking indices, but "weigh indices by one over their square" feels a lot more natural to me; a lot simpler than invoking the universal prior twice.

Comment by carado-1 on [deleted post] 2024-04-13T17:52:44.269Z

If you use the UTMs for cartesian-framed inputs/outputs, sure; but if you're running the programs as entire worlds, then you still have the issue of "where are you in time".

Say there's an infinitely growing conway's-game-of-life program, or some universal program, which contains a copy of me at infinitely many locations. How do I weigh which ones are me?

It doesn't matter that the UTM has a fixed amount of weight, there's still infinitely many locations within it.

Comment by carado-1 on [deleted post] 2024-04-13T11:47:48.333Z

(cross-posted from my blog)

Is quantum phenomena anthropic evidence for BQP=BPP? Is existing evidence against many-worlds?

Suppose I live inside a simulation ran by a computer over which I have some control.

Scenario 1: I make the computer run the following:
```
pause simulation

if is even(calculate billionth digit of pi):
	resume simulation
```
Suppose, after running this program, that I observe that I still exist. This is some anthropic evidence for the billionth digit of pi being even.

Thus, one can get anthropic evidence about logical facts.
Scenario 2: I make the computer run the following:
```
  pause simulation
  
  if is even(calculate billionth digit of pi):
  	resume simulation
  else:
  	resume simulation but run it a trillion times slower
```
If you're running on the non-time-penalized solomonoff prior, then that's no evidence at all — observing existing is evidence that you're being ran, not that you're being ran fast. But if you do that, a bunch of things break including anthropic probabilities and expected utility calculations. What you want is a time-penalized (probably quadratically) prior, in which later compute-steps have less realityfluid than earlier ones — and thus, observing existing is evidence for being computed early — and thus, observing existing is some evidence that the billionth digit of pi is even.
Scenario 3: I make the computer run the following:
```
  pause simulation

  quantum_algorithm <- classical-compute algorithm which simulates quantum algorithms the fastest

  infinite loop:
  	use quantum_algorithm to compute the result of some complicated quantum phenomena

  	compute simulation forwards by 1 step
```
Observing existing after running this program is evidence that BQP=BPP — that is, classical computers can efficiently run quantum algorithms: if BQP≠BPP, then my simulation should become way slower, and existing is evidence for being computed early and fast (see scenario 2).

Except, living in a world which contains the outcome of cohering quantum phenomena (quantum computers, double-slit experiments, etc) is very similar to the scenario above! If your prior for the universe is a programs, penalized for how long they take to run on classical computation, then observing that the outcome of quantum phenomena is being computed is evidence that they can be computed efficiently.
Scenario 4: I make the computer run the following:
```
  in the simulation, give the human a device which generates a sequence of random bits
  pause simulation

  list_of_simulations <- [current simulation state]

  quantum_algorithm <- classical-compute algorithm which simulates quantum algorithms the fastest

  infinite loop:
  	list_of_new_simulations <- []
  	
  	for simulation in list_of_simulations:
  		list_of_new_simulations += 
  			[ simulation advanced by one step where the device generated bit 0,
  			  simulation advanced by one step where the device generated bit 1 ]

  	list_of_simulations <- list_of_new_simulations
```
This is similar to what it's like to being in a many-worlds universe where there's constant forking.

Yes, in this scenario, there is no "mutual destruction", the way there is in quantum. But with decohering everett branches, you can totally build exponentially many non-mutually-destructing timelines too! For example, you can choose to make important life decisions based on the output of the RNG, and end up with exponentially many different lives each with some (exponentially little) quantum amplitude, without any need for those to be compressible together, or to be able to mutually-destruct. That's what decohering means! "Recohering" quantum phenomena interacts destructively such that you can compute the output, but decohering* phenomena just branches.

The amount of different simulations that need to be computed increases exponentially with simulation time.

Observing existing after running this program is very strange. Yes, there are exponentially many me's, but all of the me's are being ran exponentially slowly; they should all not observe existing. I should not be any of them.

This is what I mean by "existing is evidence against many-worlds" — there's gotta be something like an agent (or physics, through some real RNG or through computing whichever variables have the most impact) picking a only-polynomially-large set of decohered non-compressible-together timelines to explain continuing existing.

Some friends tell me "but tammy, sure at step N each you has only 1/2^N quantum amplitude, but at step N there's 2^N such you's, so you still have 1 unit of realityfluid" — but my response is "I mean, I guess, sure, but regardless of that, step N occurs 2^N units of classical-compute-time in the future! That's the issue!".

Some notes:

I heard about pilot wave theory recently, and sure, if that's one way to get single history, why not. I hear that it "doesn't have locality", which like, okay I guess, that's plausibly worse program-complexity wise, but it's exponentially better after accounting for the time penalty.
What if "the world is just Inherently Quantum"? Well, my main answer here is, what the hell does that mean? It's very easy for me to imagine existing inside of a classical computation (eg conway's game of life); I have no idea what it'd mean for me to exist in "one of the exponentially many non-compressible-together decohered exponenially-small-amplitude quantum states that are all being computed forwards". Quadratically-decaying-realityfluid classical-computation makes sense, dammit.
What if it's still true — what if I am observing existing with exponentially little (as a function of the age of the universe) realityfluid? What if the set of real stuff is just that big?

Well, I guess that's vaguely plausible (even though, ugh, that shouldn't be how being real works, I think), but then the tegmark 4 multiverse has to contain no hypotheses in which observers in my reference class occupy more than exponentially little realityfluid.

Like, if there's a conway's-game-of-life simulation out there in tegmark 4, whose entire realityfluid-per-timestep is equivalent to my realityfluid-per-timestep, then they can just bruteforce-generate all human-brain-states and run into mine by chance, and I should have about as much probability of being one of those random generations as I'd have being in this universe — both have exponentially little of their universe's realityfluid! The conway's-game-of-life bruteforced-me has exponentially little realityfluid because she's getting generated exponentially late, and quantum-universe me has exponentially little realityfluid because I occupy exponentially little of the quantum amplitude, at every time-step.

See why that's weird? As a general observer, I should exponentially favor observing being someone who lives in a world where I don't have exponentially little realityfluid, such as "person who lives only-polynomially-late into a conway's-game-of-life, but happened to get randomly very confused about thinking that they might inhabit a quantum world".

Existing inside of a many-worlds quantum universe feels like aliens pranksters-at-orthogonal-angles running the kind of simulation where the observers inside of it to be very anthropically confused once they think about anthropics hard enough. (This is not my belief.)

Comment by Tamsin Leake (carado-1) on LessWrong's (first) album: I Have Been A Good Bing · 2024-04-01T11:56:22.973Z · LW · GW

I didn't see a clear indication in the post about whether the music is AI-generated or not, and I'd like to know; was there an indication I missed?

(I care because I'll want to listen to that music less if it's AI-generated.)

Comment by Tamsin Leake (carado-1) on On expected utility, part 1: Skyscrapers and madmen · 2024-03-31T06:14:21.947Z · LW · GW

Unlike on your blog, the images on the lesswrong version of this post are now broken.

Comment by Tamsin Leake (carado-1) on Orthogonality Thesis seems wrong · 2024-03-25T13:13:59.811Z · LW · GW

Taboo the word "intelligence".

An agent can superhumanly-optimize any utility function. Even if there are objective values, a superhuman-optimizer can ignore them and superhuman-optimize paperclips instead (and then we die because it optimized for that harder than we optimized for what we want).

Comment by carado-1 on [deleted post] 2024-03-21T09:44:37.430Z

(I'm gonna interpret these disagree-votes as "I also don't think this is the case" rather than "I disagree with you tamsin, I think this is the case".)

Comment by carado-1 on [deleted post] 2024-03-20T13:25:50.633Z

I don't think this is the case, but I'm mentioning this possibility because I'm surprised I've never seen someone suggest it before:

Maybe the reason Sam Altman is taking decisions that increase p(doom) is because he's a pure negative utilitarian (and he doesn't know-about/believe-in acausal trade).

Comment by Tamsin Leake (carado-1) on Toki pona FAQ · 2024-03-18T14:40:48.072Z · LW · GW

For writing, there's also jan misali's ASCII toki pona syllabary.

Comment by carado-1 on [deleted post] 2024-03-16T17:56:09.067Z

Reposting myself from discord, on the topic of donating 5000$ to EA causes.

if you're doing alignment research, even just a bit, then the 5000$ are probly better spent on yourself

if you have any gears level model of AI stuff then it's better value to pick which alignment org to give to yourself; charity orgs are vastly understaffed and you're essentially contributing to the "picking what to donate to" effort by thinking about it yourself

if you have no gears level model of AI then it's hard to judge which alignment orgs it's helpful to donate to (or, if giving to regranters, which regranters are good at knowing which alignment orgs to donate to)

as an example of regranters doing massive harm: openphil gave 30M$ to openai at a time where it was critically useful to them, (supposedly in order to have a chair on their board, and look how that turned out when the board tried to yeet altman)

i know of at least one person who was working in regranting and was like "you know what i'd be better off doing alignment research directly" — imo this kind of decision is probly why regranting is so understaffed

it takes technical knowledge to know what should get money, and once you have technical knowledge you realize how much your technical knowledge could help more directly so you do that, or something

Comment by Tamsin Leake (carado-1) on More people getting into AI safety should do a PhD · 2024-03-15T09:41:37.784Z · LW · GW

yes, edited

User info

Posts

Comments

decision theory is no substitute for utility function

Is quantum phenomena anthropic evidence for BQP=BPP? Is existing evidence against many-worlds?