Posts

What's important in "AI for epistemics"? 2024-08-24T01:27:06.771Z
Project ideas: Backup plans & Cooperative AI 2024-01-08T17:19:33.181Z
Project ideas: Sentience and rights of digital minds 2024-01-07T17:34:58.942Z
Project ideas: Epistemics 2024-01-05T23:41:23.721Z
Project ideas: Governance during explosive technological growth 2024-01-04T23:51:56.407Z
Non-alignment project ideas for making transformative AI go well 2024-01-04T07:23:13.658Z
Memo on some neglected topics 2023-11-11T02:01:55.834Z
Implications of evidential cooperation in large worlds 2023-08-23T00:43:45.232Z
PaLM-2 & GPT-4 in "Extrapolating GPT-N performance" 2023-05-30T18:33:40.765Z
Some thoughts on automating alignment research 2023-05-26T01:50:20.099Z
Before smart AI, there will be many mediocre or specialized AIs 2023-05-26T01:38:41.562Z
PaLM in "Extrapolating GPT-N performance" 2022-04-06T13:05:12.803Z
Truthful AI: Developing and governing AI that does not lie 2021-10-18T18:37:38.325Z
OpenAI: "Scaling Laws for Transfer", Hernandez et al. 2021-02-04T12:49:25.704Z
Prediction can be Outer Aligned at Optimum 2021-01-10T18:48:21.153Z
Extrapolating GPT-N performance 2020-12-18T21:41:51.647Z
Formalising decision theory is hard 2019-08-23T03:27:24.757Z
Quantifying anthropic effects on the Fermi paradox 2019-02-15T10:51:04.298Z

Comments

Comment by Lukas Finnveden (Lanrian) on Should you go with your best guess?: Against precise Bayesianism and related views · 2025-02-01T05:29:38.008Z · LW · GW

Also, my sense is that many people are making decisions based on similar intuitions to the ones you have (albeit with much less of a formal argument for how this can be represented or why it's reasonable). In particular, my impression is that people who are uncompelled by longtermism (despite being compelled by some type of scope-sensitive consequentialism) are often driven by an aversion to very non-robust EV-estimates.

Comment by Lukas Finnveden (Lanrian) on Should you go with your best guess?: Against precise Bayesianism and related views · 2025-02-01T05:25:46.380Z · LW · GW

If I were to write the case for this in my own words, it might be something like:

  • There are many different normative criteria we should give some weight to.
  • One of them is "maximizing EV according to moral theory A".
  • But maximizing EV is an intuitively less appealing normative criterion when (i) it's super unclear and non-robust what credences we ought to put on certain propositions, and (ii) the recommended decision is very different depending on what our exact credences on those propositions are.
  • So in such cases, as a matter of ethics, you might have the intuition that you should give less weight to "maximize EV according to moral theory A" and more weight to e.g.:
    • Deontic criteria that don't use EV.
    • EV-maximizing according to moral theory B (where B's recommendations are less sensitive to the propositions that are difficult to put robust credences on).
    • EV-maximizing within a more narrow "domain", ignoring the effects outside of that "domain". (Where the effects within that "domain" are less sensitive to the propositions that are difficult to put robust credences on).

I like this formulation because it seems pretty arbitrary to me where you draw the boundary between a credence that you include in your representor vs. not. (Like: What degree of justification is enough? We'll always have the problem of induction to provide some degree of arbitrariness.) But if we put this squarely in the domain of ethics, I'm less fussed about this, because I'm already sympathetic to being pretty anti-realist about ethics, and there being some degree of arbitrariness in choosing what you care about. (And I certainly feel some intuitive aversion to making choices based on very non-robust credences, and it feels interesting to interpret that as an ~ethical intuition.)

Comment by Lukas Finnveden (Lanrian) on Winning isn't enough · 2025-02-01T04:23:55.727Z · LW · GW

Just to confirm, this means that the thing I put in quotes would probably end up being dynamically inconsistent? In order to avoid that, I need to put in an additional step of also ruling out plans that would be dominated from some constant prior perspective? (It’s a good point that these won’t be dominated from my current perspective.)

Comment by Lukas Finnveden (Lanrian) on Winning isn't enough · 2025-01-31T05:31:18.196Z · LW · GW

One upshot of this is that you can follow an explicitly non-(precise-)Bayesian decision procedure and still avoid dominated strategies. For example, you might explicitly specify beliefs using imprecise probabilities and make decisions using the “Dynamic Strong Maximality” rule, and still be immune to sure losses. Basically, Dynamic Strong Maximality tells you which plans are permissible given your imprecise credences, and you just pick one. And you could do this “picking” using additional substantive principles. Maybe you want to use another rule for decision-making with imprecise credences (e.g., maximin expected utility or minimax regret). Or maybe you want to account for your moral uncertainty (e.g., picking the plan that respects more deontological constraints).

Let's say Alice has imprecise credences. Let's say Alice follows the algorithm: "At each time-step t, I will use 'Dynamic Strong Maximality' to find all plans that aren't dominated. I will pick between them using [some criteria]. Then I will take the action that plan recommends." (And then at the next timestep t+1, she re-does everything I just said in the quotes.)

If Alice does this, does she end up being dynamically inconsistent? (Vulnerable to Dutch books etc.)

(Maybe it varies depending on the criteria. I'm interested if you have a hunch for what the answer will be for the sort of criteria you listed: maximin expected utility, minimax regret, picking the plan that respects more deontological constraints.)

I.e., I'm interested in: If you want to use dynamic strong maximality to avoid dominated strategies, does that require you to either have the ability to commit to a plan or the inclination to consistently pick your plan from some prior epistemic perspective (like an "updateless" agent might)? Or do you automatically avoid dominated strategies even if you're constantly recomputing your plan?

Comment by Lukas Finnveden (Lanrian) on The Game Board has been Flipped: Now is a good time to rethink what you’re doing · 2025-01-30T00:17:42.586Z · LW · GW

if the trend toward long periods of internal-only deployment continues

Have we seen such a trend so far? I would have thought the trend to date was neutral or towards shorter periods of internal-only deployment.

Tbc, not really objecting to your list of reasons why this might change in the future. One thing I'd add to it is that even if calendar-time deployment delays don't change, the gap in capabilities inside vs. outside AI companies could increase a lot if AI speeds up the pace of AI progress.

ETA: Dario Amodei says "Sonnet's training was conducted 9-12 months ago". He doesn't really clarify whether he's talking about the "old" or "new" 3.5. Old and new Sonnet were released in mid-June and mid-October, so 7 and 3 months ago respectively. Combining the 3 vs. 7 months options with the 9-12 months range implies 2, 5, 6, or 9 months of keeping it internal. I think for GPT-4, pretraining ended in August and it was released in March, so that's 7 months from pre-training to release. So that's probably on the slower side of Claude possibilities if Dario was talking about pre-training ending 9-12 months ago. But probably faster than Claude if Dario was talking about post-training finishing that early.
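
To make those combinations explicit, here's a quick sketch of the arithmetic, using only the numbers above:

```python
# Rough arithmetic for the Claude 3.5 Sonnet internal-deployment gap, using only
# the numbers quoted above (training ended 9-12 months ago; the new/old 3.5
# Sonnet were released ~3 and ~7 months ago).
months_since_training_ended = [9, 12]
months_since_release = {"new 3.5 Sonnet": 3, "old 3.5 Sonnet": 7}

for label, since_release in months_since_release.items():
    for since_training in months_since_training_ended:
        internal_months = since_training - since_release
        print(f"{label}: trained {since_training} mo ago, released "
              f"{since_release} mo ago -> ~{internal_months} mo internal")
# Gives 6 or 9 months internal (new 3.5) and 2 or 5 months (old 3.5), vs. ~7
# months for GPT-4 (pretraining ended in August, released the following March).
```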

Comment by Lukas Finnveden (Lanrian) on ryan_greenblatt's Shortform · 2025-01-22T20:16:55.214Z · LW · GW

Taking it all together, I think you should put more probability on the software-only singularity, mostly because of capability improvements being much more significant than you assume.

I'm confused — I thought you put significantly less probability on software-only singularity than Ryan does? (Like half?) Maybe you were using a different bound for the number of OOMs of improvement?

Comment by Lukas Finnveden (Lanrian) on ryan_greenblatt's Shortform · 2025-01-21T20:20:06.066Z · LW · GW

In practice, we'll be able to get slightly better returns by spending some of our resources investing in speed-specific improvements and in improving productivity rather than in reducing cost. I don't currently have a principled way to estimate this (though I expect something roughly principled can be found by looking at trading off inference compute and training compute), but maybe I think this improves the returns to around .

Interesting comparison point: Tom thought this would give a way larger boost in his old software-only singularity appendix.

When considering an "efficiency only singularity", some different estimates get him r~=1; r~=1.5; r~=1.6. (Where r is defined so that "for each x% increase in cumulative R&D inputs, the output metric will increase by r*x". The condition for increasing returns is r>1.)

Whereas when including capability improvements:

I said I was 50-50 on an efficiency only singularity happening, at least temporarily. Based on these additional considerations I’m now at more like ~85% on a software only singularity. And I’d guess that initially r = ~3 (though I still think values as low as 0.5 or as high as 6 as plausible). There seem to be many strong ~independent reasons to think capability improvements would be a really huge deal compared to pure efficiency problems, and this is borne out by toy models of the dynamic. 

Though note that later in the appendix he adjusts down from 85% to 65% due to some further considerations. Also, last I heard, Tom was more like 25% on software singularity. (ETA: Or maybe not? See other comments in this thread.)
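
For readers wondering why r>1 is the relevant threshold, here is a minimal sketch, assuming a toy model in which software output gets reinvested as research input (my simplification, not Tom's exact setup):

```latex
% Toy model: software level S has elasticity r with respect to cumulative
% research inputs I, and automated research makes inputs grow in proportion to S.
S \propto I^{r}, \qquad \frac{dI}{dt} \propto S
\;\;\Longrightarrow\;\; \frac{dI}{dt} \propto I^{r}
% For r > 1 this blows up in finite time (accelerating, "singularity"-like growth);
% for r = 1 growth is exponential; for r < 1 it slows to a power law.
```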

Comment by Lukas Finnveden (Lanrian) on ryan_greenblatt's Shortform · 2025-01-21T20:06:09.263Z · LW · GW

Based on some guesses and some poll questions, my sense is that capabilities researchers would operate about 2.5x slower if they had 10x less compute (after adaptation)

Can you say roughly who the people surveyed were? (And if this was their raw guess or if you've modified it.)

I saw some polls from Daniel previously where I wasn't sold that they were surveying people working on the most important capability improvements, so wondering if these are better.

Also, somewhat minor, but: I'm slightly concerned that surveys will overweight areas where labor is more useful relative to compute (because those areas should have disproportionately many humans working on them) and therefore be somewhat biased in the direction of labor being important.

Comment by Lukas Finnveden (Lanrian) on ryan_greenblatt's Shortform · 2025-01-21T18:23:04.360Z · LW · GW

Hm — what are the "plausible interventions" that would stop China from having >25% probability of takeover if no other country could build powerful AI? Seems like you either need to count a delay as successful prevention, or you need to have a pretty low bar for "plausible", because it seems extremely difficult/costly to prevent China from developing powerful AI in the long run. (Where they can develop their own supply chains, put manufacturing and data centers underground, etc.)

Comment by Lukas Finnveden (Lanrian) on ryan_greenblatt's Shortform · 2025-01-21T05:47:07.968Z · LW · GW

Is there some reason why current AI isn't TCAI by your definition?

(I'd guess that the best way to rescue your notion is to stipulate that the TCAIs must have >25% probability of taking over themselves. Possibly with assistance from humans, possibly by manipulating other humans who think they're being assisted by the AIs — but ultimately the original TCAIs should be holding the power in order for it to count. That would clearly exclude current systems. But I don't think that's how you meant it.)

Comment by Lukas Finnveden (Lanrian) on ryan_greenblatt's Shortform · 2025-01-21T03:08:53.841Z · LW · GW

I'm not sure if the definition of takeover-capable-AI (abbreviated as "TCAI" for the rest of this comment) in footnote 2 quite makes sense. I'm worried that too much of the action is in "if no other actors had access to powerful AI systems", and not that much action is in the exact capabilities of the "TCAI". In particular: Maybe we already have TCAI (by that definition) because if a frontier AI company or a US adversary was blessed with the assumption "no other actor will have access to powerful AI systems", they'd have a huge advantage over the rest of the world (as soon as they develop more powerful AI), plausibly implying that it'd be right to forecast a >25% chance of them successfully taking over if they were motivated to try.

And this seems somewhat hard to disentangle from stuff that is supposed to count according to footnote 2, especially: "Takeover via the mechanism of an AI escaping, independently building more powerful AI that it controls, and then this more powerful AI taking over would" and "via assisting the developers in a power grab, or via partnering with a US adversary". (Or maybe the scenario in the 1st paragraph is supposed to be excluded because current AI isn't agentic enough to "assist"/"partner" with allies, as opposed to just being used as a tool?)

 

What could a competing definition be? Thinking about what we care most about... I think two events especially stand out to me:

  • When would it plausibly be catastrophically bad for an adversary to steal an AI model?
  • When would it plausibly be catastrophically bad for an AI to be power-seeking and non-controlled?

Maybe a better definition would be to directly talk about these two events? So for example...

  1. "Steal is catastrophic" would be true if...
    1. "Frontier AI development projects immediately acquire good enough security to keep future model weights secure" has significantly less probability of AI-assisted takeover than
    2. "Frontier AI development projects immediately have their weights stolen, and then acquire security that's just as good as in (1a)."[1]
  2. "Power-seeking and non-controlled is catastrophic" would be true if...
    1. "Frontier AI development projects immediately acquire good enough judgment about power-seeking-risk that they henceforth choose to not deploy any model that would've been net-negative for them to deploy" has significantly less probability of AI-assisted takeover than
    2. "Frontier AI development acquire the level of judgment described in (2a) 6 months later."[2]

Where "significantly less probability of AI-assisted takeover" could be e.g. at least 2x less risk.

  1. ^

    The motivation for assuming "future model weights secure" in both (1a) and (1b) is so that the downside of getting the model weights stolen imminently isn't nullified by the fact that they're very likely to get stolen a bit later, regardless. Because many interventions that would prevent model weight theft this month would also help prevent it in future months. (And also, we can't contrast 1a'="model weights are permanently secure" with 1b'="model weights get stolen and are then default-level-secure", because that would already have a really big effect on takeover risk, purely via the effect on future model weights, even though current model weights probably aren't that important.)

  2. ^

    The motivation for assuming "good future judgment about power-seeking-risk" is similar to the motivation for assuming "future model weights secure" above. The motivation for choosing "good judgment about when to deploy vs. not" rather than "good at aligning/controlling future models" is that a big threat model is "misaligned AIs outcompete us because we don't have any competitive aligned AIs, so we're stuck between deploying misaligned AIs and being outcompeted" and I don't want to assume away that threat model.

Comment by Lukas Finnveden (Lanrian) on What are the strongest arguments for very short timelines? · 2025-01-10T18:03:32.920Z · LW · GW

Ok, gotcha.

It's that she didn't accept the reasoning behind that number enough to really believe it. She added a discount factor based on fallacious reasoning around "if it were that easy, it'd be here already".

Just to clarify: There was no such discount factor that changed the median estimate of "human brain compute". Instead, this discount factor was applied to go from "human brain compute estimate" to "human-brain-compute-informed estimate of the compute-cost of training TAI with current algorithms" — adjusting for how our current algorithms seem to be worse than those used to run the human brain. (As you mention and agree with, although I infer that you expect algorithmic progress to be faster than Ajeya did at the time.) The most relevant section is here.

Comment by Lukas Finnveden (Lanrian) on Before smart AI, there will be many mediocre or specialized AIs · 2025-01-06T14:14:09.943Z · LW · GW

I suspect there's a cleaner way to make this argument that doesn't talk much about the number of "token-equivalents", but instead contrasts "total FLOP spent on inference" with some combination of:

  • "FLOP until human-interpretable information bottleneck". While models still think in English, and doesn't know how to do steganography, this should be FLOP/forward-pass. But it could be much longer in the future, e.g. if the models get trained to think in non-interpretable ways and just outputs a paper written in English once/week.
  • "FLOP until feedback" — how many FLOP of compute does the model do before it outputs an answer and gets feedback on it?
    • Models will probably be trained on a mixture of different regimes here. E.g.: "FLOP until feedback" being proportional to model size during pre-training (because it gets feedback after each token) and then also being proportional to chain-of-thought length during post-training.
    • So if you want to collapse it to one metric, you'd want to somehow weight by number of data-points and sample efficiency for each type of training.
  • "FLOP until outcome-based feedback" — same as above, except only counting outcome-based feedback rather than process-based feedback, in the sense discussed in this comment.

Having higher "FLOP until X" (for each of the X in the 3 bullet points) seems to increase danger. While increasing "total FLOP spent on inference" seems to have a much better ratio of increased usefulness : increased danger.

 

In this framing, I think:

  • Based on what we saw of o1's chain-of-thoughts, I'd guess it hasn't changed "FLOP until human-interpretable information bottleneck", but I'm not sure about that.
  • It seems plausible that o1/o3 uses RL, and that the models think for much longer before getting feedback. This would increase "FLOP until feedback".
  • Not sure what type of feedback they use. I'd guess that the most outcome-based thing they do is "executing code and seeing whether it passes test".

Comment by Lukas Finnveden (Lanrian) on Before smart AI, there will be many mediocre or specialized AIs · 2025-01-06T13:04:52.197Z · LW · GW

It's possible that "many mediocre or specialized AIs" is, in practice, a bad summary of the regime with strong inference scaling. Maybe people's associations with "lots of mediocre thinking" end up being misleading.

Comment by Lukas Finnveden (Lanrian) on Before smart AI, there will be many mediocre or specialized AIs · 2025-01-06T13:00:12.873Z · LW · GW

Thanks!

I agree that we've learned interesting new things about inference speeds. I don't think I would have anticipated that at the time.

Re:

It seems that spending more inference compute can (sometimes) be used to qualitatively and quantitatively improve capabilities (e.g., o1, recent swe-bench results, arc-agi) rather than merely doing more work in parallel. Thus, it's not clear that the relevant regime will look like "lots of mediocre thinking".[1]

There are versions of this that I'd still describe as "lots of mediocre thinking" —adding up to being similarly useful as higher-quality thinking.

(C.f. above from the post: "the collective’s intelligence will largely come from [e.g.] Individual systems 'thinking' for a long time, churning through many more explicit thoughts than a skilled human would need to solve a problem" & "Assuming that much of this happens 'behind the scenes', a human interacting with this system might just perceive it as a single super-smart AI.")

The most relevant question is whether we'll still get the purported benefits of the lots-of-mediocre-thinking-regime if there's strong inference scaling. I think we probably do.

Paraphrasing my argument in the "Implications" section:

  • If we don't do much end-to-end training of models thinking a lot, then supervision will be pretty easy. (Even if the models think for a long time, it will all be in English, and each leap-of-logic will be weak compared to what the human supervisors can do.)
  • End-to-end training of models thinking a lot is expensive. So maybe we won't do it by default, or maybe it will be an acceptable alignment tax to avoid it. (Instead favoring "process-based" methods as the term is used in this post.)
  • Even if we do end-to-end training of models thinking a lot, the model's "thinking" might still remain pretty interpretable to humans in practice.
  • If models produce good recommendations by thinking a lot in either English or something similar to English, then there ought to be a translation/summary of that argument which humans can understand. Then, even if we're giving the models end-to-end feedback, we could give them feedback based on whether humans recognize the argument as good, rather than by testing the recommendation and seeing whether it leads to good results in the real world. (This comment discusses this distinction. Confusingly, this is sometimes referred to as "process-based feedback" as opposed to "outcomes-based feedback", despite it being slightly different from the concept two bullet points up.)

I think o3 results might involve enough end-to-end training to mostly contradict the hopes of bullet points 1-2. But I'd guess it doesn't contradict 3-4.

(Another caveat that I didn't have in the post is that it's slightly trickier to supervise mediocre serial thinking than mediocre parallel thinking, because you may not be able to evaluate a random step in the middle without loading up on earlier context. But my guess is that you could train AIs to help you with this without adding too much extra risk.)

Comment by Lukas Finnveden (Lanrian) on What are the strongest arguments for very short timelines? · 2024-12-25T19:11:40.214Z · LW · GW

One argument I have been making publicly is that I think Ajeya's Bioanchors report greatly overestimated human brain compute. I think a more careful reading of Joe Carlsmith's report that hers was based on supports my own estimates of around 1e15 FLOPs.

Am I getting things mixed up, or isn’t that just exactly Ajeya’s median estimate? Quote from the report: ”Under this definition, my median estimate for human brain computation is ~1e15 FLOP/s.”

https://docs.google.com/document/d/1IJ6Sr-gPeXdSJugFulwIpvavc0atjHGM82QjIfUSBGQ/edit

Comment by Lukas Finnveden (Lanrian) on Anthropic leadership conversation · 2024-12-22T07:11:55.583Z · LW · GW

We did the the 80% pledge thing, and that was like a thing that everybody was just like, "Yes, obviously we're gonna do this."

Does anyone know what this is referring to? (Maybe a pledge to donate 80%? If so, curious about 80% of what & under what conditions.)

Comment by Lukas Finnveden (Lanrian) on Ayn Rand’s model of “living money”; and an upside of burnout · 2024-12-02T00:00:42.793Z · LW · GW

Related: The monkey and the machine by Paul Christiano. (Bottom-up processes ~= monkey. Verbal planner ~= deliberator. Section IV talks about the deliberator building trust with the monkey.)

A difference between this essay and Paul's is that this one seems to lean further towards "a good state is one where the verbal planner ~only spends attention on things that the bottom-up processes care about", whereas Paul's essay suggests a compromise where the deliberator gets to spend a good chunk of attention on things that the monkey doesn't care about. (In Rand's metaphor, I guess this would be like using some of your investment returns for consumption. Where consumption would presumably count as a type of dead money, although the connotations don't feel exactly right, so maybe it should be in a 3rd bucket.)

Comment by Lukas Finnveden (Lanrian) on Double-Dipping in Dunning--Kruger · 2024-10-27T00:08:36.800Z · LW · GW

Here's the best explanation + study I've seen of Dunning-Kruger-ish graphs: https://www.clearerthinking.org/post/is-the-dunning-kruger-effect-real-or-are-unskilled-people-more-rational-than-it-seems

Their analysis suggests that their data is pretty well-explained by a combination of a "Closer-To-The-Average Effect" (which may or may not be rational — there are multiple possible rational reasons for it) and a "Better-Than-Average Effect" that appears ~uniformly across the board (but gets swamped by the "Closer-To-The-Average Effect" at the upper end).

Comment by Lukas Finnveden (Lanrian) on IAPS: Mapping Technical Safety Research at AI Companies · 2024-10-25T17:14:30.511Z · LW · GW

probably research done outside of labs has produced more differential safety progress in total

To be clear — this statement is consistent with companies producing way more safety research than non-companies, if companies also produce proportionally even more capabilities progress than non-companies? (Which I would've thought is the case, though I'm not well-informed. Not sure if "total research outside of labs look competitive with research from labs" is meant to deny this possibility, or if you're only talking about safety research there.)

Comment by Lukas Finnveden (Lanrian) on Sabotage Evaluations for Frontier Models · 2024-10-21T04:21:30.557Z · LW · GW

There are at least two different senses in which "control" can "fail" for a powerful system:

  • Control evaluations can indicate that there's no way to deploy the system such that you both (i) get a lot of use out of it, and (ii) can get a low probability of catastrophe.
  • Control evaluations are undermined such that humans think that the model can be deployed safely, but actually the humans were misled and there's a high probability of catastrophe.

My impression is that Ryan & Buck typically talk about the first case. (E.g. in the link above.) I.e.: My guess would be that they're not saying that well-designed control evaluations become untrustworthy — just that they'll stop promising you safety.

But to be clear: In this question, you're asking about something more analogous to the second case, right? (Sabotage/sandbagging evaluations being misleading about models' actual capabilities at sabotage & sandbagging?)

My question posed in other words: Would you count "evaluations clearly say that models can sabotage & sandbag" as success or failure?

Comment by Lukas Finnveden (Lanrian) on Daniel Kokotajlo's Shortform · 2024-10-16T15:51:03.885Z · LW · GW

https://cdn.openai.com/spec/model-spec-2024-05-08.html

Comment by Lukas Finnveden (Lanrian) on Dario Amodei — Machines of Loving Grace · 2024-10-13T22:37:05.094Z · LW · GW

More generally, Dario appears to assume that for 5-10 years after powerful AI we'll just have a million AIs which are a bit smarter than the smartest humans and perhaps 100x faster rather than AIs which are radically smarter, faster, and more numerous than humans. I don't see any argument that AI progress will stop at the point of top humans rather continuing much further.

Well, there's footnote 10:

Another factor is of course that powerful AI itself can potentially be used to create even more powerful AI. My assumption is that this might (in fact, probably will) occur, but that its effect will be smaller than you might imagine, precisely because of the “decreasing marginal returns to intelligence” discussed here. In other words, AI will continue to get smarter quickly, but its effect will eventually be limited by non-intelligence factors, and analyzing those is what matters most to the speed of scientific progress outside AI.

So his view seems to be that even significantly smarter AIs just wouldn't be able to accomplish that much more than what he's discussing here. Such that they're not very relevant.

(I disagree. Maybe there are some hard limits, here, but maybe there's not. For most of the bottlenecks that Dario discusses, I don't know how you become confident that there are 0 ways to speed them up or circumvent them. We're talking about putting in many times more intellectual labor than our whole civilization has spent on any topic to date.)

Comment by Lukas Finnveden (Lanrian) on My motivation and theory of change for working in AI healthtech · 2024-10-13T20:53:15.307Z · LW · GW

I wonder if work on AI for epistemics could be great for mitigating the "gradually cede control of the Earth to AGI" threat model. A large majority of economic and political power is held by people who would strongly oppose human extinction, so I expect that "lack of political support for stopping human extinction" would be less of a bottleneck than "consensus that we're heading towards human extinction" and "consensus on what policy proposals will solve the problem". Both of these could be significantly accelerated by AI. Normally, one of my biggest concerns about "AI for epistemics" is that we might not have much time to get good use of the epistemic assistance before the end — but if the idea is that we'll have AGI for many years (as we're gradually heading towards extinction) then there will be plenty of time.

Comment by Lukas Finnveden (Lanrian) on Applications of Chaos: Saying No (with Hastings Greer) · 2024-09-23T05:32:08.567Z · LW · GW

and as he admits in the footnote he didn't include in the LW version, in real life, when adequately incentivized to win rather than find excuses involving 'well, chaos theory shows you can't predict ball bounces more than n bounces out', pinball pros learn how to win and rack up high scores despite 'muh chaos'.

I was confused about this part of your comment because the post directly talks about this in the conclusion.

The strategy typically is to catch the ball with the flippers, then to carefully hit the balls so that it takes a particular ramp which scores a lot of points and then returns the ball to the flippers. Professional pinball players try to avoid the parts of the board where the motion is chaotic.

The "off-site footnote" you're referring to seems to just be saying "The result is a pretty boring game. However, some of these ramps release extra balls after you have used them a few times. My guess is that this is the game designer trying to reintroduce chaos to make the game more interesting again." which is just a minor detail. AFAICT pros could score lots of points even without the extra balls.

(I'm leaving this comment here because I was getting confused about whether there had been major edits to the post, since the relevant content is currently in the conclusion and not the footnote. I was digging through the wayback machine and didn't see any major edits. So trying to save other people from the same confusion.)

Comment by Lukas Finnveden (Lanrian) on Would catching your AIs trying to escape convince AI developers to slow down or undeploy? · 2024-09-07T15:47:33.076Z · LW · GW

Yeah I was imagining we can proliferate by 'gradient descenting' on similar cases.

What is this referring to? Are you thinking about something like: varying small facts about the scenario to get a function from "details of the scenario" -> p(escape attempt), and then switching to a scenario with a higher p, and then repeating?

Comment by Lukas Finnveden (Lanrian) on Perplexity wins my AI race · 2024-08-24T19:52:31.877Z · LW · GW

Have you tried using different AI models within Perplexity? Any ideas about which one is best? I don't know whether to expect better results from Sonnet 3.5 (within Perplexity) or one of the models that Perplexity has finetuned itself, like Sonar Huge.

Comment by Lukas Finnveden (Lanrian) on An AI Race With China Can Be Better Than Not Racing · 2024-08-24T15:31:48.909Z · LW · GW

To be clear, uncertainty about the number of iterations isn’t enough. You need to have positive probability on arbitrarily high numbers of iterations, and never have it be the case that p(>n rounds) is so much less than p(n rounds) that it’s worth defecting on round n regardless of the effect on your reputation. These are pretty strong assumptions.

So cooperation is crucially dependent on your belief that all the way from 10 rounds to Graham’s number of rounds (and beyond), the probability of >n rounds conditional on n rounds is never lower than e.g. 20% (or whatever number is implied by the pay-off structure of your game).
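
For concreteness, here's the standard derivation of that kind of threshold, assuming prisoner's-dilemma payoffs and grim-trigger strategies (my choice of setup, not something specified above):

```latex
% T > R > P: temptation, mutual-cooperation, and mutual-defection payoffs.
% \delta = p(> n \text{ rounds} \mid n \text{ rounds}), assumed constant for simplicity.
\underbrace{\frac{R}{1-\delta}}_{\text{cooperate forever}}
\;\ge\;
\underbrace{T + \frac{\delta P}{1-\delta}}_{\text{defect once, then face defection}}
\quad\Longleftrightarrow\quad
\delta \;\ge\; \frac{T-R}{T-P}
% E.g. T=5, R=3, P=1 gives \delta \ge 0.5; a 20% threshold corresponds to
% payoffs with (T-R)/(T-P) = 0.2.
```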

Comment by Lukas Finnveden (Lanrian) on In Defense of Open-Minded UDT · 2024-08-12T20:29:20.453Z · LW · GW

it sounds to me like ruling this out requires an assumption about the correlations of an action being the same as the correlations of an earlier self-modifying action to enforce that later action.

I would guess that assumption would be sufficient to defeat my counter-example, yeah.

I do think this is a big assumption. Definitely not one that I'd want to generally assume for practical purposes, even if it makes for a nicer theory of decision theory. But it would be super interesting if someone could make a proper defense of it typically being true in practice.

E.g.: Is it really true that a human's decision about whether or not to program a seed AI to take action A has the same correlations as that same superintelligence deciding whether or not to take action A 1000 years later while using a jupiter brain for its computation? Intuitively, I'd say that the human would correlate mostly with other humans and other evolved species, and that the superintelligence would mostly correlate with other superintelligences, and it'd be a big deal if that wasn't true.

Comment by Lukas Finnveden (Lanrian) on In Defense of Open-Minded UDT · 2024-08-12T19:18:01.406Z · LW · GW

However, there is no tiling theorem for UDT that I am aware of, which means we don't know whether UDT is reflectively consistent; it's only a conjecture.

I think this conjecture is probably false for reasons described in this section of "When does EDT seek evidence about correlations?". The section offers an argument for why son-of-EDT isn't UEDT, but I think it generalizes to an argument for why son-of-UEDT isn't UEDT.

Briefly: UEDT-at-timestep-1 is making a different decision than UEDT-at-timestep-0. This means that its decision might be correlated (according to the prior) with some facts which UEDT-at-timestep-0's decision isn't correlated with. From the perspective of UEDT-at-timestep-0, it's bad to let UEDT-at-timestep-1 make decisions on the basis of correlations with things that UEDT-at-timestep-0 can't control.

Comment by Lukas Finnveden (Lanrian) on In Defense of Open-Minded UDT · 2024-08-12T19:06:13.972Z · LW · GW

Notice that learning-UDT implies UDT: an agent eventually behaves as if it were applying UDT with each Pn. Therefore, in particular, it eventually behaves like UDT with prior P0. So (with the exception of some early behavior which might not conform to UDT at all) this is basically UDT with a prior which allows for learning. The prior P0 is required to eventually agree with the recommendations of P1, P2, ... (which also implies that these eventually agree with each other).

I don't understand this argument.

"an agent eventually behaves as if it were applying UDT with each Pn" — why can't an agent skip over some Pn entirely or get stuck on P9 or whatever?

"Therefore, in particular, it eventually behaves like UDT with prior P0." even granting the above — sure, it will beahve like UDT with prior p0 at some point. But then after that it might have some other prior. Why would it stick with P0?

Comment by Lukas Finnveden (Lanrian) on DeepMind: Evaluating Frontier Models for Dangerous Capabilities · 2024-08-08T18:12:26.636Z · LW · GW

Incidentally: Were the persuasion evals done on models with honesty training or on helpfulness-only models? (Couldn't find this in the paper, sorry if I missed it.)

Comment by Lukas Finnveden (Lanrian) on Dragon Agnosticism · 2024-08-02T02:35:44.313Z · LW · GW

Tbc: It should be fine to argue against those implications, right? It’s just that, if you grant the implication, then you can’t publicly refute Y.

Comment by Lukas Finnveden (Lanrian) on Non-Disparagement Canaries for OpenAI · 2024-06-05T00:39:11.663Z · LW · GW

I also like Paul's idea (which I can't now find the link for) of having labs make specific "underlined statements" to which employees can anonymously add caveats or contradictions that will be publicly displayed alongside the statements

Link: https://sideways-view.com/2018/02/01/honest-organizations/

Comment by Lukas Finnveden (Lanrian) on EDT with updating double counts · 2024-04-09T03:03:25.747Z · LW · GW

Maybe interesting: I think a similar double-counting problem would appear naturally if you tried to train an RL agent in a setting where:

  • "Reward" is proportional to an estimate of some impartial measure of goodness.
  • There are multiple identical copies of your RL algorithm (including: they all use the same random seed for exploration).

In a repeated version of the calculator example (importantly: where in each iteration, you randomly decide whether the people who saw "true" get offered a bet or the people who saw "false" get offered a bet — never both), the RL algorithms would learn that, indeed:

  • 99% of the time, they're in the group where the calculator doesn't make an error
  • and on average, when they get offered a bet, they will get more reward afterwards if they take it than if they don't.

The reason that this happens is because, when the RL agents lose money, there's fewer agents that associate negative reinforcement with having taken a bet just-before. Whereas whenever they gain money, there's more agents that associate positive reinforcement with having taken a bet just-before. So the total amount of reinforcement is greater in the latter case, so the RL agents learn to bet. (Despite how this loses them money on average.)
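
Here's a minimal toy simulation of that dynamic, in case it's useful. The 99/1 split matches the example above, but the payoffs, the "reward = total money across copies" choice, and the REINFORCE-style update are my illustrative assumptions:

```python
import math
import random

# Toy version of the repeated calculator example. Each round there are 100
# identical copies sharing one policy parameter (a logit for "take the bet")
# and one random seed, so they all act the same way. 99 copies see a correct
# calculator reading, 1 copy sees a wrong one; a coin flip decides which group
# is offered the bet. Each copy that acts is reinforced with the *total* money
# change across copies (the "impartial measure of goodness"). The payoffs are
# chosen so that betting loses money on average, yet the summed reinforcement
# still pushes the shared policy towards betting.

WIN, LOSS = 1.0, -150.0   # per-bettor payoff if their reading was right/wrong
N_CORRECT, N_WRONG = 99, 1
LEARNING_RATE = 1e-5
N_ROUNDS = 20_000

def sigmoid(x: float) -> float:
    return 1 / (1 + math.exp(-x))

logit = 0.0               # shared policy parameter; p(bet) = sigmoid(logit)
total_money = 0.0
rng = random.Random(0)

for _ in range(N_ROUNDS):
    p_bet = sigmoid(logit)
    bet = rng.random() < p_bet               # shared seed: all copies act alike
    correct_group_offered = rng.random() < 0.5
    n_offered = N_CORRECT if correct_group_offered else N_WRONG

    money = n_offered * (WIN if correct_group_offered else LOSS) if bet else 0.0
    total_money += money

    # REINFORCE update, summed over the n_offered copies that saw the bet.
    # d(log p(action))/d(logit) is (1 - p_bet) for "bet", (-p_bet) for "pass".
    grad_log_p = (1 - p_bet) if bet else -p_bet
    logit += LEARNING_RATE * n_offered * money * grad_log_p

print(f"final p(bet) = {sigmoid(logit):.2f}")                     # drifts towards 1
print(f"average money per round = {total_money / N_ROUNDS:.1f}")  # negative
```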

Comment by Lukas Finnveden (Lanrian) on ryan_greenblatt's Shortform · 2024-02-07T01:38:09.059Z · LW · GW

I agree it seems plausible that AIs could boost takeover success probability (and holding on to that victory through the first several months) by more than 0.1% by killing a large fraction of humans.

Though on the other hand, the AI might also need to keep some humans loyal early during takeover, to e.g. do some physical tasks that it doesn't have great robot control over. And mass-killing isn't necessarily super easy, either; and attempts in that direction could raise a lot of extra opposition. So it's not clear where the pragmatics point.

(Main thing I was reacting to in my above comment was Steven's scenario where the AI already has many copies across the solar system, already has robot armies, and is contemplating how to send firmware updates. I.e. it seemed more like a scenario of "holding on in the long-term" than "how to initially establish control and survive". Where I feel like the surveillance scenarios are probably stable.)

Comment by Lukas Finnveden (Lanrian) on ryan_greenblatt's Shortform · 2024-02-07T00:19:35.990Z · LW · GW

What if the AI chooses to monitor all humans all the time? (With AIs that are smarter than the humans.) So that the humans can't (without being noticed) talk with each other about ideas for how to take down the system, or work on something that someone-smarter-than-the-human would recognise as an attempt to interfere with the system. (Including just writing down ideas.)

Comment by Lukas Finnveden (Lanrian) on Matthew Barnett's Shortform · 2024-01-29T19:42:21.771Z · LW · GW

But my reply to that scenario is that we should then make sure AIs don't have such motives to revolt, perhaps by giving them legal rights and incorporating them into our existing legal institutions.

Do you mean this as a prediction that humans will do this (soon enough to matter) or a recommendation? Your original argument is phrased as a prediction, but this looks more like a recommendation. My comment above can be phrased as a reason for why (in at least one plausible scenario) this would be unlikely to happen: (i) "It's hard to make deals that hand over a lot of power in a short amount of time", (ii) AIs may not want to wait a long time due to impending replacement, and accordingly (iii) AIs may have a collective interest/grievance to rectify the large difference between their (short-lasting) hard power and legally recognized power.

I'm interested in ideas for how a big change in power would peacefully happen over just a few years of calendar-time. (Partly for prediction purposes, partly so we can consider implementing it, in some scenarios.) If AIs were handed the rights to own property, but didn't participate in political decision-making, and then accumulated >95% of capital within a few years, then I think there's a serious risk that human governments would tax/expropriate that away. Including them in political decision-making would require some serious innovation in government (e.g. scrapping 1-person 1-vote) which makes it feel less to me like it'd be a smooth transition that inherits a lot from previous institutions, and more like an abrupt negotiated deal which might or might not turn out to be stable.

Comment by Lukas Finnveden (Lanrian) on Matthew Barnett's Shortform · 2024-01-28T23:11:00.633Z · LW · GW

(I made a separate comment making the same point. Just saw that you already wrote this, so moving the couple of references I had here to unify the discussion.)

Point previously made in:

"security and stability" section of propositions concerning digital minds and society:

If wars, revolutions, and expropriation events continue to happen at historically typical intervals, but on digital rather than biological timescales, then a normal human lifespan would require surviving an implausibly large number of upheavals; human security therefore requires the establishment of ultra-stable peace and socioeconomic protections.

There's also a similar point made in The Age of Em, chapter 27:

This protection of human assets, however, may only last for as long as the em civilization remains stable. After all, the typical em may experience a subjective millennium in the time that ordinary humans experience 1 objective year, and it seems hard to offer much assurance that an em civilization will remain stable over 10s of 1000s of subjective em years.

Comment by Lukas Finnveden (Lanrian) on Matthew Barnett's Shortform · 2024-01-28T22:40:18.755Z · LW · GW

Here's an argument for why the change in power might be pretty sudden.

  • Currently, humans have most wealth and political power.
  • With sufficiently robust alignment, AIs would not have a competitive advantage over humans, so humans may retain most wealth/power. (C.f. strategy-stealing assumption.) (Though I hope humans would share insofar as that's the right thing to do.)
  • With the help of powerful AI, we could probably make rapid progress on alignment. (While making rapid progress on all kinds of things.)
  • So if misaligned AIs ever have a big edge over humans, they may suspect that's only temporary, and then they may need to use it fast.

And given that it's sudden, there are a few different reasons for why it might be violent. It's hard to make deals that hand over a lot of power in a short amount of time (even logistically, it's not clear what humans and AI would do that would give them both an appreciable fraction of hard power going into the future). And the AI systems may want to use an element of surprise to their advantage, which is hard to combine with a lot of up-front negotiation.

Comment by Lukas Finnveden (Lanrian) on Matthew Barnett's Shortform · 2024-01-28T22:07:05.872Z · LW · GW

I concede the existence of people like Paul Christiano who present more benign stories, but these people are also typically seen as part of a more "optimistic" camp.

Though Paul is also sympathetic to the substance of 'dramatic' stories. C.f. the discussion about how "what failure looks like" fails to emphasize robot armies. 

Comment by Lukas Finnveden (Lanrian) on The case for ensuring that powerful AIs are controlled · 2024-01-24T18:54:24.456Z · LW · GW

I like this direction and this write-up of it!

If sample efficiency was reasonably high, then we overall feel like control could be workable in the regime where for some tasks we need AIs to use huge (e.g. the equivalent of 3 months of human thinking) but not insane (e.g. the equivalent of 1,000 years of human thinking) amounts of uninterpretable serial reasoning.

Let's assume that an AI lab is in a fast-paced, rapidly changing state of affairs. (As you might expect during a moderately fast takeoff.) If an AI then gets to do the equivalent of 3 months of thinking, it could use that to do notably more serial thinking about how to exploit the current situation than any human has done (because some important features of the current situation are less than 3 months old: e.g. you had to switch to an importantly different programming language, or some different type of communication protocol between AIs, or change your AI safety research paradigm in a way you didn't expect). I'm curious how you'd ensure (or evaluate for) red-team competitiveness in this case.

Comment by Lukas Finnveden (Lanrian) on On the abolition of man · 2024-01-20T22:05:52.006Z · LW · GW

Now here's Bob. He's been created-by-Joe, and given this wonderful machine, and this choice. And let's be clear: he's going to choose joy. I pre-ordained it. So is he a slave? No. Bob is as free as any of us. The fact that the causal history of his existence, and his values, includes not just "Nature," but also the intentional choices of other agents to create an agent-like-him, makes no difference to his freedom. It's all Nature, after all.

Here's an alternative perspective that looks like a plausible contender to me.

If Bob identifies with his algorithm rather than with physics (c.f. this exchange on decision theory), and he's faced with the choice between paperclips and joy, then you could distinguish between cases where:

  • Bob was selected to be in charge of that choice by a process that would only pick an algorithm if it was going to choose joy.
  • Bob was selected to be in charge of that choice by a process that's indifferent to the output that the selected algorithm makes.

(In order to make sure that the chooser always has an option to pick an algorithm that chooses joy, let's extend your thought experiment so that the creator has millions of options — not just Alice and Bob.)

In the former case, I think you could say that Bob can't change whether X or Y gets chosen. (Because if Bob were to choose paperclips, then he would never have received the choice in the first place.) Notably, though, Bob can affect whether he gets physically instantiated and put in charge of the decision between joy and paperclips. (By choosing joy, and thereby making himself available as a candidate.)

On this perspective, the relevant difference wouldn't be "created by nature" vs. "created by agents". Nature could (in principle) create someone via a process that exerts extremely strong selection pressure on that agent's choice in a particular dilemma, thereby eliminating that agent's own freedom to choose its output there. And conversely, an agent could choose who to create based on some qualities other than what they'd choose in a particular dilemma — leaving their created agent free to decide on that dilemma on their own.

Comment by Lukas Finnveden (Lanrian) on The “no sandbagging on checkable tasks” hypothesis · 2023-12-05T22:25:15.858Z · LW · GW

I think (5) also depends on further details.

As you have written it, both the 2023 and 2033 attempts use similar data and similar compute.

But in my proposed operationalization, "you can get it to do X" is allowed to use a much greater amount of resources ("say, 1% of the pre-training budget") than the test for whether the model is "capable of doing X" ("Say, at most 1000 data points".)

I think that's important:

  • If both the 2023 and the 2033 attempts are really cheap low-effort attempts, then I don't think that the experiment is very relevant for whether "you can get it to do X" in the sort of high-stakes, high-effort situations that I'm imagining we'll be in when we're trying to eval/align AI models to avoid takeover.
    • It seems super plausible that a low-effort attempt could fail, and then succeed later-on with 10 more years knowledge of best practices. I wouldn't learn much from that happening once.
  • If both the 2023 and the 2033 attempts are really expensive and high-effort (e.g. 1% of pre-training budget), then I think it's very plausible that the 2033 training run gave the model new capabilities that it didn't have before.
    • And in particular: capabilities that the model wouldn't have been able to utilize in a takeover attempt that it was very motivated to do its best at. (Which is ultimately what we care about.)

 

By a similar argument, I would think that (4) wouldn't falsify the hypothesis as-written, but would falsify the hypothesis if the first run was a much more high-effort attempt. With lots of iteration by a competent team, and more like a $1,000,000 budget. But the 2nd run, with a much more curated and high-quality dataset, still just used $1,000 of training compute.

 

One thing that I'm noticing while writing this is something like: The argument that "elicitation efforts would get to use ≥1% of the training budget" makes sense if we're eliciting all the capabilities at once, or if there's only a few important capabilities to elicit. But if we want to elicit ≥1000 different capabilities, and need to do separate attempts at each of them, then it's unlikely that each attempt will get to use ≥1% of the training budget.

And for low-effort elicitation attempts, there'll just always be a good chance that capability-elicitation fails for random reasons unrelated to sand-bagging. So a prediction that elicitation consistently succeeds will easily be falsified.

Not super sure what to do with that. One source of hope might be: If the model systematically tries sandbagging, and you can strongly finetune against that on 10 different types of capabilities, then there's a good chance that you'll have trained out the sandbagging impulse across a much wider array of data. So maybe you can just do 10 high-effort elicitation attempts, on a random subset of capabilities, and hope that it generalizes. (This is basically an argument that there eventually won't be ≥1000 different capabilities which require separate elicitation efforts. Because there will be one fundamental thing that you need: "getting the model to always try its best and not try to sandbag".)

Comment by Lukas Finnveden (Lanrian) on Thoughts on responsible scaling policies and regulation · 2023-10-28T03:05:28.903Z · LW · GW

Even just priors on how large effect sizes of interventions are feels like it brings it under 10x unless there are more detailed arguments given for 10x, but I'll give some more specific thoughts below.

Hm, at the scale of "(inter-)national policy", I think you can get quite large effect sizes. I don't know how large the effect sizes of the following are, but I wouldn't be surprised by 10x or greater for:

  • Regulation of nuclear power leading to reduction in nuclear-related harms. (Compared to a very relaxed regulatory regime.)
  • Regulation of pharmaceuticals leading to reduced side-effects from drugs. (Compared to a regime where people can mostly sell what they want, and drugs only get banned after people notice that they're causing harm.)
  • Worker protection standards. (Wikipedia claims that the Netherlands has a ~17x lower rate of fatal workplace accidents than the US, which is ~22x lower than India.) I don't know what's driving the differences here, but the difference between the US and Netherlands suggests that it's not all "individuals can afford to take lower risks in richer countries".

Comment by Lukas Finnveden (Lanrian) on AI as a science, and three obstacles to alignment strategies · 2023-10-27T22:43:27.687Z · LW · GW

Are you thinking about exploration hacking, here, or gradient hacking as distinct from exploration hacking?

Comment by Lukas Finnveden (Lanrian) on Responsible Scaling Policies Are Risk Management Done Wrong · 2023-10-26T18:26:00.562Z · LW · GW

But most of the deficiencies you point out in the third column of that table are about missing and insufficient risk analysis. E.g.:

  • "RSPs doesn’t argue why systems passing evals are safe".
  • "the ISO standard asks the organization to define risk thresholds"
  • "ISO proposes a much more comprehensive procedure than RSPs"
  • "RSPs don’t seem to cover capabilities interaction as a major source of risk"
  • "imply significant chances to be stolen by Russia or China (...). What are the risks downstream of that?"

If people took your proposal as a minimum bar for how thorough a risk management proposal would be, before publishing, it seems like that would interfere with labs being able to "post the work they are doing as they do it, so people can give feedback and input".

This makes me wonder: Would your concerns be mostly addressed if ARC had published a suggestion for a much more comprehensive risk management framework, and explicitly said "these are the principles that we want labs' risk-management proposals to conform to within a few years, but we encourage less-thorough risk management proposals before then, so that we can get some commitments on the table ASAP, and so that labs can iterate in public. And such less-thorough risk management proposals should prioritize covering x, y, z."

Comment by Lukas Finnveden (Lanrian) on Book Review: Going Infinite · 2023-10-25T05:05:08.976Z · LW · GW

But even after that, Caroline didn’t turn on Sam yet.

This should say Constance.

Comment by Lukas Finnveden (Lanrian) on RSPs are pauses done right · 2023-10-15T18:00:06.957Z · LW · GW

Instead, ARC explicitly tries to paint the moratorium folks as "extreme".

Are you thinking about this post? I don't see any explicit claims that the moratorium folks are extreme. What passage are you thinking about?

Comment by Lukas Finnveden (Lanrian) on Eliezer Yudkowsky Is Frequently, Confidently, Egregiously Wrong · 2023-09-14T17:04:43.142Z · LW · GW

That would mean that he believed he had a father with the same reasons, who believed he had a father with the same reasons, who believed he had a father with the same reasons...

I.e., this would require an infinite line of forefathers. (Or at least of hypothetical, believed-in forefathers.)

If anywhere there's a break in the chain — that person would not have FDT reasons to reproduce, so neither would their son, etc.

Which makes it disanalogous from any cases we encounter in real life. And makes me more sympathetic to the FDT reasoning, since it's a stranger case where I have less strong pre-existing intuitions.