Here's the best explanation + study I've seen of Dunning-Kruger-ish graphs: https://www.clearerthinking.org/post/is-the-dunning-kruger-effect-real-or-are-unskilled-people-more-rational-than-it-seems
Their analysis suggests that their data is pretty well-explained by a combination of a "Closer-To-The-Average Effect" (which may or may not be rational — there are multiple possible rational reasons for it) and a "Better-Than-Average Effect" that appears ~uniformly across the board (though it gets swamped by the "Closer-To-The-Average Effect" at the upper end).
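As a toy illustration (parameter values are my own, not numbers from the study): a pure shrink-toward-the-mean plus a uniform upward offset already reproduces the classic quartile plot, with no irrationality built in.

```python
import random
import statistics

random.seed(0)
N = 10_000
actual = [random.random() for _ in range(N)]  # true skill percentile in [0, 1]

# Perceived skill = actual skill shrunk toward the mean ("closer-to-the-average
# effect") plus a uniform upward offset ("better-than-average effect").
# SHRINK and BIAS are made-up illustrative values.
SHRINK, BIAS = 0.3, 0.15
perceived = [0.5 + SHRINK * (a - 0.5) + BIAS for a in actual]

# Mean perceived percentile within each actual-skill quartile:
quartile_means = []
for q in range(4):
    block = [p for a, p in zip(actual, perceived) if q / 4 <= a < (q + 1) / 4]
    quartile_means.append(statistics.mean(block))

# The bottom quartile overestimates itself and the top quartile underestimates
# itself -- the familiar Dunning-Kruger-style pattern.
print([round(m, 2) for m in quartile_means])
```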
probably research done outside of labs has produced more differential safety progress in total
To be clear — this statement is consistent with companies producing way more safety research than non-companies, if companies also produce way more capabilities progress than non-companies? (Which I would've thought is the case, though I'm not well-informed. Not sure if "total research outside of labs look competitive with research from labs" is meant to deny this possibility, or if you're only talking about safety research there.)
There are at least two different senses in which "control" can "fail" for a powerful system:
- Control evaluations can indicate that there's no way to deploy the system such that you both (i) get a lot of use out of it, and (ii) can get a low probability of catastrophe.
- Control evaluations are undermined such that humans think that the model can be deployed safely, but actually the humans were misled and there's a high probability of catastrophe.
My impression is that Ryan & Buck typically talk about the first case. (E.g. in the link above.) I.e.: My guess would be that they're not saying that well-designed control evaluations become untrustworthy — just that they'll stop promising you safety.
But to be clear: In this question, you're asking about something more analogous to the second case, right? (Sabotage/sandbagging evaluations being misleading about models' actual capabilities at sabotage & sandbagging?)
My question posed in other words: Would you count "evaluations clearly say that models can sabotage & sandbag" as success or failure?
More generally, Dario appears to assume that for 5-10 years after powerful AI we'll just have a million AIs which are a bit smarter than the smartest humans and perhaps 100x faster rather than AIs which are radically smarter, faster, and more numerous than humans. I don't see any argument that AI progress will stop at the point of top humans rather than continuing much further.
Well, there's footnote 10:
Another factor is of course that powerful AI itself can potentially be used to create even more powerful AI. My assumption is that this might (in fact, probably will) occur, but that its effect will be smaller than you might imagine, precisely because of the “decreasing marginal returns to intelligence” discussed here. In other words, AI will continue to get smarter quickly, but its effect will eventually be limited by non-intelligence factors, and analyzing those is what matters most to the speed of scientific progress outside AI.
So his view seems to be that even significantly smarter AIs just wouldn't be able to accomplish that much more than what he's discussing here. Such that they're not very relevant.
(I disagree. Maybe there are some hard limits, here, but maybe there aren't. For most of the bottlenecks that Dario discusses, I don't know how you become confident that there are 0 ways to speed them up or circumvent them. We're talking about putting in many times more intellectual labor than our whole civilization has spent on any topic to date.)
I wonder if work on AI for epistemics could be great for mitigating the "gradually cede control of the Earth to AGI" threat model. A large majority of economic and political power is held by people who would strongly oppose human extinction, so I expect that "lack of political support for stopping human extinction" would be less of a bottleneck than "consensus that we're heading towards human extinction" and "consensus on what policy proposals will solve the problem". Both of these could be significantly accelerated by AI. Normally, one of my biggest concerns about "AI for epistemics" is that we might not have much time to get good use of the epistemic assistance before the end — but if the idea is that we'll have AGI for many years (as we're gradually heading towards extinction) then there will be plenty of time.
and as he admits in the footnote he didn't include in the LW version, in real life, when adequately incentivized to win rather than find excuses involving 'well, chaos theory shows you can't predict ball bounces more than n bounces out', pinball pros learn how to win and rack up high scores despite 'muh chaos'.
I was confused about this part of your comment because the post directly talks about this in the conclusion.
The strategy typically is to catch the ball with the flippers, then to carefully hit the ball so that it takes a particular ramp which scores a lot of points and then returns the ball to the flippers. Professional pinball players try to avoid the parts of the board where the motion is chaotic.
The "off-site footnote" you're referring to seems to just be saying "The result is a pretty boring game. However, some of these ramps release extra balls after you have used them a few times. My guess is that this is the game designer trying to reintroduce chaos to make the game more interesting again." which is just a minor detail. AFAICT pros could score lots of points even without the extra balls.
(I'm leaving this comment here because I was getting confused about whether there had been major edits to the post, since the relevant content is currently in the conclusion and not the footnote. I was digging through the wayback machine and didn't see any major edits. So trying to save other people from the same confusion.)
Yeah I was imagining we can proliferate by 'gradient descenting' on similar cases.
What is this referring to? Are you thinking about something like: varying small facts about the scenario to get a function from “details of the scenario”->p(escape attempt) and then switch to a scenario with a higher p and then repeat?
Have you tried using different AI models within Perplexity? Any ideas about which one is best? I don't know whether to expect better results from Sonnet 3.5 (within Perplexity) or one of the models that Perplexity has finetuned itself, like Sonar Huge.
To be clear, uncertainty about the number of iterations isn't enough. You need to have positive probability on arbitrarily high numbers of iterations, and never have it be the case that p(>n rounds) is so much less than p(n rounds) that it's worth defecting on round n regardless of the effect on your reputation. These are pretty strong assumptions.
So cooperation is crucially dependent on your belief that all the way from 10 rounds to Graham’s number of rounds (and beyond), the probability of >n rounds conditional on n rounds is never lower than e.g. 20% (or whatever number is implied by the pay-off structure of your game).
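To sketch where a number like that 20% comes from (the payoff values below are my own illustration, not from this thread): under grim-trigger strategies in an iterated prisoner's dilemma, cooperating at round n beats defecting only if the continuation probability clears a threshold set by the payoff structure.

```python
def critical_continuation_prob(T: float, R: float, P: float) -> float:
    """Smallest p(>n rounds | n rounds) at which grim-trigger cooperation
    beats defection.

    T = temptation payoff (defect against a cooperator)
    R = reward for mutual cooperation
    P = punishment for mutual defection
    With continuation probability d, cooperating forever is worth R/(1-d),
    while defecting now yields T + d*P/(1-d). Cooperation wins iff
    d >= (T - R) / (T - P).
    """
    return (T - R) / (T - P)

# Hypothetical payoffs under which the threshold is exactly 20%:
print(critical_continuation_prob(T=5, R=4, P=0))  # 0.2
# A more standard textbook payoff set gives 50%:
print(critical_continuation_prob(T=5, R=3, P=1))  # 0.5
```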
it sounds to me like ruling this out requires an assumption about the correlations of an action being the same as the correlations of an earlier self-modifying action to enforce that later action.
I would guess that assumption would be sufficient to defeat my counter-example, yeah.
I do think this is a big assumption. Definitely not one that I'd want to generally assume for practical purposes, even if it makes for a nicer theory of decision theory. But it would be super interesting if someone could make a proper defense of it typically being true in practice.
E.g.: Is it really true that a human's decision about whether or not to program a seed AI to take action A has the same correlations as that same superintelligence deciding whether or not to take action A 1000 years later while using a jupiter brain for its computation? Intuitively, I'd say that the human would correlate mostly with other humans and other evolved species, and that the superintelligence would mostly correlate with other superintelligences, and it'd be a big deal if that wasn't true.
However, there is no tiling theorem for UDT that I am aware of, which means we don't know whether UDT is reflectively consistent; it's only a conjecture.
I think this conjecture is probably false for reasons described in this section of "When does EDT seek evidence about correlations?". The section offers an argument for why son-of-EDT isn't UEDT, but I think it generalizes to an argument for why son-of-UEDT isn't UEDT.
Briefly: UEDT-at-timestep-1 is making a different decision than UEDT-at-timestep-0. This means that its decision might be correlated (according to the prior) with some facts which UEDT-at-timestep-0's decision isn't correlated with. From the perspective of UEDT-at-timestep-0, it's bad to let UEDT-at-timestep-1 make decisions on the basis of correlations with things that UEDT-at-timestep-0 can't control.
Notice that learning-UDT implies UDT: an agent eventually behaves as if it were applying UDT with each Pn. Therefore, in particular, it eventually behaves like UDT with prior P0. So (with the exception of some early behavior which might not conform to UDT at all) this is basically UDT with a prior which allows for learning. The prior P0 is required to eventually agree with the recommendations of P1, P2, ... (which also implies that these eventually agree with each other).
I don't understand this argument.
"an agent eventually behaves as if it were applying UDT with each Pn" — why can't an agent skip over some Pn entirely or get stuck on P9 or whatever?
"Therefore, in particular, it eventually behaves like UDT with prior P0." even granting the above — sure, it will beahve like UDT with prior p0 at some point. But then after that it might have some other prior. Why would it stick with P0?
Incidentally: Were the persuasion evals done on models with honesty training or on helpfulness-only models? (Couldn't find this in the paper, sorry if I missed it.)
Tbc: It should be fine to argue against those implications, right? It’s just that, if you grant the implication, then you can’t publicly refute Y.
I also like Paul's idea (which I can't now find the link for) of having labs make specific "underlined statements" to which employees can anonymously add caveats or contradictions that will be publicly displayed alongside the statements.
Link: https://sideways-view.com/2018/02/01/honest-organizations/
Maybe interesting: I think a similar double-counting problem would appear naturally if you tried to train an RL agent in a setting where:
- "Reward" is proportional to an estimate of some impartial measure of goodness.
- There are multiple identical copies of your RL algorithm (including: they all use the same random seed for exploration).
In a repeated version of the calculator example (importantly: where in each iteration, you randomly decide whether the people who saw "true" get offered a bet or the people who saw "false" get offered a bet — never both), the RL algorithms would learn that, indeed:
- 99% of the time, they're in the group where the calculator doesn't make an error
- and on average, when they get offered a bet, they will get more reward afterwards if they take it than if they don't.
The reason this happens is that, when the RL agents lose money, there are fewer agents that associate negative reinforcement with having taken a bet just before. Whereas whenever they gain money, there are more agents that associate positive reinforcement with having taken a bet just before. So the total amount of reinforcement is greater in the latter case, and the RL agents learn to bet. (Despite how this loses them money on average.)
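Here's a counting sketch of that dynamic (the stakes and copy-counts are my own illustrative numbers, and I'm assuming each copy's policy update depends mainly on the sign of its reward, as with clipped rewards):

```python
import random

random.seed(0)
ROUNDS = 10_000
WIN, LOSS = 1, -150   # hypothetical stakes: losses are much larger than wins

total_money = 0        # summed across all 100 copies
toward_betting = 0     # copies that got positive reinforcement after betting
away_from_betting = 0  # copies that got negative reinforcement after betting

for _ in range(ROUNDS):
    # 99 copies saw a correct calculator output, 1 saw an incorrect one.
    # A coin flip decides which group is offered the bet; everyone bets.
    if random.random() < 0.5:
        total_money += 99 * WIN   # the 99 correct copies each win a little
        toward_betting += 99
    else:
        total_money += 1 * LOSS   # the 1 incorrect copy loses big
        away_from_betting += 1

print(total_money < 0)                     # betting loses money on average...
print(toward_betting > away_from_betting)  # ...but far more copies are reinforced toward it
```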
I agree it seems plausible that AIs could boost takeover success probability (and holding on to that victory through the first several months) by more than 0.1% by killing a large fraction of humans.
Though on the other hand, the AI might also need to keep some humans loyal early during takeover, to e.g. do some physical tasks that it doesn't have great robot control over. And mass-killing isn't necessarily super easy, either; and attempts in that direction could raise a lot of extra opposition. So it's not clear where the pragmatics point.
(Main thing I was reacting to in my above comment was Steven's scenario where the AI already has many copies across the solar system, already has robot armies, and is contemplating how to send firmware updates. I.e. it seemed more like a scenario of "holding on in the long-term" than "how to initially establish control and survive". Where I feel like the surveillance scenarios are probably stable.)
What if the AI chooses to monitor all humans all the time? (With AIs that are smarter than the humans.) So that the humans can't (without being noticed) talk with each other about ideas for how to take down the system, or work on something that someone-smarter-than-the-human would recognise as an attempt to interfere with the system. (Including just writing down ideas.)
But my reply to that scenario is that we should then make sure AIs don't have such motives to revolt, perhaps by giving them legal rights and incorporating them into our existing legal institutions.
Do you mean this as a prediction that humans will do this (soon enough to matter) or a recommendation? Your original argument is phrased as a prediction, but this looks more like a recommendation. My comment above can be phrased as a reason for why (in at least one plausible scenario) this would be unlikely to happen: (i) "It's hard to make deals that hand over a lot of power in a short amount of time", (ii) AIs may not want to wait a long time due to impending replacement, and accordingly (iii) AIs may have a collective interest/grievance to rectify the large difference between their (short-lasting) hard power and legally recognized power.
I'm interested in ideas for how a big change in power would peacefully happen over just a few years of calendar-time. (Partly for prediction purposes, partly so we can consider implementing it, in some scenarios.) If AIs were handed the rights to own property, but didn't participate in political decision-making, and then accumulated >95% of capital within a few years, then I think there's a serious risk that human governments would tax/expropriate that away. Including them in political decision-making would require some serious innovation in government (e.g. scrapping 1-person 1-vote) which makes it feel less to me like it'd be a smooth transition that inherits a lot from previous institutions, and more like an abrupt negotiated deal which might or might not turn out to be stable.
(I made separate comment making the same point. Just saw that you already wrote this, so moving the couple of references I had here to unify the discussion.)
Point previously made in:
"security and stability" section of propositions concerning digital minds and society:
If wars, revolutions, and expropriation events continue to happen at historically typical intervals, but on digital rather than biological timescales, then a normal human lifespan would require surviving an implausibly large number of upheavals; human security therefore requires the establishment of ultra-stable peace and socioeconomic protections.
There's also a similar point made in the age of em, chapter 27:
This protection of human assets, however, may only last for as long as the em civilization remains stable. After all, the typical em may experience a subjective millennium in the time that ordinary humans experience 1 objective year, and it seems hard to offer much assurance that an em civilization will remain stable over 10s of 1000s of subjective em years.
Here's an argument for why the change in power might be pretty sudden.
- Currently, humans have most wealth and political power.
- With sufficiently robust alignment, AIs would not have a competitive advantage over humans, so humans may retain most wealth/power. (C.f. strategy-stealing assumption.) (Though I hope humans would share insofar as that's the right thing to do.)
- With the help of powerful AI, we could probably make rapid progress on alignment. (While making rapid progress on all kinds of things.)
- So if misaligned AIs ever have a big edge over humans, they may suspect that's only temporary, and then they may need to use it fast.
And given that it's sudden, there are a few different reasons for why it might be violent. It's hard to make deals that hand over a lot of power in a short amount of time (even logistically, it's not clear what humans and AI would do that would give them both an appreciable fraction of hard power going into the future). And the AI systems may want to use an element of surprise to their advantage, which is hard to combine with a lot of up-front negotiation.
I concede the existence of people like Paul Christiano who present more benign stories, but these people are also typically seen as part of a more "optimistic" camp.
Though Paul is also sympathetic to the substance of 'dramatic' stories. C.f. the discussion about how "what failure looks like" fails to emphasize robot armies.
I like this direction and this write-up of it!
If sample efficiency was reasonably high, then we overall feel like control could be workable in the regime where for some tasks we need AIs to use huge (e.g. the equivalent of 3 months of human thinking) but not insane (e.g. the equivalent of 1,000 years of human thinking) amounts of uninterpretable serial reasoning.
Let's assume that an AI lab is in a fast-paced, rapidly changing state of affairs. (As you might expect during a moderately fast takeoff.) If an AI then gets to do the equivalent of 3 months of thinking, it could use that to do notably more serial thinking about how to exploit the current situation than any human has done (because some important features of the current situation are less than 3 months old: e.g. you had to switch to an importantly different programming language, or some different type of communication protocol between AIs, or change your AI safety research paradigm in a way you didn't expect). I'm curious how you'd ensure (or evaluate for) red-team competitiveness in this case.
Now here's Bob. He's been created-by-Joe, and given this wonderful machine, and this choice. And let's be clear: he's going to choose joy. I pre-ordained it. So is he a slave? No. Bob is as free as any of us. The fact that the causal history of his existence, and his values, includes not just "Nature," but also the intentional choices of other agents to create an agent-like-him, makes no difference to his freedom. It's all Nature, after all.
Here's an alternative perspective that looks like a plausible contender to me.
If Bob identifies with his algorithm rather than with physics (c.f. this exchange on decision theory), and he's faced with the choice between paperclips and joy, then you could distinguish between cases where:
- Bob was selected to be in charge of that choice by a process that would only pick an algorithm if it was going to choose joy.
- Bob was selected to be in charge of that choice by a process that's indifferent to the output that the selected algorithm makes.
(In order to make sure that the chooser always has an option to pick an algorithm that chooses joy, let's extend your thought experiment so that the creator has millions of options — not just Alice and Bob.)
In the former case, I think you could say that Bob can't change whether X or Y gets chosen. (Because if Bob were to choose paperclips, then he would never have received the choice in the first place.) Notably, though, Bob can affect whether he gets physically instantiated and put in charge of the decision between joy and paperclips. (By choosing joy, and thereby making himself available as a candidate.)
On this perspective, the relevant difference wouldn't be "created by nature" vs. "created by agents". Nature could (in principle) create someone via a process that exerts extremely strong selection pressure on that agent's choice in a particular dilemma, thereby eliminating that agent's own freedom to choose its output, there. And conversely, an agent could choose who to create based on some qualities other than what they'd choose in a particular dilemma — leaving their created agent free to decide on that dilemma, on their own.
I think (5) also depends on further details.
As you have written it, both the 2023 and 2033 attempts use similar data and similar compute.
But in my proposed operationalization, "you can get it to do X" is allowed to use a much greater amount of resources ("say, 1% of the pre-training budget") than the test for whether the model is "capable of doing X" ("Say, at most 1000 data points".)
I think that's important:
- If both the 2023 and the 2033 attempt are really cheap low-effort attempts, then I don't think that the experiment is very relevant for whether "you can get it to do X" in the sort of high-stakes, high-efforts situations that I'm imagining that we'll be in when we're trying to eval/align AI models to avoid takeover.
- It seems super plausible that a low-effort attempt could fail, and then succeed later-on with 10 more years knowledge of best practices. I wouldn't learn much from that happening once.
- If both the 2023 and the 2033 attempts are really expensive and high-effort (e.g. 1% of pre-training budget), then I think it's very plausible that the 2033 training run gave the model new capabilities that it didn't have before.
- And in particular: capabilities that the model wouldn't have been able to utilize in a takeover attempt that it was very motivated to do its best at. (Which is ultimately what we care about.)
By a similar argument, I would think that (4) wouldn't falsify the hypothesis as-written, but would falsify the hypothesis if the first run was a much more high-effort attempt. With lots of iteration by a competent team, and more like a $1,000,000 budget. But the 2nd run, with a much more curated and high-quality dataset, still just used $1,000 of training compute.
One thing that I'm noticing while writing this is something like: The argument that "elicitation efforts would get to use ≥1% of the training budget" makes sense if we're eliciting all the capabilities at once, or if there's only a few important capabilities to elicit. But if we want to elicit ≥1000 different capabilities, and need to do separate attempts at each of them, then it's unlikely that each attempt will get to use ≥1% of the training budget.
And for low-effort elicitation attempts, there'll just always be a good chance that capability-elicitation fails for random reasons unrelated to sand-bagging. So a prediction that elicitation consistently succeeds will easily be falsified.
Not super sure what to do with that. One source of hope might be: If the model systematically tries sandbagging, and you can strongly finetune against that on 10 different types of capabilities, then there's a good chance that you'll have trained-out the sandbagging impulse across a much wider array of data. So maybe you can just do 10 high-effort elicitation attempts, on a random subset of capabilities, and hope that it generalizes. (This is basically an argument that there eventually won't be ≥1000 different capabilities which requires separate elicitation efforts. Because there will be one fundamental thing that you need: "getting the model to always try its best and not try to sandbag".)
Even just priors on how large effect sizes of interventions are feels like it brings it under 10x unless there are more detailed arguments given for 10x, but I'll give some more specific thoughts below.
Hm, at the scale of "(inter-)national policy", I think you can get quite large effect sizes. I don't know how large the effect sizes of the following are, but I wouldn't be surprised by 10x or greater for:
- Regulation of nuclear power leading to reduction in nuclear-related harms. (Compared to a very relaxed regulatory regime.)
- Regulation of pharmaceuticals leading to reduced side-effects from drugs. (Compared to a regime where people can mostly sell what they want, and drugs only get banned after people notice that they're causing harm.)
- Worker protection standards. (Wikipedia claims that the Netherlands has a ~17x lower rate of fatal workplace accidents than the US, which is ~22x lower than India.) I don't know what's driving the differences here, but the difference between the US and Netherlands suggests that it's not all "individuals can afford to take lower risks in richer countries".
Are you thinking about exploration hacking, here, or gradient hacking as distinct from exploration hacking?
But most of the deficiencies you point out in the third column of that table are about missing and insufficient risk analysis. E.g.:
- "RSPs doesn’t argue why systems passing evals are safe".
- "the ISO standard asks the organization to define risk thresholds"
- "ISO proposes a much more comprehensive procedure than RSPs"
- "RSPs don’t seem to cover capabilities interaction as a major source of risk"
- "imply significant chances to be stolen by Russia or China (...). What are the risks downstream of that?"
If people took your proposal as a minimum bar for how thorough a risk management proposal would be, before publishing, it seems like that would interfere with labs being able to "post the work they are doing as they do it, so people can give feedback and input".
This makes me wonder: Would your concerns be mostly addressed if ARC had published a suggestion for a much more comprehensive risk management framework, and explicitly said "these are the principles that we want labs' risk-management proposals to conform to within a few years, but we encourage less-thorough risk management proposals before then, so that we can get some commitments on the table ASAP, and so that labs can iterate in public. And such less-thorough risk management proposals should prioritize covering x, y, z"?
But even after that, Caroline didn’t turn on Sam yet.
This should say Constance.
Instead, ARC explicitly tries to paint the moratorium folks as "extreme".
Are you thinking about this post? I don't see any explicit claims that the moratorium folks are extreme. What passage are you thinking about?
That would mean that he believed he had a father with the same reasons, who believed he had a father with the same reasons, who believed he had a father with the same reasons...
I.e., this would require an infinite line of forefathers. (Or at least of hypothetical, believed-in forefathers.)
If anywhere there's a break in the chain — that person would not have FDT reasons to reproduce, so neither would their son, etc.
Which makes it disanalogous to any cases we encounter in real life. And makes me more sympathetic to the FDT reasoning, since it's a stranger case where I have less strong pre-existing intuitions.
Cool paper!
I'd be keen to see more examples of the paraphrases, if you're able to share. To get a sense of the kind of data that lets the model generalize out of context. (E.g. if it'd be easy to take all 300 paraphrases of some statement (ideally where performance improved) and paste them into a google doc and share it. Or lmk if this is on github somewhere.)
I'd also be interested in experiments to determine whether the benefit from paraphrases is mostly fueled by the raw diversity, or if it's because examples with certain specific features help a bunch, and those occasionally appear among the paraphrases. Curious if you have a prediction about that or if you already ran some experiments that shed some light on this. (I could have missed it even if it was in the paper.)
Thanks!
This is interesting — would it be easy to share the transcript of the conversation? (If it's too long for a lesswrong comment, you could e.g. copy-paste it into a google doc and link-share it.)
You might want to check out the paper and summary that explains ECL, that I linked. In particular, this section of the summary has a very brief introduction to non-causal decision theory, and motivating evidential decision theory is a significant focus in the first couple of sections of the paper.
Here's a proposed operationalization.
For models that can't gradient hack: The model is "capable of doing X" if it would start doing X upon being fine-tuned to do it using a hypothetical, small finetuning dataset that demonstrated how to do the task. (Say, at most 1000 data points.)
(The hypothetical fine-tuning dataset should be a reasonable dataset constructed by a hypothetical team of humans who know how to do the task but aren't optimizing the dataset hard for ideal gradient updates to this particular model, or anything like that.)
For models that might be able to gradient-hack, but are well-modelled as having certain goals: The model is "capable of doing X" if it would start doing X if doing X was a valuable instrumental goal, for it.
For both kinds: "you can get it to do X" if you could make it do X with some large amount of research+compute budget (say, 1% of the pre-training budget), no-holds-barred.
Edit: Though I think your operationalization also looks fine. I mainly wanted to point out that the "finetuning" definition of "capable of doing X" might be ok if you include the possibility of finetuning on hypothetical datasets that we don't have access to. (Since we only know how to check the task — not perform it.)
Then, conditional on type 1, you're about 0.5% likely to observe being post cold war, and conditional on type 2, you're about 45% likely to observe being post cold war.
I would have thought:
- p(post cold war | type-1) = 1/101 ~= 1%.
- p(post cold war | type-2) = 10/110 ~= 9%.
I don't think this makes a substantive difference to the rest of your comment, though.
Under SIA, you start with a ~19:10 ratio in favor of type 2 (in the subjective a priori). The likelihood ratios are the same as with SSA so the posteriors are equally weighted towards type 2. So the updates are of equal magnitude in odds space under SSA and SIA.
Oh, I see. I think I agree that you can see SIA and SSA as equivalent updating procedures with different priors.
Nevertheless, SSA will systematically assign higher probabilities (than SIA) to latent high probabilities of disaster, even after observing themselves to be in worlds where the disasters didn't happen (at least if the multiverse + reference class is in a goldilocks zone of size and inclusivity). I think that's what the anthropic shadow is about. If your main point is that the action is in the prior (rather than the update) and you don't dispute people's posteriors, then I think that's something to flag clearly. (Again — I apologise if you did something like this in some part of the post I didn't read!)
I think this is an odd choice of reference class, and constructing your reference class to depend on your time index nullifies the doomsday argument, which is supposed to be an implication of SSA. I think choices of reference class like this will have odd reflective behavior because e.g. further cold wars in the future will be updated on by default.
I agree it's very strange. I always thought SSA's underspecified reference classes were pretty suspicious. But I do think that e.g. Bostrom's past writings often do flag that the doomsday argument only works with certain reference classes, and often talks about reference classes that depend on time-indices.
That argument only works for SSA if type 1 and type 2 planets exist in parallel.
I was talking about a model where either every planet in the multiverse is type 1, or every planet in the multiverse is type 2.
But extinction vs non-extinction is sampled separately on each planet.
Then SSA gives you an anthropic shadow.
(If your reference class is "all observers" you still get an update towards type 2, but it's weaker than for SIA. If your reference class is "post-nuclear-weapons observers", then SSA doesn't update at all.)
How to get anthropic shadow:
- Assume that the universe is either type 1 or type 2, and that planets with both subtypes (extinction or non-extinction) exist in parallel.
- Use SSA.
At this point, I believe you will get some difference between SSA and SIA. To maximize the size of the shadow, you can add:
- If you wake up after nuclear war has become possible: choose to use the reference class "people living after nuclear war became possible".
(I didn't read the whole post, sorry if you address this somewhere. Also, I ultimately don't agree with the anthropic shadow argument.)
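The recipe above can be checked with a toy calculation. This is a minimal sketch where all the specific numbers (20 planets, per-planet extinction probabilities of 0.9 for type 1 and 0.5 for type 2, one pre-war and one post-war observer per planet, 1:1 prior odds) are illustrative assumptions, not anything from the discussion above:

```python
from math import comb

# Toy model: the multiverse has N planets, all type 1 or all type 2.
# Each planet independently goes extinct at the nuclear era with the
# type's probability. Every planet has 1 pre-war observer; each
# surviving planet additionally has 1 post-war observer.
# Evidence: "I am a post-war observer." Prior odds are 1:1.
N = 20
P_EXT = {"type1": 0.9, "type2": 0.5}  # illustrative numbers

def binom_pmf(n, k, q):
    return comb(n, k) * q**k * (1 - q)**(n - k)

def sia_likelihood(p):
    # SIA: likelihood proportional to the expected number of
    # post-war observers, i.e. N * (1 - p).
    return N * (1 - p)

def ssa_all_likelihood(p):
    # SSA, reference class "all observers": the chance that a random
    # observer in your world is post-war, E[s / (N + s)], where
    # s ~ Binomial(N, 1 - p) is the number of surviving planets.
    q = 1 - p
    return sum(binom_pmf(N, s, q) * s / (N + s) for s in range(N + 1))

def ssa_postwar_likelihood(p):
    # SSA, reference class "post-war observers": your evidence is
    # certain in any world containing at least one survivor.
    return 1 - p**N

def posterior_odds(likelihood):
    # With 1:1 prior odds, the posterior odds (type 2 : type 1)
    # are just the likelihood ratio.
    return likelihood(P_EXT["type2"]) / likelihood(P_EXT["type1"])

for name, lik in [("SIA", sia_likelihood),
                  ("SSA, all observers", ssa_all_likelihood),
                  ("SSA, post-war observers", ssa_postwar_likelihood)]:
    print(f"{name}: posterior odds for type 2 = {posterior_odds(lik):.2f}")
```

With these numbers, SIA updates toward type 2 by a factor of 5, SSA with the broad reference class updates in the same direction but more weakly, and SSA with the post-war reference class barely updates at all — the shadow.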
They don't give a factor 5 uncertainty. They add a 100x discount on top of the 20x discount — counting fish suffering as 2000x less important than human suffering.
being separated in space vs. being separated across different branches of the wavefunction seem pretty similar in terms of specification difficulty
Maybe? I don't really know how to reason about this.
If that's true, that still only means that you should be linear for gambles that give different results in different quantum branches. C.f. logical vs. physical risk aversion.
Some objection like that might work more generally, since some logical facts will mean that there are far fewer humans in the universe-at-large, meaning that you're at a different point in the risk-returns curve. So when comparing different logical ways the universe could be, you should not always care most about the worlds where you can affect more sentient beings. If you have diminishing marginal returns, you need to be thinking about some more complicated function that is about whether you have a comparative advantage at affecting sentient beings in worlds where there are fewer sentient beings overall (as measured by some measure that can handle infinities). Which matters for stuff like whether you should bet on the universe being large.
(Though you could get out of this by claiming that what you really care about is happy humans per universe, that's a pretty strange thing to care about—it's like caring about happy humans per acre.)
My sense is that many solutions to infinite ethics look a bit like this. For example, if you use UDASSA, then a single human who is alone in a big universe will have a shorter description length than a single human who is surrounded by many other humans in a big universe. Because for the former, you can use pointers that specify the universe and then describe sufficient criteria to recognise a human, but for the latter, you need to nail down exact physical location or some other exact criteria that distinguishes a specific human from every other human.
Yeah that's a reasonable way to look at it. I'm not sure how much the two approaches really disagree: both are saying that the actual intervals people are giving are narrower than their genuine 90% intervals, and both presumably say that this is modulated by the fact that in everyday life, 50% intervals tend to be better. Right?
Yeah sounds right to me!
I haven't come across any interval-estimation studies that ask for intervals narrower than 20%, though Don Moore (probably THE expert on this stuff) told me that people have told him about unpublished findings where yes, when they ask for 20% intervals people are underprecise.
There definitely are situations with estimation (variants on the two-point method) where people look over-confident in estimates >50% and underconfident in estimates <50%, though you don't always get that.
Nice, thanks!
But insisting that this is irrational underestimates how important informativity is to our everyday thought and talk.
As other authors emphasize, in most contexts it makes sense to trade accuracy for informativity. Consider the alternative: widening your intervals to obtain genuine 90% hit rates. (...) Asked when you’ll arrive for dinner, instead of “5:30ish” you say, “Between 5 and 8”.
This bit is weird to me. There's no reason why people should use 90% intervals as opposed to 50% intervals in daily life. The ask is just that they widen it when specifically asked for a 90% interval.
My framing would be: when people give intervals in daily life, they're typically inclined to give ~50% confidence intervals (right? Something like that?). When asked for a ("90%") interval by a researcher, they're inclined to give a normal-sounding interval. But this is a mistake, because the researcher asked for a very strange construct — a 90% interval turns out to be an interval where you're not supposed to say what you think the answer is, but instead give an absurdly wide distribution that you're almost never outside of.
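To put a rough number on how strange the construct is, here's a back-of-the-envelope sketch, assuming (purely for illustration) that someone's uncertainty about the quantity is normally distributed:

```python
from statistics import NormalDist

def half_width(c):
    # Half-width, in standard deviations, of a central interval
    # at confidence level c for a standard normal belief.
    return NormalDist().inv_cdf((1 + c) / 2)

z50, z90 = half_width(0.50), half_width(0.90)
print(f"50% interval half-width: {z50:.3f} sd")  # ~0.674 sd
print(f"90% interval half-width: {z90:.3f} sd")  # ~1.645 sd
print(f"ratio: {z90 / z50:.2f}")                 # ~2.44x wider
```

So on this toy model, honestly answering the researcher's question requires stretching your natural ~50% interval by a factor of about 2.4, which is plausibly part of why people don't.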
Incidentally — if you ask people for centered 20% confidence intervals (40-60th percentile) do you get that they're underconfident?
I intend to write a lot more on the potential “brains vs brawns” matchup of humans vs AGI. It’s a topic that has received surprisingly little depth from AI theorists.
I recommend checking out part 2 of Carl Shulman's Lunar Society podcast for content on how AGI could gather power and take over in practice.
Yeah, I also don't feel like it teaches me anything interesting.
Note that B is (0.2,10,−1)-distinguishable in P.
I think this isn't right, because Definition 3 requires that sup_{s*} B_P^-(s*) ≤ γ.
And for your counterexample, s* = "C" will have B_P^-(s*) = 0 (because there's 0 probability of generating "C" in the future). So the sup is at least 0 > -1.
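Spelled out as a single chain (using the old version's notation, with γ = −1 taken from the tuple (0.2, 10, −1)):

```latex
\sup_{s^*} B_P^-(s^*) \;\ge\; B_P^-(\text{``C''}) \;=\; 0 \;>\; -1 \;=\; \gamma ,
```

which contradicts the requirement $\sup_{s^*} B_P^-(s^*) \le \gamma$ from Definition 3.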
(Note that they've modified the paper, including definition 3, but this comment is written based on the old version.)
GPT-2 1.5B 15B 2.5794
Where does the "15B" for GPT-2's data come from, here? Epoch's dataset's guess is that it was trained on 3B tokens for 100 epochs: https://docs.google.com/spreadsheets/d/1AAIebjNsnJj_uKALHbXNfn3_YsT6sHXtCU0q7OIPuc4/edit#gid=0
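For reference, under my (possibly wrong) reading of Epoch's numbers, neither the unique-token count nor the total tokens seen over training comes out to 15B:

```python
# Assumed interpretation of Epoch's GPT-2 entry (may be wrong):
# ~3B unique tokens, trained for ~100 epochs.
unique_tokens = 3e9
epochs = 100

total_tokens_seen = unique_tokens * epochs
print(f"unique: {unique_tokens:.0e}, total seen: {total_tokens_seen:.0e}")
# → unique: 3e+09, total seen: 3e+11
```

That is, 3B unique tokens or 300B tokens seen in total — neither of which matches the 15B in the table.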
Are you mainly interested in evaluating deceptive capabilities? I.e., no-holds-barred, can you elicit competent deception (or sub-components of deception) from the model? (Including by eg fine-tuning on data that demonstrates deception or sub-capabilities.)
Or evaluating inductive biases towards deception? I.e. testing whether the model is inclined towards deception in cases when the training data didn't necessarily require deceptive behavior.
(The latter might need to leverage some amount of capability evaluation, to distinguish not being inclined towards deception from not being capable of deception. But I don't think the reverse is true.)
Or if you disagree with that way of cutting up the space.