Opinion merging for AI control 2023-05-04T02:43:51.196Z
Is it worth avoiding detailed discussions of expectations about agency levels of powerful AIs? 2023-03-16T03:06:25.719Z
How likely are malign priors over objectives? [aborted WIP] 2022-11-11T05:36:11.060Z
When can a mimic surprise you? Why generative models handle seemingly ill-posed problems 2022-11-05T13:19:37.384Z
There's probably a tradeoff between AI capability and safety, and we should act like it 2022-06-09T00:17:24.722Z
Is evolutionary influence the mesa objective that we're interested in? 2022-05-03T01:18:06.927Z
[Cross-post] Half baked ideas: defining and measuring Artificial Intelligence system effectiveness 2022-04-05T00:29:16.992Z
Are there any impossibility theorems for strong and safe AI? 2022-03-11T01:41:01.184Z
Counterfactuals from ensembles of peers 2022-01-04T07:01:06.196Z


Comment by David Johnston (david-johnston) on How LLMs are and are not myopic · 2023-07-27T11:54:49.557Z · LW · GW

I can't speak for janus, but my interpretation was that this is due to a capacity budget meaning it can be favourable to lose a bit of accuracy on token n if you gain more on n+m. I agree some examples would be great.

Comment by David Johnston (david-johnston) on A Defense of Work on Mathematical AI Safety · 2023-07-06T22:56:37.924Z · LW · GW

there are strong arguments that control of strongly superhuman AI systems will not be amenable to prosaic alignment

In which section of the linked paper is the strong argument for this conclusion to be found? I had a quick read of it but could not see it - I skipped the long sections of quotes, as the few I read were claims rather than arguments.

Comment by David Johnston (david-johnston) on Contra Anton 🏴‍☠️ on Kolmogorov complexity and recursive self improvement · 2023-07-01T09:54:00.123Z · LW · GW

I don’t disagree with any of what you say here - I just read Anton as assuming we have a program on that frontier

Comment by David Johnston (david-johnston) on Contra Anton 🏴‍☠️ on Kolmogorov complexity and recursive self improvement · 2023-07-01T01:39:38.769Z · LW · GW

The mistake here is the assumption that a program that models the world better necessarily has a higher Kolmogorov complexity.

I think Anton assumes that we have the simplest program that predicts the world to a given standard, in which case this is not a mistake. He doesn't explicitly say so, though, so I think we should wait for clarification.

But it's a strange assumption; I don't see why the minimum complexity predictor couldn't carry out what we would interpret as RSI in the process of arriving at its prediction.

Comment by David Johnston (david-johnston) on Contra Anton 🏴‍☠️ on Kolmogorov complexity and recursive self improvement · 2023-06-30T11:26:29.002Z · LW · GW

I think he’s saying “suppose p1 is the shortest program that gets at most loss $L$. If p2 gets loss $L' < L$, then we must require a longer string than p1 to express p2, and p1 therefore cannot express p2”.

This seems true, but I don’t understand its relevance to recursive self improvement.

Comment by David Johnston (david-johnston) on A "weak" AGI may attempt an unlikely-to-succeed takeover · 2023-06-29T01:57:30.221Z · LW · GW

I think it means that whatever you get is conservative in cases where it's unsure of whether it's in training, which may translate to being conservative where it's unsure of success in general.

I agree it doesn't rule out an AI that takes a long shot at takeover! But whatever cognition we posit that the AI executes, it has to yield very high training performance. So AIs that think they have a very short window for influence or are less-than-perfect at detecting training environments are ruled out.

Comment by David Johnston (david-johnston) on A "weak" AGI may attempt an unlikely-to-succeed takeover · 2023-06-29T00:49:36.769Z · LW · GW

An AI that wants something and is too willing to take low-probability shots at takeover (or just wielding influence) would get trained away, no?

What I mean is, however it makes decisions, it has to be compatible with very high training performance.

Comment by David Johnston (david-johnston) on Uncertainty about the future does not imply that AGI will go well · 2023-06-10T11:27:22.600Z · LW · GW

If I can make my point a bit more carefully: I don’t think this post successfully surfaces the bits of your model that hypothetical Bob doubts. The claim that “historical accidents are a good reference class for existential catastrophe” is the primary claim at issue. If they were a good reference class, very high risk would obviously be justified, in my view.

Given that your post misses this, I don’t think it succeeds as a defence of high P(doom).

I think a defence of high P(doom) that addresses the issue above would be quite valuable.

Also, for what it’s worth, I treat “I’ve gamed this out a lot and it seems likely to me” as very weak evidence except in domains where I have a track record of successful predictions or proving theorems that match my intuitions. Before I have learned to do either of these things, my intuitions are indeed pretty unreliable!

Comment by David Johnston (david-johnston) on Question for Prediction Market people: where is the money supposed to come from? · 2023-06-08T23:32:56.802Z · LW · GW

There is a situation in which information markets could be positive sum, though I don't know how practical it is:

I own a majority stake in company X. Someone has proposed an action A that company X take. I currently think this is worse than the status quo, but I think it's plausible that with better information I'd change my mind. I set up an exchange of X-shares-conditional-on-A for USD-conditional-on-A and the analogous exchange conditional on not-A, subsidised by some fraction of my X shares using an automatic market maker. If, by the closing date, X-shares-conditional-on-A trade at a sufficient premium to X-shares-conditional-on-not-A, I do A.

In this situation, my actions lose money vs the counterfactual of doing A and not subsidising the market, but compared to the counterfactual of not subsidising the market and not doing A I gain money because the rest of my stock is now worth more. It's unclear how I do compared to the most realistic counterfactual of "spend $Y researching action A more deeply and act accordingly".

(note that conditional prediction markets also have incentive issues WRT converging to the correct prices, though I'm not sure how important these are in practice)
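To make the mechanism above concrete, here is a minimal sketch of the two conditional exchanges and the closing-date decision rule. The constant-product pricing rule, the class and function names, and the numbers are all assumptions chosen for illustration; the comment only says "an automatic market maker" and does not specify one.

```python
class ConditionalPool:
    """Hypothetical constant-product pool for one branch (A or not-A).

    It trades conditional X-shares against conditional USD. All trades in a
    branch are voided if that branch does not happen (not modelled here).
    """

    def __init__(self, shares: float, usd: float):
        self.shares = shares  # conditional X-shares seeded by the subsidiser
        self.usd = usd        # conditional USD held by the pool

    def price(self) -> float:
        """Marginal price of one conditional X-share in conditional USD."""
        return self.usd / self.shares

    def buy_shares(self, usd_in: float) -> float:
        """A trader pays usd_in and receives shares; shares * usd stays constant."""
        k = self.shares * self.usd
        new_usd = self.usd + usd_in
        new_shares = k / new_usd
        bought = self.shares - new_shares
        self.shares, self.usd = new_shares, new_usd
        return bought


def should_do_a(pool_a: ConditionalPool, pool_not_a: ConditionalPool,
                required_premium: float) -> bool:
    """Do A only if shares-given-A trade at a sufficient premium at close."""
    return pool_a.price() >= pool_not_a.price() * (1.0 + required_premium)


if __name__ == "__main__":
    pool_a = ConditionalPool(shares=100.0, usd=1000.0)
    pool_not_a = ConditionalPool(shares=100.0, usd=1000.0)
    pool_a.buy_shares(100.0)  # an informed trader bids up shares-given-A
    print(pool_a.price(), pool_not_a.price())
    print(should_do_a(pool_a, pool_not_a, required_premium=0.10))
```

The subsidy is the inventory the pools are seeded with: informed traders profit by moving the prices, which is the cost paid for the information.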

Comment by David Johnston (david-johnston) on Uncertainty about the future does not imply that AGI will go well · 2023-06-02T00:53:44.699Z · LW · GW

I don't see how you get default failure without a model. In fact, I don’t see how you get there without the standard model, where an accident means you get a super intelligence with a random goal from an unfriendly prior - but that’s precisely the model that is being contested!

I can kiiinda see default 50-50 as "model free", though I'm not sure if I buy it.

Comment by David Johnston (david-johnston) on Is behavioral safety "solved" in non-adversarial conditions? · 2023-05-27T05:13:56.369Z · LW · GW

You raise some examples of the generator/critic gap, which I addressed. I’m not sure what I should look for in that paper - I mentioned the miscalibration of GPT4 after RLHF, that’s from the GPT4 tech report, and I don’t believe your linked paper shows anything analogous (ie that RLHFd models are less calibrated than they “should” be). I know that the two papers here investigate different notions of calibration.

“Always say true things” is a much higher standard than “don’t do anything obviously bad”. Hallucination is obviously a violation of the first, and it might be a violation of the second - but I just don’t think it’s obvious!

Sure, but the “lying” probably stems from the fact that to get the thumbs up from RLHF you just have to make up a believable answer (because the process AFAIK didn’t involve actual experts in various fields fact checking every tiny bit). If just a handful of “wrong but believable” examples sneak in the reward modelling phase you get a model that thinks that sometimes lying is what humans want (and without getting too edgy, this is totally true for politically charged questions!). "Lying" could well be the better policy! I am not claiming that GPT is maliciously lying, but in AI safety, malice is never really needed or even considered (ok, maybe deception is malicious by definition).

One thing I'm saying is that we don't have clear evidence to support this claim.

Comment by David Johnston (david-johnston) on Is behavioral safety "solved" in non-adversarial conditions? · 2023-05-26T22:45:01.916Z · LW · GW

I don’t agree. There is a distinction between lying and being confused - when you lie, you have to know better. Offering a confused answer is in a sense bad, but with lying there’s an obviously better policy (don’t) while it’s not the case that a confused answer is always the result of a suboptimal policy. When you are confused, the right course of action sometimes results in mistakes.

AFAIK there’s no evidence of a gap between what GPT knows and what it says when it’s running in pure generative mode (though this doesn’t say much; one would have to be quite clever to demonstrate it). There is a generator/discriminator and generator/critic gap, but this is because GPT operating as a critic is simply more capable than GPT as a generator. If we compare apples to apples then there’s again no evidence I know of that RLHFd critic-GPT is holding back on things it knows.

So I don’t think hallucination makes it obvious that behavioural safety is not solved.

I do think the fact that RLHFd models are miscalibrated is evidence against RLHF solving behavioural safety, because calibration is obviously good and the base model was capable of it.

Comment by David Johnston (david-johnston) on Let’s use AI to harden human defenses against AI manipulation · 2023-05-18T09:35:30.181Z · LW · GW

I think this is an interesting proposal. It strikes me as something that is most likely to be useful against “scalable deception” (“misinformation”), and given the utility of scalable deception such technologies might be developed anyway. I think you do need to check if this will lead to deception technologies being developed that would not otherwise have been, and if so whether we’re actually better off knowing about them (this is analogous to one of the cases against gain of function research: we might be better off not knowing how to make highly enhanced viruses).

Comment by David Johnston (david-johnston) on Bayesian Networks Aren't Necessarily Causal · 2023-05-14T06:31:40.220Z · LW · GW

I have a paper (planning to get it on arxiv any day now…) which contains a result: independence of causal mechanisms (which can be related to Occam’s razor & your first point here) + precedent (“things I can do have been done before”) + variety (related to your second point - we’ve observed the phenomena in a meaningfully varied range of circumstances) + conditional independence (which OP used to construct the Bayes net) implies a conditional distribution invariant under action.

That is, speaking very loosely, if you add your considerations to OP's recipe for Bayes nets and the assumption of precedent, you can derive something kinda like interventions.

Comment by David Johnston (david-johnston) on When is Goodhart catastrophic? · 2023-05-10T23:57:32.060Z · LW · GW

Maybe it’s similar, but high U is not necessary

Comment by David Johnston (david-johnston) on Inference Speed is Not Unbounded · 2023-05-09T13:15:33.967Z · LW · GW

Thanks for explaining the way to do exhaustive search - a big network can exhaustively search smaller network configurations. I believe that.

However, a CPU is not Turing complete (what is Turing universal?) - a CPU with an infinite read/write tape is Turing complete. This matters, because Solomonoff induction is a mixture of Turing machines. There are simple functions transformers can’t learn, such as “print the binary representation of the input + 1”; they run out of room. Solomonoff induction is not limited in this way.

Practical transformers are also usually (always?) used with exchangeable sequences, while Solomonoff inductors operate on general sequences. I can imagine ways around this (use an RNN and many epochs with a single sequence) so maybe not a fundamental limit, but still a big difference between neural nets in practice and Solomonoff inductors.
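The "input + 1" example can be sketched as follows. The first function needs working space that grows with the input, which a machine with an unbounded tape has. The second is a hypothetical stand-in for any device with a fixed budget of `max_len` symbols; it is an illustration of "running out of room", not a model of how a real transformer fails.

```python
from typing import Optional


def increment_binary(bits: str) -> str:
    """Binary representation of int(bits, 2) + 1, for inputs of any length."""
    out = []
    carry = 1
    for b in reversed(bits):  # ripple the carry from the least significant bit
        total = int(b) + carry
        out.append(str(total % 2))
        carry = total // 2
    if carry:
        out.append("1")
    return "".join(reversed(out))


def increment_binary_bounded(bits: str, max_len: int) -> Optional[str]:
    """Same function, but for a device limited to max_len symbols (assumed).

    Returns None when the input or the answer does not fit in the budget.
    """
    result = increment_binary(bits)
    if len(bits) > max_len or len(result) > max_len:
        return None
    return result


if __name__ == "__main__":
    print(increment_binary("1011"))
    print(increment_binary_bounded("111", max_len=3))
```

For any fixed `max_len` there is an input on which the bounded version has no answer to give, while the unbounded version is a few lines long.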

Comment by David Johnston (david-johnston) on When is Goodhart catastrophic? · 2023-05-09T12:29:42.081Z · LW · GW

I think there is an additional effect related to "optimization is not conditioning" that stems from the fact that causation is not correlation. Suppose for argument's sake that people evaluate alignment research partly based on where it's come from (which the machine cannot control). Then producing good alignment research by regular standards is not enough to get high ratings. If a system manages to get good ratings anyway, then the actual papers it's producing must be quite different to typical highly rated alignment papers, because they are somehow compensating for the penalty incurred by coming from the wrong source. In such a situation, I think it would not be surprising if the previously observed relationship between ratings and quality did not continue to hold.

This is similar to "causal Goodhart" in Garrabrant's taxonomy, but I don't think it's quite identical. It's ambiguous whether ratings are being "intervened on" in this situation, and actual quality is probably going to be affected somewhat. I could see it as a generalised version of causal Goodhart, where intervening on the proxy is what happens when this effect is particularly extreme.

Comment by David Johnston (david-johnston) on Inference Speed is Not Unbounded · 2023-05-09T01:56:31.689Z · LW · GW

they can obviously encode a binary circuit equivalent to a CPU

A CPU by itself is not universal. Are you saying memory augmented neural networks are practically close to universality?

as long as you have enough data (or can generate it ) - big overcomplete NNs with SGD can obviously perform a strict improvement over exhaustive search

Sorry, I'm being slow here:

  • Solomonoff does exhaustive search for any amount of data; is part of your claim that as data -> infinity, NN + SGD -> Solomonoff?
  • How do we actually do this improved exhaustive search? Do we know that SGD gets us to a global minimum in the end?

Comment by David Johnston (david-johnston) on Inference Speed is Not Unbounded · 2023-05-09T01:27:56.017Z · LW · GW

Neural networks being universal approximators doesn't mean they do as well at distributing uncertainty as Solomonoff, right (I'm not entirely sure about this)? Also, are practical neural nets actually close to being universal?

in the worst case you can recover exhaustive exploration ala solomonoff

Do you mean that this is possible in principle, or that this is a limit of SGD training?

known perhaps experimentally in the sense that the research community has now conducted large-scale extensive (and even often automated) exploration of much of the entire space of higher order corrections to SGD

I read your original claim as "SGD is known to approximate full Bayesian inference, and the gap between SGD and full inference is known to be small". Experimental evidence that SGD performs competitively does not substantiate that claim, in my view.

Comment by David Johnston (david-johnston) on Inference Speed is Not Unbounded · 2023-05-08T23:57:19.522Z · LW · GW

Do you have a link to a more in-depth defense of this claim?

Comment by David Johnston (david-johnston) on An Impossibility Proof Relevant to the Shutdown Problem and Corrigibility · 2023-05-03T08:30:06.100Z · LW · GW

I’m not convinced the indifference conditions are desirable. Shutdown can be evidence of low utility

Comment by David Johnston (david-johnston) on Hell is Game Theory Folk Theorems · 2023-05-01T12:56:34.272Z · LW · GW

I can see why feasibility + individual rationality makes a payoff profile more likely than any profile missing one of these conditions, but I can’t see why I should consider every profile satisfying these conditions as likely enough to be worth worrying about

Comment by David Johnston (david-johnston) on AI doom from an LLM-plateau-ist perspective · 2023-04-27T21:57:34.748Z · LW · GW

Why? The biggest problem in my mind is algorithmic progress. If we’re outside (C), then the “critical path to TAI” right now is algorithmic progress

Given that outside C approaches to AGI are likely to be substantially unlike anything we’re familiar with, and that controllable AGI is desirable, don’t you think that there’s a good chance these unknown algorithms have favourable control properties?

I think LLMs have some nice control properties too. I'm not so much arguing against LLMs being better than the unknown, just against the idea that we should confidently expect control to be hard for unknown algorithms.

Comment by David Johnston (david-johnston) on grey goo is unlikely · 2023-04-17T22:12:16.879Z · LW · GW

One of the contentions of this post is that life has thoroughly explored the space of nanotech possibilities. This hypothesis makes the failures of novel nanotech proposals non-independent. That said, I don’t think the post offers enough evidence to be highly confident in this proposition (the author might privately know enough to be more confident, but if so it’s not all in the post).

Separately, I can see myself thinking, when all is said and done, that Yudkowsky and Drexler are less reliable about nanotech than I previously thought (which was a modest level of reliability to begin with), even if there are some possibilities for novel nanotech missed or dismissed by this post. Though I think not everything has been said yet.

Comment by David Johnston (david-johnston) on GPTs are Predictors, not Imitators · 2023-04-10T00:45:54.169Z · LW · GW

I was just trying to clarify the limits of autoregressive vs other learning methods. Autoregressive learning is at an apparent disadvantage if is hard to compute and the reverse is easy and low entropy. It can “make up for this” somewhat if it can do a good job of predicting from , but it’s still at a disadvantage if, for example, that’s relatively high entropy compared to from . That’s it, I’m satisfied.

Comment by David Johnston (david-johnston) on GPTs are Predictors, not Imitators · 2023-04-09T23:43:34.430Z · LW · GW

Are hash characters non uniform? Then I’d agree my point doesn’t stand

Comment by David Johnston (david-johnston) on GPTs are Predictors, not Imitators · 2023-04-09T23:07:11.169Z · LW · GW

It’s the final claim I’m disputing - that the hashed text cannot itself be predicted. There’s still a benefit to going from e.g. to probability of a correct hash. It may not be a meaningful difference in practice, but there’s still a benefit in principle, and in practice it could also just generalise a strategy it learned for cases with low entropy text.

Comment by David Johnston (david-johnston) on GPTs are Predictors, not Imitators · 2023-04-09T22:30:46.003Z · LW · GW

I can see why your algorithm is hard for GPT — unless it predicts the follow up string perfectly, there’s no benefit to hashing correctly — but I don’t see why it’s impossible. What if it perfectly predicts the follow up?

Comment by David Johnston (david-johnston) on Why Are Maximum Entropy Distributions So Ubiquitous? · 2023-04-06T06:37:45.407Z · LW · GW

the Boltzmann distribution is the maximum entropy distribution subject to a constraint

$E$ for both expectation and energy doesn't lend itself to fast reading. $\mathbb{E}$ is sometimes standard for expectation

Comment by David Johnston (david-johnston) on Beren's "Deconfusing Direct vs Amortised Optimisation" · 2023-04-05T21:57:30.178Z · LW · GW

A transformer at temp 0 is also doing an argmax. I’m not sure what the fundamental difference is - maybe that there’s a simple and unchanging evaluation function for direct optimisers?

Alternatively, we could say that the class of approximators all differ substantially in practice from direct optimisation algorithms. I feel like that needs to be substantiated, however. It is, after all, possible to learn a standard direct optimisation algorithm from data. You could construct a silly learner that can implement either the direct optimisation algorithm or something else random, and then have it pick whichever performs better on the data. It might also be possible with less silly learners.
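The claim that a transformer at temperature 0 is doing an argmax can be illustrated with a toy sampler. This is a sketch over a bare list of logits; `sample_token` and the example logits are made up for illustration and are not taken from any particular library.

```python
import math
import random


def sample_token(logits, temperature, rng):
    """Sample an index from softmax(logits / temperature).

    Temperature 0 is treated as the limiting case: a plain argmax.
    """
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [0.1, 2.0, -1.0]
    print(sample_token(logits, 0, rng))     # argmax
    print(sample_token(logits, 1e-3, rng))  # nearly all mass on the argmax
    print(sample_token(logits, 1.0, rng))   # genuinely random
```

As the temperature shrinks, the softmax weights concentrate on the largest logit, so the sampler and the argmax agree.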

Comment by David Johnston (david-johnston) on A stylized dialogue on John Wentworth's claims about markets and optimization · 2023-03-26T03:20:55.703Z · LW · GW

Which is closer to Nate’s position: a) competition leads to highly instrumentally efficient AIs or b) inductive biases lead to highly instrumentally efficient AIs?

Comment by David Johnston (david-johnston) on My Objections to "We’re All Gonna Die with Eliezer Yudkowsky" · 2023-03-22T21:27:32.159Z · LW · GW

A quick guess is that at about 1 in 10 000 chance of AI doom working on it is about as good as ETG to GiveWell top charities

Comment by David Johnston (david-johnston) on My Objections to "We’re All Gonna Die with Eliezer Yudkowsky" · 2023-03-22T07:15:43.361Z · LW · GW

Yes, because I thought the why was obvious. I still do!

If doom has tiny probability, it's better to focus on other issues. While I can't give you a function mapping the doom mechanism to correct actions, different mechanisms of failure often require different techniques to address them - and even if they don't, we want to check that the technique actually addresses them.

Comment by David Johnston (david-johnston) on My Objections to "We’re All Gonna Die with Eliezer Yudkowsky" · 2023-03-22T00:26:26.189Z · LW · GW

I think the question of whether doom is of moderate or tiny probability is action relevant, and also how & why doom is most likely to happen is very action relevant

Comment by David Johnston (david-johnston) on Deep Deceptiveness · 2023-03-22T00:23:28.961Z · LW · GW

You could also downweight plans that are too far from any precedent.

Comment by David Johnston (david-johnston) on My Objections to "We’re All Gonna Die with Eliezer Yudkowsky" · 2023-03-21T06:15:32.322Z · LW · GW

Ok, I guess I just read Eliezer as saying something uninteresting with a touch of negative sentiment towards neural nets.

Comment by David Johnston (david-johnston) on My Objections to "We’re All Gonna Die with Eliezer Yudkowsky" · 2023-03-21T02:10:24.887Z · LW · GW

Would you say Yudkowsky's views are a mischaracterisation of neural network proponents, or that he's mistaken about the power of loose analogies?

Comment by David Johnston (david-johnston) on My Objections to "We’re All Gonna Die with Eliezer Yudkowsky" · 2023-03-21T01:12:07.561Z · LW · GW

In contrast, I think we can explain humans' tendency to like ice cream using the standard language of reinforcement learning.

I think you could defend a stronger claim (albeit you'd have to expend some effort): misgeneralisation of this kind is a predictable consequence of the evolution "training paradigm", and would in fact be predicted by machine learning practitioners. I think the fact that the failure is soft (humans don't eat ice cream until they die) might be harder to predict than the fact that the failure occurs.

I do not think that optimizing a network on a given objective function produces goals orientated towards maximizing that objective function. In fact, I think that this almost never happens. For example, I don't think GPTs have any sort of inner desire to predict text really well.

I think this is looking at the question in the wrong way. From a behaviourist viewpoint:

  • it considers all of the possible 1-token completions of a piece of text
  • then selects the most likely one (or randomises according to its distribution or something similar)

on this account, it "wants to predict text accurately". But Yudkowsky's claim is (roughly):

  • it considers all of the possible long run interaction outcomes
  • it selects the completion that leads to the lowest predictive loss for the machine's outputs across the entire interaction

and perhaps in this alternative sense it "wants to predict text accurately".

I'd say the first behaviour has high priors and strong evidence, and the second is (apparently?) supported by the fact that both behaviours are compatible with the vague statement "wants to predict text accurately", which I don't think is very compelling.

My response in Why aren't other people as pessimistic as Yudkowsky? includes a discussion of adversarial vulnerability and why I don't think points to any irreconcilable flaws in current alignment techniques.

I think this might be the wrong link. Either that, or I'm confused about how the sentence relates to the podcast video.

Comment by David Johnston (david-johnston) on Are COVID lab leak and market origin theories incompatible? · 2023-03-20T21:00:55.282Z · LW · GW

Not that I know of. People talk about raccoon dogs as a candidate for market spillover, not bats

Comment by David Johnston (david-johnston) on GPT-4 · 2023-03-15T20:41:00.048Z · LW · GW

I think that if RLHF reduced to a proper loss on factual questions, these probabilities would coincide (given enough varied training data). I agree it’s not entirely obvious that having these probabilities come apart is problematic, because you might recover more calibrated probabilities by asking for them. Still, knowing the logits are directly incentivised to be well calibrated seems like a nice property to have.

An agent says yes if it thinks yes is the best thing to say. This comes apart from “yes is the correct answer” only if there are additional considerations determining “best” apart from factuality. If you’re restricted to “yes/no”, then for most normal questions I think an ideal RLHF objective should not introduce considerations beyond factuality in assessing the quality of the answer - and I suspect this is also true in practical RLHF objectives. If I’m giving verbal confidences, then there are non-factual considerations at play - namely, I want my answer to communicate my epistemic state. For pretrained models, the question is not whether it is factual but whether someone would say it (though somehow it seems to come close). But for yes/no questions under RLHF, if the probabilities come apart it is due to not properly eliciting the probability (or some failure of the RLHF objective to incentivise factual answers).
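A small worked example of what "proper loss" buys here, under assumed toy numbers: if "yes" is correct with probability p, the expected log loss is minimised by reporting q = p, whereas an improper linear score (included only for contrast) is minimised by pushing q to an extreme. The function names and the grid search are illustrative choices.

```python
import math


def expected_log_loss(p: float, q: float) -> float:
    """Expected log loss of reporting P(yes) = q when the true P(yes) = p."""
    return -(p * math.log(q) + (1 - p) * math.log(1 - q))


def expected_linear_loss(p: float, q: float) -> float:
    """Improper alternative: minus the probability assigned to the true answer."""
    return -(p * q + (1 - p) * (1 - q))


def best_report(loss, p: float, grid_size: int = 999) -> float:
    """Grid-search the report q in (0, 1) that minimises the expected loss."""
    grid = [(i + 1) / (grid_size + 1) for i in range(grid_size)]
    return min(grid, key=lambda q: loss(p, q))


if __name__ == "__main__":
    p = 0.7
    print(best_report(expected_log_loss, p))     # calibrated report
    print(best_report(expected_linear_loss, p))  # overconfident report
```

Under the proper loss the optimal reported probability coincides with the true one; under the improper one the two come apart even though nothing but factuality is being scored.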

Comment by David Johnston (david-johnston) on GPT-4 · 2023-03-15T07:46:23.214Z · LW · GW

I still think it’s curious that RLHF doesn’t seem to reduce to a proper loss on factual questions, and I’d guess that it’d probably be better if it did (at least, with contexts that strictly ask for a “yes/no” answer without qualification)

Comment by David Johnston (david-johnston) on GPT-4 · 2023-03-15T00:26:39.985Z · LW · GW

Yeah, I saw that - I'm wondering if previous models benefitted more from RLHF

Comment by David Johnston (david-johnston) on GPT-4 · 2023-03-14T21:41:37.234Z · LW · GW

Did gpt 3.5 get high scores on human exams before fine tuning? My rough impression is “gpt4 relies less on fine tuning for its capabilities”

Comment by David Johnston (david-johnston) on Discussion with Nate Soares on a key alignment difficulty · 2023-03-14T02:36:21.006Z · LW · GW

A few reflections on this piece:

  • It helped me understand Nate's view. I'd read previous examples of his suggesting "cutting edge scientific work necessitates CIS pursuit" (e.g. the laser in the fog), but it wasn't clear to me how important he considered these examples
  • Theories about how ML generalises play a substantial role in everyone's thinking here. AFAIK we don't have precise theories that are much help in this regard, and in practice people usually use imprecise theories. Just because they're imprecise doesn't mean they can't be explained. Perhaps some effort into explicating theories of ML generalisation could help everyone understand one another.
  • As AI gets more powerful, it makes sense that AI systems will make higher and higher level decisions about the objectives that they pursue. Holden & Nate focusing on the benchmark of "needle moving scientific research" seems to suggest agreement on the following:
    • In order to sustain the trend of more powerful AI making higher and higher level decisions, we will need substantial innovation in our technologies for AI control
    • The rate of innovation possible under business as usual human science seems unlikely to keep up with this need
    • Thus we require AI acceleration of AI control science

Regarding this last point: it's not clear to me whether slow progress in AI control systems will lead to slow progress in AI making higher and higher level decisions or not. That is, it's not obvious to me that AI control systems failing to keep up necessarily leads to catastrophe. I acknowledge that very powerful AI systems may seem to work well with poor control technologies, but I'm uncertain about whether moderately powerful AI systems work well enough with poor control technologies for the very powerful systems to be produced (and also what the relevant levels of power are, compared to today's systems).

One more thing: I’m suspicious of equivocation between “some convergent instrumental sub goals” and “worrisome convergent instrumental sub goals”. There are probably many collections of CISs that aren’t similar to the worrisome ones in the machine’s picture of the world.

And another more thing:

To be clear, I think both Nate and I are talking about a pretty "thin" version of POUDA-avoidance here, more like "Don't do egregiously awful things" than like "Pursue the glorious transhumanist future." Possibly Nate considers it harder to get the former without the latter than I do.

I'm still unsure how much pivotal act considerations weigh in Nate/MIRI's views. My view is roughly:

  • Cutting edge scientific work without disastrous instrumental behaviour seems pretty attainable
  • Unilaterally preventing anyone else from building AI seems much more likely to entail disastrous instrumental behaviour

and I can easily imagine finding it difficult to be confident you're avoiding any catastrophes if you're aiming for the second.

Comment by David Johnston (david-johnston) on Discussion with Nate Soares on a key alignment difficulty · 2023-03-14T02:10:03.736Z · LW · GW

What coherence theorem do you have in mind that has these implications?

For that matter, what implications are you referring to?

Comment by David Johnston (david-johnston) on The hot mess theory of AI misalignment: More intelligent agents behave less coherently · 2023-03-10T03:16:36.561Z · LW · GW

Here's a hypothesis about the inverse correlation arising from your observation: When we evaluate a thing's coherence, we sample behaviours in environments we expect to find the thing in. More intelligent things operate in a wider variety of environments, and the environmental diversity leads to behavioural diversity that we attribute to a lack of coherence.

Comment by David Johnston (david-johnston) on The hot mess theory of AI misalignment: More intelligent agents behave less coherently · 2023-03-10T02:46:40.990Z · LW · GW

Sure, and that's why I said "coherence" instead of coherence

Comment by David Johnston (david-johnston) on The hot mess theory of AI misalignment: More intelligent agents behave less coherently · 2023-03-10T00:29:23.865Z · LW · GW

This is a cool result - I think it's really not obvious why intelligence and "coherence" seem inversely correlated, but it's interesting that you replicated it across three different classes of things (ML models, animals, organisations).

Comment by David Johnston (david-johnston) on Why do we assume there is a "real" shoggoth behind the LLM? Why not masks all the way down? · 2023-03-10T00:11:45.893Z · LW · GW

The shoggoth is not an average over masks. If you want to see the shoggoth, stop looking at the text on the screen and look at the input token sequence and then the logits that the model spits out. That's what I mean by the behavior of the shoggoth.

  1. We can definitely implement a probability distribution over text as a mixture of text generating agents. I doubt that an LLM is well understood as such in all respects, but thinking of a language model as a mixture of generators is not necessarily a type error.

  2. The logits and the text on the screen cooperate to implement the LLM's cognition. Its outputs are generated by an iterated process of modelling completions, sampling them, then feeding the sampled completions back to the model.
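The iterated process in point 2 can be sketched with a toy stand-in for the model. The bigram table `TOY_BIGRAMS` and the function names are hypothetical; a real LLM conditions on the whole context, not just the last token, but the loop has the same shape.

```python
import random

# Hypothetical toy "model": next-token probabilities given the last token.
TOY_BIGRAMS = {
    "a": {"b": 1.0},
    "b": {"a": 0.5, "c": 0.5},
    "c": {"a": 1.0},
}


def next_token_probs(tokens):
    """Model step: map the context to a distribution over the next token."""
    return TOY_BIGRAMS[tokens[-1]]


def generate(prompt, steps, rng):
    """Model a completion, sample it, feed it back; repeat."""
    tokens = list(prompt)
    for _ in range(steps):
        probs = next_token_probs(tokens)      # what the model "spits out"
        choices, weights = zip(*probs.items())
        sampled = rng.choices(choices, weights=weights, k=1)[0]
        tokens.append(sampled)                # sampled text becomes new input
    return tokens


if __name__ == "__main__":
    print(generate(["a"], steps=8, rng=random.Random(0)))
```

Neither the distribution nor the sampled text alone determines the output; each sample changes the context that the next distribution is computed from.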

Comment by David Johnston (david-johnston) on [Linkpost] Some high-level thoughts on the DeepMind alignment team's strategy · 2023-03-10T00:01:51.915Z · LW · GW

So, if I'm understanding you correctly:

  • if it's possible to build a single AI system that executes a catastrophic takeover (via self-bootstrap or whatever), it's also probably possible to build a single aligned sovereign, and so in this situation winning once is sufficient
  • if it is not possible to build a single aligned sovereign, then it's probably also not possible to build a single system that executes a catastrophic takeover and so the proposition that the model only has to win once is not true in any straightforward way
    • in this case, we might be able to think of "composite AI systems" that can catastrophically take over or end the acute risk period, and for similar reasons as in the first scenario, winning once with a composite system is sufficient, but such systems are not built from single acts

and you think the second scenario is more likely than the first.