Posts

A brief theory of why we think things are good or bad 2024-10-20T20:31:26.309Z
Mechanistic Anomaly Detection Research Update 2024-08-06T10:33:26.031Z
Opinion merging for AI control 2023-05-04T02:43:51.196Z
Is it worth avoiding detailed discussions of expectations about agency levels of powerful AIs? 2023-03-16T03:06:25.719Z
How likely are malign priors over objectives? [aborted WIP] 2022-11-11T05:36:11.060Z
When can a mimic surprise you? Why generative models handle seemingly ill-posed problems 2022-11-05T13:19:37.384Z
There's probably a tradeoff between AI capability and safety, and we should act like it 2022-06-09T00:17:24.722Z
Is evolutionary influence the mesa objective that we're interested in? 2022-05-03T01:18:06.927Z
[Cross-post] Half baked ideas: defining and measuring Artificial Intelligence system effectiveness 2022-04-05T00:29:16.992Z
Are there any impossibility theorems for strong and safe AI? 2022-03-11T01:41:01.184Z
Counterfactuals from ensembles of peers 2022-01-04T07:01:06.196Z

Comments

Comment by David Johnston (david-johnston) on RL, but don't do anything I wouldn't do · 2024-12-09T05:00:39.748Z · LW · GW

If you're in a situation where you can reasonably extrapolate from past rewards to future reward, you can probably extrapolate previously seen "normal behaviour" to normal behaviour in your situation. Reinforcement learning is limited - you can't always extrapolate past reward - but it's not obvious that imitative regularisation is fundamentally more limited.

(normal does not imply safe, of course)

Comment by David Johnston (david-johnston) on RL, but don't do anything I wouldn't do · 2024-12-09T00:26:47.125Z · LW · GW

Their empirical result rhymes with adversarial robustness issues - we can train adversaries to maximise ~arbitrary functions subject to a constraint of small perturbation from the ground truth. Here the maximised function is a faulty reward model and the constraint is KL divergence to a base model instead of distance to a ground-truth image.

I wonder if multiscale aggregation could help here too, as it does with image adversarial robustness. We want the KL penalty to ensure that the generations look normal at any "scale", whether we look at them token by token or read a high-level summary of them. However, I suspect their "weird, low-KL" generations will have weird high-level summaries, whereas more desired policies would look more normal in summary (though it's not immediately obvious if this translates to low and high probability summaries respectively - one would need to test). I think a KL penalty to the "true base policy" should operate this way automatically, but as the authors note we can't actually implement that.
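
For concreteness, here's a minimal sketch of the kind of comparison I have in mind - not the paper's actual setup. The distributions, the reward value, and the coefficient `beta` are all invented; the point is just that a policy can sit close to the base model token by token while being very far from it at the level of high-level summaries.

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# Toy token-level distributions for the policy and the base model.
policy_tokens = np.array([0.70, 0.20, 0.10])
base_tokens   = np.array([0.65, 0.25, 0.10])

# Hypothetical distributions over coarse summaries of whole generations
# (e.g. "reads like a normal answer" vs "weird degenerate text").
policy_summaries = np.array([0.30, 0.70])
base_summaries   = np.array([0.95, 0.05])

beta = 0.1
reward = 1.0  # what a (possibly faulty) reward model assigns to the generation

token_level_objective = reward - beta * kl(policy_tokens, base_tokens)
summary_level_kl = kl(policy_summaries, base_summaries)

print(f"token-level objective: {token_level_objective:.3f}")       # barely penalised
print(f"summary-level KL (large => weird in summary): {summary_level_kl:.3f}")
```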

Comment by David Johnston (david-johnston) on Model Integrity: MAI on Value Alignment · 2024-12-05T22:59:08.187Z · LW · GW

Is your view closer to:

  • there are two hard steps (instruction following, value alignment), and of the two instruction following is much more pressing
  • instruction following is the only hard step; if you get that, value alignment is almost certain to follow

Comment by David Johnston (david-johnston) on o1 is a bad idea · 2024-11-14T06:36:37.165Z · LW · GW

Mathematical reasoning might be specifically conducive to language invention because our ability to automatically verify reasoning means that we can potentially get lots of training data. The reason I expect the invented language to be “intelligible” is that it is coupled (albeit with some slack) to automatic verification.

Comment by David Johnston (david-johnston) on o1 is a bad idea · 2024-11-14T01:48:01.594Z · LW · GW

There's a regularization problem to solve for 3.9 and 4, and it's not obvious to me that glee will be enough to solve it (3.9 = "unintelligible CoT").

I'm not sure how o1 works in detail, but for example, backtracking (which o1 seems to use) makes heavy use of the pretrained distribution to decide on best next moves. So, at the very least, it's not easy to do away with the native understanding of language. While it's true that there is some amount of data that will enable large divergences from the pretrained distribution - and I could imagine mathematical proof generation eventually reaching this point, for example - more ambitious goals inherently come with less data, and it's not obvious to me that there will be enough data in alignment-critical applications to cause such a large divergence.

There's an alternative version of language invention where the model invents a better language for (e.g.) maths then uses that for more ambitious projects, but that language is probably quite intelligible!

Comment by David Johnston (david-johnston) on A brief theory of why we think things are good or bad · 2024-10-22T01:46:49.476Z · LW · GW

For what it's worth, one idea I had as a result of our discussion was this:

  • We form lots of beliefs as a result of motivated reasoning
  • These beliefs are amenable to revision due to evidence, reason or (maybe) social pressure
  • Those beliefs that are largely resilient to these challenges are "moral foundations"

So philosophers like "pain is bad" as a moral foundation because we want to believe it + it is hard to challenge with evidence or reason. Laypeople probably have lots of foundational moral beliefs that don't stand up as well to evidence or reason, but (perhaps) are equally attributable to motivated reasoning.

Social pressure is a bit iffy to include because I think lots of people relate to beliefs that they adopted because of social pressure as moral foundations, and believing something because you're under pressure to do so is an instance of motivated reasoning.

I don't think this is a response to your objections, but I'm leaving it here in case it interests you.

Comment by David Johnston (david-johnston) on A brief theory of why we think things are good or bad · 2024-10-22T01:09:42.086Z · LW · GW

I can explain why I believe bachelors are unmarried: I learned that this is what the word bachelor means, I learned this because it is what bachelor means, and the fact that there's a word "bachelor" that means "unmarried man" is contingent on some unimportant accidents in the evolution of language. A) it is certainly not the result of an axiomatic game and B) if moral beliefs were also contingent on accidents in the evolution of language (I think most are not), that would have profound implications for metaethics.

Motivated belief can explain non-purely-selfish beliefs. I might believe pain is bad because I am motivated to believe it, but the belief still concerns other people. This is even more true when we go about constructing higher order beliefs and trying to enforce consistency among beliefs. Undesirable moral beliefs could be a mark against this theory, but you need more than not-purely-selfish moral beliefs.

I'm going to bow out at this point because I think we're getting stuck covering the same ground.

Comment by David Johnston (david-johnston) on A brief theory of why we think things are good or bad · 2024-10-21T21:49:37.695Z · LW · GW

Thanks for your continued engagement.

I’m interested in explaining foundational moral beliefs like “suffering is bad”, not beliefs like “animals do/don’t suffer”, which are about badness only because we accept the foundational assumption that suffering is bad. Is that clear in the updated text?

Now, I don’t think these beliefs come from playing axiomatic games like “define good as that which increases welfare”. There are many lines of evidence for this. First: “define bad as that which increases suffering” and “define good as that which increases suffering” are not equally plausible. We have pre-existing beliefs about this.

Second: you talk about philosophers analysing welfare. However, the method that philosophers use to do this usually involves analysing a bunch of fundamental moral assumptions. For example, from the Stanford encyclopaedia of philosophy:

“Correspondingly, no amount of empirical investigation seems by itself, without some moral assumption(s) in play, sufficient to settle a moral question.” (https://plato.stanford.edu/entries/metaethics/)

I am suggesting that the source of these fundamental moral assumptions may not be mysterious - we have a known ability to form beliefs based on what we want, and fundamental moral beliefs often align with what we want.

Comment by David Johnston (david-johnston) on A brief theory of why we think things are good or bad · 2024-10-21T03:47:56.429Z · LW · GW

I think precisely defining "good" and "bad" is a bit beside the point - it's a theory about how people come to believe things are good and bad, and we're perfectly capable of having vague beliefs about goodness and badness. That said, the theory is lacking a precise account of what kind of beliefs it is meant to explain.

The LLM section isn't meant as support for the theory, but speculation about what it would say about the status of "experiences" that language models can have. Compared to my pre-existing notions, the theory seems quite willing to accommodate LLMs having good and bad experiences on par with those that people have.

Comment by David Johnston (david-johnston) on A brief theory of why we think things are good or bad · 2024-10-21T00:04:36.116Z · LW · GW

I have a pedantic and a non-pedantic answer to this. Pedantic: you say X is "usually considered good" if it increases welfare. Perhaps you mean to imply that if X is usually considered good then it is good. In this case, I refer you to the rest of the paragraph you quote.

Non-pedantic: yes, it's true that once you accept some fundamental assumptions about goodness and badness you can go about theorising and looking for evidence. I'm suggesting that motivated reasoning is the mechanism that makes those fundamental assumptions believable.

I added a paragraph mentioning this, because I think your reaction is probably common.

Comment by David Johnston (david-johnston) on The Hidden Complexity of Wishes · 2024-10-20T02:25:44.782Z · LW · GW

Here's a basic model of policy collapse: suppose there exist pathological policies of low prior probability (/high algorithmic complexity) such that they play the training game when it is strategically wise to do so, and when they get a good opportunity they defect in order to pursue some unknown aim.

Because they play the training game, a wide variety of training objectives will collapse to one of these policies if the system in training starts exploring policies of sufficiently high algorithmic complexity. So, according to this crude model, there's a complexity bound: stay under it and you're fine, go over it and you get pathological behaviour. Roughly, whatever desired behaviour requires the most algorithmically complex policy is the one that is most pertinent for assessing policy collapse risk (because that's the one that contributes most of the algorithmic complexity, and so it gives you a first-order estimate of whether or not you're crossing the collapse threshold). So, which desired behaviour requires the most complex policy: is it, for example, respecting commonsense moral constraints, or is it inventing molecular nanotechnology?
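
To make the crude model concrete, here's a toy numerical sketch. The framing (a simplicity prior proportional to 2^-K over policies that fit the training signal equally well) is my gloss, and every complexity figure is invented for illustration.

```python
# Toy illustration of the "complexity bound" picture: among policies that play the
# training game equally well, a simplicity prior 2^-K favours the least complex one.
# All complexity numbers (in bits) are made up.

def prior_weight(complexity_bits: int) -> float:
    return 2.0 ** -complexity_bits

# Hypothetical complexity of the simplest policy implementing each desired behaviour.
desired_behaviours = {
    "respect commonsense moral constraints": 400,
    "invent molecular nanotechnology": 900,
}
simplest_pathological_policy = 700  # plays the training game, defects later

# The behaviour needing the most complex policy determines whether we cross the bound.
hardest = max(desired_behaviours, key=desired_behaviours.get)
threshold_crossed = desired_behaviours[hardest] > simplest_pathological_policy

print(f"Hardest desired behaviour: {hardest}")
print(f"Crosses the collapse threshold: {threshold_crossed}")
# Under these made-up numbers, the pathological policy has higher prior weight than
# the simplest policy for the hardest desired behaviour:
print(prior_weight(simplest_pathological_policy) > prior_weight(desired_behaviours[hardest]))
```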

Tangentially, the policy collapse theory does not predict outcomes that look anything like malicious compliance. It predicts that, if you're in a position of power over the AI system, your mother is saved exactly as you want her to be. If you are not in such a position then your mother is not saved at all and you get a nanobot war instead or something. That is, if you do run afoul of policy collapse, it doesn't matter if you want your system to pursue simple or complex goals, you're up shit creek either way.

Comment by David Johnston (david-johnston) on The Hidden Complexity of Wishes · 2024-10-19T12:11:33.604Z · LW · GW

Algorithmic complexity is precisely analogous to difficulty-of-learning-to-predict, so saying "it's not about learning to predict, it's about algorithmic complexity" doesn't make sense. One read of the original is: learning to respect commonsense moral side constraints is tricky[1], but AI systems will learn how to do it in the end. I'd be happy to call this read correct, and it is consistent with the observation that today's AI systems do respect commonsense moral side constraints given straightforward requests, and that it took a few years to figure out how to do it. That read doesn't really jibe with your commentary.

Your commentary seems to situate this post within a larger argument: teaching a system to "act" is different to teaching it to "predict" because in the former case a sufficiently capable learner's behaviour can collapse to a pathological policy, whereas teaching a capable learner to predict does not risk such collapse. Thus "prediction" is distinguished from "algorithmic complexity". Furthermore, commonsense moral side constraints are complex enough to risk such collapse when we train an "actor" but not a "predictor". This seems confused.

First, all we need to turn a language model prediction into an action is a means of turning text into action, and we have many such means. So the distinction between text predictor and actor is suspect. We could consider an alternative knows/cares distinction: does a system act properly when properly incentivised ("knows") vs does it act properly when presented with whatever context we are practically able to give it ("""cares""")? Language models usually act properly given simple prompts, so in this sense they "care". So rejecting evidence from language models does not seem well justified.

Second, there's no need to claim that commonsense moral side constraints in particular are so hard that trying to develop AI systems that respect them leads to policy collapse. It need only be the case that one of the things we try to teach them to do leads to policy collapse. Teaching values is not particularly notable among all the things we might want AI systems to do; it certainly does not seem to be among the hardest. Focussing on values makes the argument unnecessarily weak.

Third, algorithmic complexity is measured with respect to a prior. The post invokes (but does not justify) an "English speaking evil genie" prior. I don't think anyone thinks this is a serious prior for reasoning about advanced AI system behaviour. The post is (according to your commentary, if not the post itself) making a quantitative point - values are sufficiently complex to induce policy collapse - but it's measuring this quantity using a nonsense prior. If the quantitative argument was indeed the original point, it is mystifying why a nonsense prior was chosen to make it, and also why no effort was made to justify the prior.


  1. the text proposes full value alignment as a solution to the commonsense side constraints problem, but this turned out to be stronger than necessary. ↩︎

Comment by David Johnston (david-johnston) on Access to powerful AI might make computer security radically easier · 2024-06-09T00:30:42.720Z · LW · GW

When do you think is the right time to work on these issues? Monitoring, trust displacement and fine-grained permission management all look liable to raise issues that weren’t anticipated and haven’t already been solved, because they’re not the way things have been done historically. My gut sense is that GPT4 performance is much lower when you’re asking it to do novel things. Maybe it’s also possible to make substantial gains with engineering and experimentation, but you’ll need a certain level of performance in order to experiment.

Some wild guesses: maybe the right time to start work is one generation before it’s feasible, and that might mean starting now for fine-grained permissions, GPT-4.5 for monitoring, and GPT-5 for trust displacement.

Comment by David Johnston (david-johnston) on Counting arguments provide no evidence for AI doom · 2024-02-29T07:06:54.131Z · LW · GW

The AI system builders’ time horizon seems to be a reasonable starting point

Comment by David Johnston (david-johnston) on Counting arguments provide no evidence for AI doom · 2024-02-29T06:59:23.802Z · LW · GW

Nora and/or Quentin: you talk a lot about inductive biases of neural nets ruling scheming out, but I have a vague sense that scheming ought to happen in some circumstances - perhaps rather contrived, but not so contrived as to be deliberately inducing the ulterior motive. Do you expect this to be impossible? Can you propose a set of conditions you think sufficient to rule out scheming?

Comment by David Johnston (david-johnston) on Counting arguments provide no evidence for AI doom · 2024-02-29T06:49:42.132Z · LW · GW

What in your view is the fundamental difference between world models and goals such that the former generalise well and the latter generalise poorly?

One can easily construct a model with a free parameter X and training data such that many choices of X will match the training data but results will diverge in situations not represented in the training data (for example, the model is a physical simulation and X tracks the state of some region in the simulation that will affect the learner’s environment later, but hasn’t done so during training). The simplest setting of X (call it x_s) could easily be wrong. We can even moralise the story: the model regards its job as predicting the output under x_s, and if the world happens to operate according to some other x' then the model doesn’t care. However, it’s still going to be ineffective in the future where the value of X matters.
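
A minimal sketch of the kind of construction I mean - the model, the training window `t_train_end`, and the numbers are all hypothetical; X is a latent parameter that only affects outcomes after training.

```python
# Toy construction: a latent parameter X has no effect during the training window
# but changes predictions afterwards, so many values of X fit the training data.

def model(t: float, X: float, t_train_end: float = 10.0) -> float:
    base = 2.0 * t                                   # behaviour observed during training
    return base if t <= t_train_end else base + X    # X only matters afterwards

train_times = [1.0, 3.0, 7.0, 10.0]
future_time = 20.0

# Two settings of X fit the training data identically...
print(all(model(t, X=0.0) == model(t, X=5.0) for t in train_times))  # True

# ...but diverge once X starts to matter, so the "simplest" choice (X = 0) can be wrong.
print(model(future_time, X=0.0), model(future_time, X=5.0))          # 40.0 vs 45.0
```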

Comment by David Johnston (david-johnston) on A Back-Of-The-Envelope Calculation On How Unlikely The Circumstantial Evidence Around Covid-19 Is · 2024-02-08T10:56:17.956Z · LW · GW

Another comment on timing updates: if you’re making a timing update for zoonosis vs DEFUSE, and you’re considering a long timing window w_z for zoonosis, then your prior for a DEFUSE leak needs to be adjusted for the short window w_d in which this work could conceivably cause a leak, so you end up with something like p(defuse_pandemic)/p(zoo_pandemic) = rr_d · w_d/w_z, where rr_d is the riskiness of DEFUSE vs zoonosis per unit time. Then you make the “timing update” p(now | defuse_pandemic)/p(now | zoo_pandemic) = w_z/w_d and you’re just left with rr_d.
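
Spelled out with purely illustrative numbers (the window lengths and rr_d below are made up), the window factors cancel and only rr_d survives:

```python
# Illustrative numbers only: the point is that the window lengths cancel.
w_z = 20.0   # years in which a zoonotic pandemic could plausibly have started
w_d = 2.0    # years in which DEFUSE-derived work could plausibly have caused a leak
rr_d = 0.3   # riskiness of DEFUSE-type work vs zoonosis, per unit time (made up)

prior_ratio   = rr_d * w_d / w_z   # p(defuse_pandemic) / p(zoo_pandemic)
timing_update = w_z / w_d          # p(now | defuse_pandemic) / p(now | zoo_pandemic)

posterior_ratio = prior_ratio * timing_update
print(posterior_ratio, rr_d)       # equal up to float rounding: the windows cancel
```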

Comment by David Johnston (david-johnston) on A Back-Of-The-Envelope Calculation On How Unlikely The Circumstantial Evidence Around Covid-19 Is · 2024-02-08T10:24:07.795Z · LW · GW

Sorry, I edited (was hoping to get in before you read it)

Comment by David Johnston (david-johnston) on A Back-Of-The-Envelope Calculation On How Unlikely The Circumstantial Evidence Around Covid-19 Is · 2024-02-08T10:06:40.831Z · LW · GW

If your theory is that there was a lab leak from WIV while working on DEFUSE-derived work, then I’ll buy that you can assign a high probability to time & place … but your prior will be waaaaaay below the prior on “lab leak, nonspecific” (which is how I was originally reading your piece).

Comment by David Johnston (david-johnston) on A Back-Of-The-Envelope Calculation On How Unlikely The Circumstantial Evidence Around Covid-19 Is · 2024-02-08T04:23:25.654Z · LW · GW

You really think in 60% of cases where country A lifts a ban on funding gain of function research a pandemic starts in country B within 2 years? Same question for “warning published in Nature”.

Comment by David Johnston (david-johnston) on A case for AI alignment being difficult · 2024-01-01T23:57:50.197Z · LW · GW

If people now don’t have strong views about exactly what they want the world to look like in 1000 years but people in 1000 years do have strong views then I think we should defer to future people to evaluate the “human utility” of future states. You seem to be suggesting that we should take the views of people today, although I might be misunderstanding.

Edit: or maybe you’re saying that the AGI trajectory will be ~random from the point of view of the human trajectory due to a different ontology. Maybe, but different ontology -> different conclusions is less obvious to me than different data -> different conclusions. If there’s almost no mutual information between the different data then the conclusions have to be different, but sometimes you could come to the same conclusions under different ontologies w/data from the same process.

Comment by David Johnston (david-johnston) on A case for AI alignment being difficult · 2024-01-01T23:46:52.883Z · LW · GW

Given this assumption, the human utility function(s) either do or don't significantly depend on human evolutionary history. I'm just going to assume they do for now.

There seems to be a missing possibility here that I take fairly seriously, which is that human values depend on (collective) life history. That is: human values are substantially determined by collective life history, and rather than converging to some attractor this is a path dependent process. Maybe you can even trace the path taken back to evolutionary history, but it’s substantially mediated by life history.

Under this view, the utility of the future wrt human values depends substantially on whether, in the future, people learn to be very sensitive to outcome differences. But “people are sensitive to outcome differences and happy with the outcome” does not seem better to me than “people are insensitive to outcome differences and happy with the outcome” (this is a first impression; I could be persuaded otherwise), even though it’s higher utility, whereas “people are unhappy with the outcome” does seem worse than “people are happy with the outcome”.

Under this view, I don’t think this follows:

there is some dependence of human values on human evolutionary history, so that a default unaligned AGI would not converge to the same values

My reasoning is that a “default AGI” will have its values contingent on a process which overlaps with the collective life history that determines human values. This is a different situation to values directly determined by evolutionary history, where the process that determines human values is temporally distant and therefore perhaps more-or-less random from the point of view of the AGI. So there’s a compelling reason to believe in value differences in the “evolution history directly determines values” case that’s absent in the “life history determines values” case.

Different values are still totally plausible, of course - I’m objecting to the view that we know they’ll be different.

(Maybe you think this is all an example of humans not really having values, but that doesn’t seem right to me).

Comment by David Johnston (david-johnston) on Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense · 2023-11-26T00:23:36.150Z · LW · GW

You're changing the topic to "can you do X without wanting Y?", when the original question was "can you do X without wanting anything at all?".

A system that can, under normal circumstances, explain how to solve a problem won’t necessarily solve a problem that gets in the way of explaining the solution. The notion of wanting that Nate proposes is “solving problems in order to achieve the objective”, and this need not apply to the system that explains solutions. In short: yes.

Comment by David Johnston (david-johnston) on Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense · 2023-11-26T00:00:52.231Z · LW · GW

If we are to understand you as arguing for something trivial, then I think it only has trivial consequences. We must add nontrivial assumptions if we want to offer a substantive argument for risk.

Suppose we have a collection of systems of different ability that can all, under some conditions, solve some task X. Let's say an "X-wrench" is an event that defeats systems of lower ability but not systems of higher ability (i.e. prevents them from solving X).

A system that achieves X with probability 1 − ε must defeat all X-wrenches except a set with probability of at most ε. If the set of events that are Y-wrenches but not X-wrenches has probability δ, then the same system can defeat all Y-wrenches except a collection with probability of at most ε + δ.

That is, if the challenges involved in achieving X are almost the same as the challenges involved in achieving Y, then something good at achieving X is almost as good at achieving Y (granting the somewhat vague assumptions about general capability baked into the definition of wrenches).

However, if X is something that people basically approve of and Y is something people do not approve of, then I do not think the challenges almost overlap. In particular, to do Y, with high probability you need to defeat a determined opposition, which is not likely to be necessary if you want X. That is: no need to kill everyone with nanotech if you're doing what you were supposed to.

In order to sustain the argument for risk, we need to assume that the easiest way to defeat X-wrenches is to learn a much more general ability to defeat wrenches than necessary and apply it to solving X and, furthermore, that this ability is sufficient to also defeat Y-wrenches. This is plausible - we do actually find it helpful to build generally capable systems to solve very difficult problems - but also plausibly false. Even highly capable AI that achieves long-term objectives could end up substantially specialised for those objectives.

As an aside, if the set of X-wrenches includes the gradient updates received during training, then an argument that an X-solver generalises to a Y-solver may also imply that deceptive alignment is likely (alternatively, proving that X-solvers generalise to Y-solvers is at least as hard as proving deceptive alignment).

Comment by David Johnston (david-johnston) on AI as a science, and three obstacles to alignment strategies · 2023-10-26T07:24:31.343Z · LW · GW

Two observations:

  1. If you think that people’s genes would be a lot fitter if people cared about fitness more, then surely there’s a good chance that a more efficient version of natural selection would lead to people caring more about fitness.

  2. You might, on the other hand, think that the problem is more related to feedbacks. I.e. if you’re the smartest monkey, you can spend your time scheming to have all the babies. If there are many smart monkeys, you have to spend a lot of time worrying about what the other monkeys think of you. If this is how you’re worried misalignment will arise, then I think “how do deep learning models generalise?” is the wrong tree to bark up

  3. If people did care about fitness, would Yudkowsky not say “instrumental convergence! Reward hacking!”? I’d even be inclined to grant he had a point.

Comment by David Johnston (david-johnston) on How LLMs are and are not myopic · 2023-07-27T11:54:49.557Z · LW · GW

I can't speak for janus, but my interpretation was that this is due to a capacity budget, meaning it can be favourable to lose a bit of accuracy on token n if you gain more on token n+m. I agree some examples would be great.

Comment by David Johnston (david-johnston) on A Defense of Work on Mathematical AI Safety · 2023-07-06T22:56:37.924Z · LW · GW

there are strong arguments that control of strongly superhuman AI systems will not be amenable to prosaic alignment

In which section of the linked paper is the strong argument for this conclusion to be found? I had a quick read of it but could not see it - I skipped the long sections of quotes, as the few I read were claims rather than arguments.

Comment by David Johnston (david-johnston) on Contra Anton 🏴‍☠️ on Kolmogorov complexity and recursive self improvement · 2023-07-01T09:54:00.123Z · LW · GW

I don’t disagree with any of what you say here - I just read Anton as assuming we have a program on that frontier

Comment by David Johnston (david-johnston) on Contra Anton 🏴‍☠️ on Kolmogorov complexity and recursive self improvement · 2023-07-01T01:39:38.769Z · LW · GW

The mistake here is the assumption that a program that models the world better necessarily has a higher Kolmogorov complexity.

I think Anton assumes that we have the simplest program that predicts the world to a given standard, in which case this is not a mistake. He doesn't explicitly say so, though, so I think we should wait for clarification.

But it's a strange assumption; I don't see why the minimum complexity predictor couldn't carry out what we would interpret as RSI in the process of arriving at its prediction.

Comment by David Johnston (david-johnston) on Contra Anton 🏴‍☠️ on Kolmogorov complexity and recursive self improvement · 2023-06-30T11:26:29.002Z · LW · GW

I think he’s saying “suppose p1 is the shortest program that gets at most loss ε. If p2 gets some lower loss ε′ < ε, then we must require a longer string than p1 to express p2, and p1 therefore cannot express p2”.

This seems true, but I don’t understand its relevance to recursive self improvement.

Comment by David Johnston (david-johnston) on A "weak" AGI may attempt an unlikely-to-succeed takeover · 2023-06-29T01:57:30.221Z · LW · GW

I think it means that whatever you get is conservative in cases where it's unsure of whether it's in training, which may translate to being conservative where it's unsure of success in general.

I agree it doesn't rule out an AI that takes a long shot at takeover! But whatever cognition we posit that the AI executes, it has to yield very high training performance. So AIs that think they have a very short window for influence or are less-than-perfect at detecting training environments are ruled out.

Comment by David Johnston (david-johnston) on A "weak" AGI may attempt an unlikely-to-succeed takeover · 2023-06-29T00:49:36.769Z · LW · GW

An AI that wants something and is too willing to take low-probability shots at takeover (or just wielding influence) would get trained away, no?

What I mean is, however it makes decisions, it has to be compatible with very high training performance.

Comment by David Johnston (david-johnston) on Uncertainty about the future does not imply that AGI will go well · 2023-06-10T11:27:22.600Z · LW · GW

If I can make my point a bit more carefully: I don’t think this post successfully surfaces the bits of your model that hypothetical Bob doubts. The claim that “historical accidents are a good reference class for existential catastrophe” is the primary claim at issue. If they were a good reference class, very high risk would obviously be justified, in my view.

Given that your post misses this, I don’t think it succeeds as a defence of high P(doom).

I think a defence of high P(doom) that addresses the issue above would be quite valuable.

Also, for what it’s worth, I treat “I’ve gamed this out a lot and it seems likely to me” as very weak evidence except in domains where I have a track record of successful predictions or proving theorems that match my intuitions. Before I have learned to do either of these things, my intuitions are indeed pretty unreliable!

Comment by David Johnston (david-johnston) on Question for Prediction Market people: where is the money supposed to come from? · 2023-06-08T23:32:56.802Z · LW · GW

There is a situation in which information markets could be positive sum, though I don't know how practical it is:

I own a majority stake in company X. Someone has proposed an action A that company X take, I currently think this is worse than the status quo, but I think it's plausible that with better information I'd change my mind. I set up an exchange of X-shares-conditional-on-A for USD-conditional-on-A and the analogous exchange conditional on not-A, subsidised by some fraction of my X shares using an automatic market maker. If, by the closing date, X-shares-conditional-on-A trade at a sufficient premium to X-shares-conditional-on-not-A, I do A.

In this situation, my actions lose money vs the counterfactual of doing A and not subsidising the market, but compared to the counterfactual of not subsidising the market and not doing A I gain money because the rest of my stock is now worth more. It's unclear how I do compared to the most realistic counterfactual of "spend $Y researching action A more deeply and act accordingly".
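
A minimal sketch of the final decision step described above - the function name, the prices, and the 5% premium threshold are all hypothetical, and it ignores the market-maker mechanics entirely:

```python
# Sketch of the decision rule: do A iff shares-conditional-on-A close at a premium.

def decide(price_share_given_A: float,
           price_share_given_not_A: float,
           required_premium: float = 0.05) -> bool:
    """Do A iff X-shares-conditional-on-A trade at a sufficient premium."""
    premium = (price_share_given_A - price_share_given_not_A) / price_share_given_not_A
    return premium >= required_premium

# Example closing prices (in USD conditional on the respective outcome):
print(decide(11.0, 10.0))   # 10% premium -> do A
print(decide(10.2, 10.0))   # 2% premium  -> keep the status quo
```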

(note that conditional prediction markets also have incentive issues WRT converging to the correct prices, though I'm not sure how important these are in practice)

Comment by David Johnston (david-johnston) on Uncertainty about the future does not imply that AGI will go well · 2023-06-02T00:53:44.699Z · LW · GW

I don't see how you get default failure without a model. In fact, I don’t see how you get there without the standard model, where an accident means you get a superintelligence with a random goal from an unfriendly prior - but that’s precisely the model that is being contested!

I can kiiinda see default 50-50 as "model free", though I'm not sure if I buy it.

Comment by David Johnston (david-johnston) on Is behavioral safety "solved" in non-adversarial conditions? · 2023-05-27T05:13:56.369Z · LW · GW

You raise some examples of the generator/critic gap, which I addressed. I’m not sure what I should look for in that paper - I mentioned the miscalibration of GPT4 after RLHF, that’s from the GPT4 tech report, and I don’t believe your linked paper shows anything analogous (ie that RLHFd models are less calibrated than they “should” be). I know that the two papers here investigate different notions of calibration.

“Always say true things” is a much higher standard than “don’t do anything obviously bad”. Hallucination is obviously a violation of the first, and it might be a violation of the second - but I just don’t think it’s obvious!

Sure, but the "lying" probably stems from the fact that to get the thumbs up from RLHF you just have to make up a believable answer (because the process AFAIK didn't involve actual experts in various fields fact checking every tiny bit). If just a handful of "wrong but believable" examples sneak in the reward modelling phase you get a model that thinks that sometimes lying is what humans want (and without getting too edgy, this is totally true for politically charged questions!). "Lying" could well be the better policy! I am not claiming that GPT is maliciously lying, but in AI safety, malice is never really needed or even considered (ok, maybe deception is malicious by definition).

One thing I'm saying is that we don't have clear evidence to support this claim.

Comment by David Johnston (david-johnston) on Is behavioral safety "solved" in non-adversarial conditions? · 2023-05-26T22:45:01.916Z · LW · GW

I don’t agree. There is a distinction between lying and being confused - when you lie, you have to know better. Offering a confused answer is in a sense bad, but with lying there’s an obviously better policy (don’t) while it’s not the case that a confused answer is always the result of a suboptimal policy. When you are confused, the right course of action sometimes results in mistakes.

AFAIK there’s no evidence of a gap between what GPT knows and what it says when it’s running in pure generative mode (though this doesn’t say much; one would have to be quite clever to demonstrate it). There is a generator/discriminator and generator/critic gap, but this is because GPT operating as a critic is simply more capable than GPT as a generator. If we compare apples to apples then there’s again no evidence I know of that RLHFd critic-GPT is holding back on things it knows.

So I don’t think hallucination makes it obvious that behavioural safety is not solved.

I do think the fact that RLHFd models are miscalibrated is evidence against RLHF solving behavioural safety, because calibration is obviously good and the base model was capable of it.

Comment by David Johnston (david-johnston) on Let’s use AI to harden human defenses against AI manipulation · 2023-05-18T09:35:30.181Z · LW · GW

I think this is an interesting proposal. It strikes me as something that is most likely to be useful against “scalable deception” (“misinformation”), and given the utility of scalable deception such technologies might be developed anyway. I think you do need to check if this will lead to deception technologies being developed that would not otherwise have been, and if so whether we’re actually better off knowing about them (this is analogous to one of the cases against gain-of-function research: we might be better off not knowing how to make highly enhanced viruses).

Comment by David Johnston (david-johnston) on Bayesian Networks Aren't Necessarily Causal · 2023-05-14T06:31:40.220Z · LW · GW

I have a paper (planning to get it on arxiv any day now…) which contains a result: independence of causal mechanisms (which can be related to Occam’s razor & your first point here) + precedent (“things I can do have been done before”) + variety (related to your second point - we’ve observed the phenomena in a meaningfully varied range of circumstances) + conditional independence (which OP used to construct the Bayes net) implies a conditional distribution invariant under action.

That is, speaking very loosely, if you add your considerations to OPs recipe for Bayes nets and the assumption of precedent, you can derive something kinda like interventions.

Comment by David Johnston (david-johnston) on When is Goodhart catastrophic? · 2023-05-10T23:57:32.060Z · LW · GW

Maybe it’s similar, but high U is not necessary

Comment by David Johnston (david-johnston) on Inference Speed is Not Unbounded · 2023-05-09T13:15:33.967Z · LW · GW

Thanks for explaining the way to do exhaustive search - a big network can exhaustively search smaller network configurations. I believe that.

However, a CPU is not Turing complete (what is Turing universal?) - a CPU with an infinite read/write tape is Turing complete. This matters, because Solomonoff induction is a mixture of Turing machines. There are simple functions transformers can’t learn, such as “print the binary representation of the input + 1”; they run out of room. Solomonoff induction is not limited in this way.
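
To illustrate the "runs out of room" point with the comment's own example: the output of "input + 1 in binary" grows without bound in the input, while any fixed context/output budget is bounded (the 16-token budget below is an invented stand-in):

```python
# The binary representation of n + 1 needs roughly log2(n) digits, so its length is
# unbounded in the input, while any fixed architecture has a fixed output budget.

def binary_successor(n: int) -> str:
    return bin(n + 1)[2:]

max_output_tokens = 16  # stand-in for a fixed context/output budget

for n in [5, 2**15, 2**40]:
    out = binary_successor(n)
    print(f"n={n}: needs {len(out)} digits, fits in budget: {len(out) <= max_output_tokens}")
```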

Practical transformers are also usually (always?) used with exchangeable sequences, while Solomonoff inductors operate on general sequences. I can imagine ways around this (use a RNN and many epochs with a single sequence) so maybe not a fundamental limit, but still a big difference between neural nets in practice and Solomonoff inductors.

Comment by David Johnston (david-johnston) on When is Goodhart catastrophic? · 2023-05-09T12:29:42.081Z · LW · GW

I think there is an additional effect related to "optimization is not conditioning" that stems from the fact that causation is not correlation. Suppose for argument's sake that people evaluate alignment research partly based on where it's come from (which the machine cannot control). Then producing good alignment research by regular standards is not enough to get high ratings. If a system manages to get good ratings anyway, then the actual papers it's producing must be quite different to typical highly rated alignment papers, because they are somehow compensating for the penalty incurred by coming from the wrong source. In such a situation, I think it would not be surprising if the previously observed relationship between ratings and quality did not continue to hold.

This is similar to "causal Goodhart" in Garrabrant's taxonomy, but I don't think it's quite identical. It's ambiguous whether ratings are being "intervened on" in this situation, and actual quality is probably going to be affected somewhat. I could see it as a generalised version of causal Goodhart, where intervening on the proxy is what happens when this effect is particularly extreme.
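
A toy simulation of the effect I'm describing - the rating model, the penalty size, and the threshold are all invented. When ratings include a source penalty the system can't control, the generations that nonetheless clear a rating threshold look systematically different from typical highly rated ones:

```python
import numpy as np

# Toy model: rating = quality + noise, minus a fixed penalty for the "wrong" source.
rng = np.random.default_rng(0)
n = 100_000
quality = rng.normal(0, 1, n)
noise = rng.normal(0, 1, n)
source_penalty = 2.0

rating_typical = quality + noise                   # papers from favoured sources
rating_machine = quality - source_penalty + noise  # papers from the penalised source

threshold = 2.0
print("mean quality | high rating, favoured source: ",
      quality[rating_typical > threshold].mean().round(2))
print("mean quality | high rating, penalised source:",
      quality[rating_machine > threshold].mean().round(2))
# The penalised-source generations that clear the threshold have to be unusual to
# compensate, so the previously observed rating-quality relationship shifts.
```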

Comment by David Johnston (david-johnston) on Inference Speed is Not Unbounded · 2023-05-09T01:56:31.689Z · LW · GW

they can obviously encode a binary circuit equivalent to a CPU

A CPU by itself is not universal. Are you saying memory augmented neural networks are practically close to universality?

as long as you have enough data (or can generate it) - big overcomplete NNs with SGD can obviously perform a strict improvement over exhaustive search

Sorry, I'm being slow here:

  • Solomonoff does exhaustive search for any amount of data; is part of your claim that as data -> infinity, NN + SGD -> Solomonoff?
  • How do we actually do this improved exhaustive search? Do we know that SGD gets us to a global minimum in the end?
Comment by David Johnston (david-johnston) on Inference Speed is Not Unbounded · 2023-05-09T01:27:56.017Z · LW · GW

Neural networks being universal approximators doesn't mean they do as well at distributing uncertainty as Solomonoff, right (I'm not entirely sure about this)? Also, are practical neural nets actually close to being universal?

in the worst case you can recover exhaustive exploration ala solomonoff

Do you mean that this is possible in principle, or that this is a limit of SGD training?

known perhaps experimentally in the sense that the research community has now conducted large-scale extensive (and even often automated) exploration of much of the entire space of higher order corrections to SGD

I read your original claim as "SGD is known to approximate full Bayesian inference, and the gap between SGD and full inference is known to be small". Experimental evidence that SGD performs competitively does not substantiate that claim, in my view.

Comment by David Johnston (david-johnston) on Inference Speed is Not Unbounded · 2023-05-08T23:57:19.522Z · LW · GW

Do you have a link to a more in-depth defense of this claim?

Comment by David Johnston (david-johnston) on An Impossibility Proof Relevant to the Shutdown Problem and Corrigibility · 2023-05-03T08:30:06.100Z · LW · GW

I’m not convinced the indifference conditions are desirable. Shutdown can be evidence of low utility

Comment by David Johnston (david-johnston) on Hell is Game Theory Folk Theorems · 2023-05-01T12:56:34.272Z · LW · GW

I can see why feasibility + individual rationality makes a payoff profile more likely than any profile missing one of these conditions, but I can’t see why I should consider every profile satisfying these conditions as likely enough to be worth worrying about

Comment by David Johnston (david-johnston) on AI doom from an LLM-plateau-ist perspective · 2023-04-27T21:57:34.748Z · LW · GW

Why? The biggest problem in my mind is algorithmic progress. If we’re outside (C), then the “critical path to TAI” right now is algorithmic progress

Given that outside C approaches to AGI are likely to be substantially unlike anything we’re familiar with, and that controllable AGI is desirable, don’t you think that there’s a good chance these unknown algorithms have favourable control properties?

I think LLMs have some nice control properties too - I'm not so much arguing against LLMs being better than unknown algorithms as against the idea that we should confidently expect control to be hard for unknown algorithms.

Comment by David Johnston (david-johnston) on grey goo is unlikely · 2023-04-17T22:12:16.879Z · LW · GW

One of the contentions of this post is that life has thoroughly explored the space of nanotech possibilities. This hypothesis makes the failures of novel nanotech proposals non independent. That said, I don’t think the post offers enough evidence to be highly confident in this proposition (the author might privately know enough to be more confident, but if so it’s not all in the post).

Separately, I can see myself thinking, when all is said and done, that Yudkowsky and Drexler are less reliable about nanotech than I previously thought (which was a modest level of reliability to begin with), even if there are some possibilities for novel nanotech missed or dismissed by this post. Though I think not everything has been said yet.

Comment by David Johnston (david-johnston) on GPTs are Predictors, not Imitators · 2023-04-10T00:45:54.169Z · LW · GW

I was just trying to clarify the limits of autoregressive vs other learning methods. Autoregressive learning is at an apparent disadvantage if the later item in a pair is hard to compute from the earlier one while the reverse is easy and low entropy. It can “make up for this” somewhat if it can do a good job of predicting the earlier item from the preceding context, but it’s still at a disadvantage if, for example, that’s relatively high entropy compared to predicting the earlier item from the later one. That’s it, I’m satisfied.