Posts

Why Large Bureaucratic Organizations? 2024-08-27T18:30:07.422Z
... Wait, our models of semantics should inform fluid mechanics?!? 2024-08-26T16:38:53.924Z
Interoperable High Level Structures: Early Thoughts on Adjectives 2024-08-22T21:12:38.223Z
A Robust Natural Latent Over A Mixed Distribution Is Natural Over The Distributions Which Were Mixed 2024-08-22T19:19:28.940Z
What is "True Love"? 2024-08-18T16:05:47.358Z
Some Unorthodox Ways To Achieve High GDP Growth 2024-08-08T18:58:56.046Z
A Simple Toy Coherence Theorem 2024-08-02T17:47:50.642Z
A Solomonoff Inductor Walks Into a Bar: Schelling Points for Communication 2024-07-26T00:33:42.000Z
(Approximately) Deterministic Natural Latents 2024-07-19T23:02:12.306Z
Dialogue on What It Means For Something to Have A Function/Purpose 2024-07-15T16:28:56.609Z
3C's: A Recipe For Mathing Concepts 2024-07-03T01:06:11.944Z
Corrigibility = Tool-ness? 2024-06-28T01:19:48.883Z
What is a Tool? 2024-06-25T23:40:07.483Z
Towards a Less Bullshit Model of Semantics 2024-06-17T15:51:06.060Z
My AI Model Delta Compared To Christiano 2024-06-12T18:19:44.768Z
My AI Model Delta Compared To Yudkowsky 2024-06-10T16:12:53.179Z
Natural Latents Are Not Robust To Tiny Mixtures 2024-06-07T18:53:36.643Z
Calculating Natural Latents via Resampling 2024-06-06T00:37:42.127Z
Value Claims (In Particular) Are Usually Bullshit 2024-05-30T06:26:21.151Z
When Are Circular Definitions A Problem? 2024-05-28T20:00:23.408Z
Why Care About Natural Latents? 2024-05-09T23:14:30.626Z
Some Experiments I'd Like Someone To Try With An Amnestic 2024-05-04T22:04:19.692Z
Examples of Highly Counterfactual Discoveries? 2024-04-23T22:19:19.399Z
Why Would Belief-States Have A Fractal Structure, And Why Would That Matter For Interpretability? An Explainer 2024-04-18T00:27:43.451Z
Generalized Stat Mech: The Boltzmann Approach 2024-04-12T17:47:31.880Z
How We Picture Bayesian Agents 2024-04-08T18:12:48.595Z
Coherence of Caches and Agents 2024-04-01T23:04:31.320Z
Natural Latents: The Concepts 2024-03-20T18:21:19.878Z
The Worst Form Of Government (Except For Everything Else We've Tried) 2024-03-17T18:11:38.374Z
The Parable Of The Fallen Pendulum - Part 2 2024-03-12T21:41:30.180Z
The Parable Of The Fallen Pendulum - Part 1 2024-03-01T00:25:00.111Z
Leading The Parade 2024-01-31T22:39:56.499Z
A Shutdown Problem Proposal 2024-01-21T18:12:48.664Z
Some Vacation Photos 2024-01-04T17:15:01.187Z
Apologizing is a Core Rationalist Skill 2024-01-02T17:47:35.950Z
The Plan - 2023 Version 2023-12-29T23:34:19.651Z
Natural Latents: The Math 2023-12-27T19:03:01.923Z
Talk: "AI Would Be A Lot Less Alarming If We Understood Agents" 2023-12-17T23:46:32.814Z
Principles For Product Liability (With Application To AI) 2023-12-10T21:27:41.403Z
What I Would Do If I Were Working On AI Governance 2023-12-08T06:43:42.565Z
On Trust 2023-12-06T19:19:07.680Z
Dialogue on the Claim: "OpenAI's Firing of Sam Altman (And Shortly-Subsequent Events) On Net Reduced Existential Risk From AGI" 2023-11-21T17:39:17.828Z
On the lethality of biased human reward ratings 2023-11-17T18:59:02.303Z
Some Rules for an Algebra of Bayes Nets 2023-11-16T23:53:11.650Z
Symbol/Referent Confusions in Language Model Alignment Experiments 2023-10-26T19:49:00.718Z
What's Hard About The Shutdown Problem 2023-10-20T21:13:27.624Z
Trying to understand John Wentworth's research agenda 2023-10-20T00:05:40.929Z
Bids To Defer On Value Judgements 2023-09-29T17:07:25.834Z
Inside Views, Impostor Syndrome, and the Great LARP 2023-09-25T16:08:17.040Z
Atoms to Agents Proto-Lectures 2023-09-22T06:22:05.456Z

Comments

Comment by johnswentworth on Why Large Bureaucratic Organizations? · 2024-08-30T01:45:03.059Z · LW · GW

But there are many counter examples of this not being a real concept. See here for many of them: https://www.thediff.co/archive/bullshit-jobs-is-a-terrible-curiosity-killing-concept/

That link has lots of argument against Graeber's particular models and methodology, but doesn't actually seem to argue that much against bullshit jobs as a concept. Indeed, in various places it explicitly endorses, to some extent, the sort of model used in this post (e.g. it explicitly calls out corporate empire-building as a thing which actually happens). For instance, this example:

The fake job in question is basically a contribution to that glamour: a receptionist who doesn't have much work to do. But this could end up being a money-saving proposition if the company is able to attract workers, and pay them less, by treating the presence of an assistant as a perk.

Comment by johnswentworth on ... Wait, our models of semantics should inform fluid mechanics?!? · 2024-08-28T16:51:14.983Z · LW · GW

First things first:

My current working model of the essential "details AND limits" of human mental existence puts a lot of practical weight and interest on valproic acid because of the paper "Valproate reopens critical-period learning of absolute pitch".

This is fascinating and I would love to hear about anything else you know of a similar flavor.

As for the meat of the comment...

I think this comment didn't really get at the main claim from the post. The key distinction I think it's maybe missing is between:

  • Concepts which no humans have assigned words/phrases to, vs
  • Types of concepts which no humans have assigned a type of word/phrase to

So for instance, nemawashi is a concept which doesn't have a word in English, but it's of a type which is present in English - i.e. it's a pretty ordinary verb, works pretty much like other verbs, if imported into English it could be treated grammatically like a verb without any issues, etc.

I do like your hypothesis that there are concepts which humans motivatedly-avoid giving words to, but that hypothesis is largely orthogonal to the question of whether there are whole types of concepts which don't have corresponding word/phrase types, e.g. a concept which would require not just new words but whole new grammatical rules in order to use in language.

Ithkuil, on the other hand, sounds like it maybe could have some evidence of whole different types of concepts.

Comment by johnswentworth on Why Large Bureaucratic Organizations? · 2024-08-28T15:44:28.169Z · LW · GW

... is that why this post has had unusually many downvotes? Goddammit, I was just trying to convey how and why I found the question interesting and the phenomenon confusing. Heck, I'm not even necessarily claiming the Wentworld equilibrium would be better overall.

Comment by johnswentworth on Why Large Bureaucratic Organizations? · 2024-08-28T15:41:29.727Z · LW · GW

The main testable-in-principle predictions are that economic profits causally drive hiring in large orgs (as opposed to hiring causing economic profits), and that orgs tend to expand until all the economic profit is eaten up (as opposed to expanding until marginal cost of a hire exceeds marginal revenue/savings from a hire). Actually checking those hypotheses statistically would be a pretty involved project; subtle details of accounting tend to end up relevant to this sort of thing, and the causality checks are nontrivial. But it's the sort of thing economists have tools to test.

Comment by johnswentworth on Why Large Bureaucratic Organizations? · 2024-08-28T05:44:14.306Z · LW · GW

It did happen in Wentworld; the resulting corporate structure just doesn't look suspiciously hierarchical, and the corporate culture doesn't look suspiciously heavy on dominance/submission.

Hard to know the full story of what it would look like instead, but I'd guess the nominal duties of Earth!management would be replaced with a lot more reliance on people specialized in horizontal coordination/communication rather than hierarchical command & control, plus a lot more paying for results rather than flat salary (though that introduces its own set of problems, which Wentworlders would see as one of the usual main challenges of scaling a company).

Comment by johnswentworth on Secular interpretations of core perennialist claims · 2024-08-26T15:57:43.630Z · LW · GW

A big problem with this post is that I don't have a clear idea of what "tanha" is/isn't, so I can't really tell how broad various claims are. With that in mind, I want to lay out the closest sane-sounding interpretation I see of that section, and hopefully get feedback on what that interpretation does/doesn't capture about the points you're trying to make.

Jaynes talks about the "mind projection fallacy", in which people interpret subjective aspects of their own models as properties of the world. An example: people interpret their own lack of knowledge/understanding about a phenomenon as the phenomenon itself being inherently mysterious or irreducibly complex. I think mind projection especially happens with value judgements - i.e. people treat "goodness" or "badness" as properties of things out in the world.

Cognitively speaking, treating value as a property of stuff in the world can be useful for planning: if I notice that e.g. one extra counterfactual gallon of milk would be high-value (where the counterfactual intuitively says "all else equal"), then I go look for plans which get me that extra gallon of milk, and I can factor that search apart from much of the rest of my planning-process. But the flip side of assigning value to counterfactuals over stuff-in-the-world is fabricated options: I do not actually have the ability to make a gallon of milk magically appear before me without doing anything else. That's a fabricated option, useful as an intermediate cognitive step in planning, but not a real option actually available to me. The only things a real plan can counterfact over are my own actions, and only insofar as those actions are within my realistic possibility space.

Your section on "tanha" sounds roughly like projecting value into the world, and then mentally latching on to an attractive high-value fabricated option.

How well does that capture the thing you're trying to point to?

Comment by johnswentworth on Coherence of Caches and Agents · 2024-08-25T16:34:14.377Z · LW · GW

"Reward function" is a much more general term, which IMO has been overused to the point where it arguably doesn't even have a clear meaning. "Utility function" is less general: it always connotes an optimization objective, something which is being optimized for directly. And that basically matches the usage here.

Comment by johnswentworth on If we solve alignment, do we die anyway? · 2024-08-23T13:54:36.139Z · LW · GW
  • If takeoff is slow-ish, a pivotal act (preventing more AGIs from being developed) will be difficult.
  • If no pivotal act is performed, RSI-capable AGI proliferates. This creates an n-way non-iterated Prisoner's Dilemma where the first to attack, wins.

These two points seem to be in direct conflict. The sorts of capabilities and winner-take-all underlying dynamics which would make "the first to attack wins" true are also exactly the sorts of capabilities and winner-take-all dynamics which would make a pivotal act tractable.

Or, to put it differently: the first "attack" (though it might not look very "attack"-like) is the pivotal act; if the first attack wins, that means the pivotal act worked, and therefore wasn't that difficult. Conversely, if a pivotal act is too hard, then even if an AI attacks first and wins, it has no ability to prevent new AI from being built and displacing it; if it did have that ability, then the attack would be a pivotal act.

Comment by johnswentworth on A Robust Natural Latent Over A Mixed Distribution Is Natural Over The Distributions Which Were Mixed · 2024-08-22T23:34:56.202Z · LW · GW

Nailed it, well done.

Comment by johnswentworth on Towards a Less Bullshit Model of Semantics · 2024-08-14T16:48:18.796Z · LW · GW

Yeah, this is an open problem that's on my radar. I currently have two main potential threads on it.

First thread: treat each bit in the representation of quantities as distinct random variables, so that e.g. the higher-order and lower-order bits are separate. Then presumably there will often be good approximate natural latents (and higher-level abstract structures) over the higher-order bits, moreso than the lower-order bits. I would say this is the most obvious starting point, but it also has a major drawback: "bits" of a binary number representation are an extremely artificial ontological choice for purposes of this problem. I'd strongly prefer an approach in which magnitudes drop out more naturally.

Thus the second thread: maxent. It continues to seem like there's probably a natural way to view natural latents in a maxent form, which would involve numerically-valued natural "features" that get added together. That would provide a much less artificial notion of magnitude. However, it requires figuring out the maxent thing for natural latents, which I've tried and failed at several times now (though with progress each time).

Comment by johnswentworth on Being the (Pareto) Best in the World · 2024-08-13T18:29:54.880Z · LW · GW

Ooh, fun. Ever tried fitting a stochastic general equilibrium model to data on protein expression dynamics or metabolites or anything like that?

Comment by johnswentworth on Computational irreducibility challenges the simulation hypothesis · 2024-08-12T16:21:21.747Z · LW · GW

I buy that, insofar as the use-case for simulation actually requires predicting the full state of chaotic systems far into the future. But our actual use-cases for simulation don't generally require that. For instance, presumably there is ample incentive to simulate turbulent fluid dynamics inside a jet engine, even though the tiny eddies realized in any run of the physical engine will not exactly match the tiny eddies realized in any run of the simulated engine. For engineering applications, sampling from the distribution is usually fine.

From a theoretical perspective: the reason samples are usually fine for engineering purposes is that we want our designs to work consistently. If a design fails one time in n, then with very high probability it only takes O(n) random samples to find a case where the design fails, and that provides the feedback needed from the simulation.
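To make the O(n) claim concrete, here's a minimal simulation sketch (the 1-in-1000 failure rate and the function name are just illustrative assumptions, not from any particular design problem):

```python
import random

def find_failure(fail_prob, max_samples):
    """Randomly test the design until a failure is observed.

    Returns the number of samples needed, or None if no failure was found.
    Illustrative only: a "failure" is just a Bernoulli draw at the given rate.
    """
    for i in range(1, max_samples + 1):
        if random.random() < fail_prob:
            return i
    return None

# If a design fails one time in n, O(n) random samples almost always surface a failure:
n = 1000
trials = [find_failure(1 / n, 10 * n) for _ in range(200)]
found = [t for t in trials if t is not None]
print(f"failure found in {len(found)}/200 runs; "
      f"median samples needed ~ {sorted(found)[len(found) // 2]}")
```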

More generally, insofar as a system is chaotic and therefore dependent on quantum randomness, the distribution is in fact the main thing I want to know, and I can get a reasonable look at the distribution by sampling from it a few times.

Comment by johnswentworth on Computational irreducibility challenges the simulation hypothesis · 2024-08-11T20:58:26.210Z · LW · GW

... the computational irreducibility of most existing phenomena...

This part strikes me as the main weak point of the argument. Even if "most" computations, for some sense of "most", are irreducible, the extremely vast majority of physical phenomena in our own universe are extremely reducible computationally, especially if we're willing to randomly sample a trajectory when the system is chaotic.

Just looking around my room right now:

  • The large majority of the room's contents are solid objects just sitting in place (relative to Earth's surface), with some random thermal vibrations which a simulation would presumably sample.
  • Then there's the air, which should be efficiently simulable with an adaptive grid method.
  • There's my computer, which can of course be modeled very well as embedding a certain abstract computational machine, and when something in the environment violates that model a simulation could switch over to the lower level.
  • Of course the most complicated thing in the room is probably me, and I would require a whole complicated stack of software to simulate efficiently. But even with today's relatively primitive simulation technology, multiscale modeling is a thriving topic of research: the dream is to e.g. use molecular dynamics to find reaction kinetics, then reaction kinetics at the cell-scale to find evolution rules for cells and signalling states, then cell and signalling evolution rules to simulate whole organs efficiently, etc.

Comment by johnswentworth on Some Unorthodox Ways To Achieve High GDP Growth · 2024-08-09T05:58:42.229Z · LW · GW

Using prices from a constant reference year, i.e. the way GDP used to be calculated, achieves loop-invariance. We kicked around some other ideas after figuring this out, but didn't figure out any which seemed practical, and also didn't disprove the possibility.

Comment by johnswentworth on Does VETLM solve AI superalignment? · 2024-08-08T19:39:19.954Z · LW · GW

Well, we have lots of implementable proposals. What do VELM and VETLM offer which those other implementable proposals don't? And what problems do VELM and VETLM not solve?

Alternatively: what's the combination of problems which these solutions solve, which nothing else we've thought of simultaneously solves?

Comment by johnswentworth on Does VETLM solve AI superalignment? · 2024-08-08T19:23:25.311Z · LW · GW

It's not really about whether the specific proposal is novel, it's about whether the proposal handles the known barriers which are most difficult for other proposals. New proposals are useful mainly insofar as they overcome some subset of barriers which stopped other solutions.

For instance, if you read through Eliezer's List O' Doom and find that your proposal handles items on that list which no other proposal has ever handled, or a combination which no other proposal has simultaneously handled, then that's a big deal. On the other hand, if your solution falls prey to the same subset of problems as most solutions, then that's not so useful.

Comment by johnswentworth on Rant on Problem Factorization for Alignment · 2024-08-07T14:42:58.352Z · LW · GW

The most recent thing I've seen on the topic is this post from yesterday on debate, which found that debate does basically nothing. In fairness there have also been some nominally-positive studies (which the linked post also mentions), though IMO their setup is more artificial and their effect sizes are not very compelling anyway.

My qualitative impression is that HCH/debate/etc have dropped somewhat in relative excitement as alignment strategies over the past year or so, more so than I expected. People have noticed the unimpressive results to some extent, and also other topics (e.g. mechinterp, SAEs) have gained a lot of excitement. That said, I do still get the impression that there's a steady stream of newcomers getting interested in it.

Comment by johnswentworth on Value fragility and AI takeover · 2024-08-07T03:29:02.405Z · LW · GW

I think section 3 still mostly stands, but the arguments to get there change mildly. Section 4 changes a lot more: the distinction between "A's values, according to A" vs "A's values, according to B" becomes crucial - i.e. A may have a very different idea than B of what it means for A's values to be satisfied in extreme out-of-distribution contexts. In the hard version of the problem, there isn't any clear privileged notion of what "A's values, according to A" would even mean far out-of-distribution.

Comment by johnswentworth on Value fragility and AI takeover · 2024-08-06T01:01:57.554Z · LW · GW

A key point (from here) about value fragility which I think this post is importantly missing: Goodhart problems are about generalization, not approximation.

Suppose I have a proxy v for a true utility function u, and v is always within ε of u (i.e. |u(x) - v(x)| ≤ ε for all x). I maximize v. Then the true utility u achieved will be within 2ε of the maximum achievable utility. Reasoning: in the worst case, v is ε lower than u at the u-maximizing point, and ε higher than u at the v-maximizing point.

Point is: if a proxy is close to the true utility function everywhere, then we will indeed achieve close-to-maximal utility upon maximizing the proxy. Goodhart problems require the proxy to not even be approximately close, in at least some places.
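Here's a quick numerical sanity check of that 2ε bound, as a sketch with an arbitrary randomly-generated toy utility function (nothing here comes from the original post):

```python
import random

random.seed(0)
xs = range(100)
u = {x: random.uniform(0, 10) for x in xs}             # "true" utility (toy)
eps = 0.5
v = {x: u[x] + random.uniform(-eps, eps) for x in xs}  # proxy within eps of u everywhere

x_v = max(xs, key=lambda x: v[x])   # point chosen by maximizing the proxy
best_u = max(u.values())            # best achievable true utility
shortfall = best_u - u[x_v]

print(f"shortfall = {shortfall:.3f}, bound 2*eps = {2 * eps:.3f}")
assert shortfall <= 2 * eps + 1e-9  # achieved utility is within 2*eps of optimal
```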

When we look at real-world Goodhart problems, they indeed involve situations where some approximation only works well within some region, and ceases to even be a good approximation once we move well outside that region. That's a generalization problem, not an approximation problem.

So approximations are fine, so long as they generalize well.

Comment by johnswentworth on What are your cruxes for imprecise probabilities / decision rules? · 2024-08-05T20:41:03.249Z · LW · GW

Right, so there's room here for a burden-of-proof disagreement - i.e. you find it unlikely on priors that a single distribution can accurately capture realistic states-of-knowledge, I don't find it unlikely on priors.

If we've arrived at a burden-of-proof disagreement, then I'd say that's sufficient to back up my answer at top-of-thread:

both imprecise probabilities and maximality seem like ad-hoc, unmotivated methods which add complexity to Bayesian reasoning for no particularly compelling reason.

I said I don't know of any compelling reason - i.e. positive argument, beyond just "this seems unlikely to Anthony and some other people on priors" - to add this extra piece to Bayesian reasoning. And indeed, I still don't. Which does not mean that I necessarily expect you to be convinced that we don't need that extra piece; I haven't spelled out a positive argument here either.

Comment by johnswentworth on What are your cruxes for imprecise probabilities / decision rules? · 2024-08-05T17:01:36.948Z · LW · GW

"No single prior seems to accurately represent our actual state of knowledge/ignorance" is a really ridiculously strong claim, and one which should be provable/disprovable by starting from some qualitative observations about the state of knowledge/ignorance in question. But I've never seen someone advocate for imprecise probabilities by actually making that case.

Let me illustrate a bit how I imagine this would go, and how strong a case would need to be made.

Let's take the simple example of a biased coin with unknown bias. A strawman imprecise-probabilist might argue something like: "If the coin has probability θ of landing heads, then after N flips (for some large-ish N) I expect to see roughly θN (plus or minus roughly √N) heads. But for any particular number θ, that's not actually what I expect a-priori, because I don't know which θ is right - e.g. I don't actually confidently expect to see roughly θN heads a priori. Therefore no distribution can represent my state of knowledge.".

... and then the obvious Bayesian response would be: "Sure, if you're artificially restricting your space of distributions/probabilistic models to IID distributions of coin flips. But our actual prior is not in that space; our actual prior involves a latent variable (the bias), and the coin flips are not independent if we don't know the bias (since seeing one outcome tells us something about the bias, which in turn tells us something about the other coin flips). We can represent our prior state of knowledge in this problem just fine with a distribution over the bias.".
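As a toy sketch of that latent-variable point (assuming a uniform prior over the bias purely for concreteness): with a prior over the bias, the flips are exchangeable but not independent, so seeing one flip shifts the prediction for the next.

```python
import numpy as np

# Prior over the unknown bias theta (uniform here, purely for illustration).
thetas = np.linspace(0.01, 0.99, 99)
prior = np.ones_like(thetas) / len(thetas)

# Marginal probability that flip 2 is heads, before seeing anything:
p_h2 = np.sum(prior * thetas)

# Posterior over theta after observing flip 1 = heads, then predict flip 2:
posterior = prior * thetas
posterior /= posterior.sum()
p_h2_given_h1 = np.sum(posterior * thetas)

print(f"P(flip2=H)           = {p_h2:.3f}")           # ~0.5
print(f"P(flip2=H | flip1=H) = {p_h2_given_h1:.3f}")  # ~0.667: flips are not independent
```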

Now, the imprecise probabilist could perhaps argue against that by pointing out some other properties of our state of knowledge, and then arguing that no distribution can represent our prior state of knowledge over all the coin flips, no matter how much we introduce latent variables. But that's a much stronger claim, a much harder case to make, and I have no idea what properties of our state of knowledge one would even start from in order to argue for it. On the other hand, I do know of various sets of properties of our state-of-knowledge which are sufficient to conclude that it can be accurately represented by a single prior distribution - e.g. the preconditions of Cox' Theorem, or the preconditions for the Dutch Book theorems (if our hypothetical agent is willing to make bets on its priors).

Comment by johnswentworth on A Simple Toy Coherence Theorem · 2024-08-05T16:46:50.562Z · LW · GW

Obviously implicitly assuming we're restricting to some interesting set of resources, because otherwise we can reframe any myopic maximizer as long-term and vice-versa.

This part I think is false. The theorem in this post does not need any notion of resources, and neither does Utility Maximization = Description Length Minimization. We do need a notion of spacetime (in order to talk about stuff far away in space/time), but that's a much weaker ontological assumption.

Comment by johnswentworth on A Simple Toy Coherence Theorem · 2024-08-04T18:31:19.734Z · LW · GW

In terms of the OP toy model, I think the OP omitted another condition under which the coherence theorem is trivial / doesn’t apply, which is that you always start the MDP in the same place and the MDP graph is a directed tree or directed forest. (i.e., there are no cycles even if you ignore the arrow-heads … I’m hope I’m getting the graph theory terminology right). In those cases, for any possible end-state, there’s at most one way to get from the start to the end-state; and conversely, for any possible path through the MDP, that’s the path that would result from wanting to get to that end-state. Therefore, you can rationalize any path through the MDP as the optimal way to get to whatever end-state it actually gets to. Right?

Technically correct.

I'd emphasize here that this toy theorem is assuming an MDP, which specifically means that the "agent" must be able to observe the entire state at every timestep. If you start thinking about low-level physics and microscopic reversibility, then the entire state is definitely not observable by real agents. In order to properly handle that sort of thing, we'd mostly need to add uncertainty, i.e. shift to POMDPs.

Comment by johnswentworth on A Simple Toy Coherence Theorem · 2024-08-04T18:26:30.623Z · LW · GW

Well sure, you can model anything as a utility maximiser technically, but the resource w.r.t which it's being optimal/the way its preferences are carving up state-space will be incredibly awkward/garbled/unnatural (in the extreme, they could just be utility-maximizing over entire universe-histories). But these are unnatural/trivial. If we add constraints over the kind of resources it's caring about/kinds of outcomes it can have preferences over, we constrain the set of what can be a utility-maximiser a lot. And if we constrain it to smth like the set of resources that we think in terms of, the resulting set of possible utility-maximisers do look scary.

I would guess that response is memetically largely downstream of my own old take. It's not wrong, and it's pretty easy to argue that future systems will in fact behave efficiently with respect to the resources we care about: we'll design/train the system to behave efficiently with respect to those resources precisely because we care about those resources and resource-usage is very legible/measurable. But over the past year or so I've moved away from that frame, and part of the point of this post is to emphasize the frame I usually use now instead.

In that new frame, here's what I would say instead: "Well sure, you can model anything as a utility maximizer technically, but usually any utility function compatible with the system's behavior is very myopic - it mostly just cares about some details of the world "close to" (in time/space) the system itself, and doesn't involve much optimization pressure against most of the world. If a system is to apply much optimization pressure to parts of the world far away from itself - like e.g. make & execute long-term plans - then the system must be a(n approximate) utility maximizer in a much less trivial sense. It must behave like it's maximizing a utility function specifically over stuff far away."

(... actually that's not a thing I'd say, because right from the start I would have said that I'm using utility maximization mainly because it makes it easy to illustrate various problems. Those problems usually remain even when we don't assume utility maximization, they're just a lot less legible without a mathematical framework. But, y'know, for purposes of this discussion...)

Also on the actual theorem you outline here - it looks right, but isn't assuming utilities assigned to outcomes s.t. the agent is trying to maximise over them kind of begging most of the question that coherence theorems are after?

In my head, an important complement to this post is Utility Maximization = Description Length Minimization, which basically argues that "optimization" in the usual Flint/Yudkowsky sense is synonymous with optimizing some utility function over the part of the world being optimized. However, that post doesn't involve an optimizer; it just talks about stuff "being optimized" in a way which may or may not involve a separate thing which "does the optimization".

This post adds the optimizer to that picture. We start from utility maximization over some "far away" stuff, in order to express optimization occurring over that far away stuff. Then we can ask "but what's being adjusted to do that optimization?", i.e. in the problem max_θ E[u(X) | θ], what's θ? And if θ is the "policy" of some system, such that the whole setup is an MDP, then we find that there's a nontrivial sense in which the system can be or not be a (long-range) utility maximizer - i.e. an optimizer.

Comment by johnswentworth on A Simple Toy Coherence Theorem · 2024-08-02T21:05:12.385Z · LW · GW

That all looks correct.

Comment by johnswentworth on A Simple Toy Coherence Theorem · 2024-08-02T19:45:17.404Z · LW · GW

You could extend it that way, and more generally you could extend it to sparse rewards. As the post says, coherence tells us about optimal policies away from the things which the goals care about directly. But in order for the theorem to say something substantive, there has to be lots of "empty space" where the incremental reward is zero. It's in the empty space where coherence has substantive things to say.

Comment by johnswentworth on Natural Latents: The Concepts · 2024-08-02T19:39:24.898Z · LW · GW

That's correct, the difference is just much worse bounds, so for 3 there only exists a natural latent to within a much worse approximation.

Comment by johnswentworth on What are your cruxes for imprecise probabilities / decision rules? · 2024-08-02T16:47:35.085Z · LW · GW

I think this quote nicely summarizes the argument you're asking about:

Not only do we not have evidence of a kind that allows us to know the total consequences of our actions, we seem often to lack evidence of a kind that warrants assigning precise probabilities to relevant states.

This, I would say, sounds like a reasonable critique if one does not really get the idea of Bayesianism. Like, if I put myself in a mindset where I'm only allowed to use probabilities when I have positive evidence which "warrants" those precise probabilities, then sure, it's a reasonable criticism. But a core idea of Bayesianism is that we use probabilities to represent our uncertainties even in the absence of evidence; that's exactly what a prior is. And the point of all the various arguments for Bayesian reasoning is that this is a sensible and consistent way to handle uncertainty, even when the available evidence is weak and we're mostly working off of priors.

As a concrete example, I think of Jaynes' discussion of the widget problem (pg 440 here): one is given some data on averages of a few variables, but not enough to back out the whole joint distribution of the variables from the data, and then various decision/inference problems are posed. This seems like exactly the sort of problem the quote is talking about. Jaynes' response to that problem is not "we lack evidence which warrants assigning precise probabilities", but rather, "we need to rely on priors, so what priors accurately represent our actual state of knowledge/ignorance?".

Point is: for a Bayesian, the point of probabilities is to accurately represent an agent's epistemic state. Whether the probabilities are "warranted by evidence" is a non sequitur.

Comment by johnswentworth on What are your cruxes for imprecise probabilities / decision rules? · 2024-07-31T20:24:39.997Z · LW · GW

A couple years ago, my answer would have been that both imprecise probabilities and maximality seem like ad-hoc, unmotivated methods which add complexity to Bayesian reasoning for no particularly compelling reason.

I was eventually convinced that they are useful and natural, specifically in the case where the environment contains an adversary (or the agent in question models the environment as containing an adversary, e.g. to obtain worst-case bounds). I now think of that use-case as the main motivation for the infra-Bayes framework, which uses imprecise probabilities and maximization as central tools. More generally, the infra-Bayes approach is probably useful for environments containing other agents.

Comment by johnswentworth on Decomposing Agency — capabilities without desires · 2024-07-31T17:41:14.472Z · LW · GW

You can show that, in order for an agent to persist, it needs to have the capacity to observe and learn about its environment. The math is a more complex than I want to get into here...

Do you have a citation for this? I went looking for the supposed math behind that claim a couple years back, and found one section of one Friston paper which had an example system which did not obviously generalize particularly well, and also used a kinda-hand-wavy notion of "Markov blanket" that didn't make it clear what precisely was being conditioned on (a critique which I would extend to all of the examples you list). And that was it; hundreds of excited citations chained back to that one spot. If anybody's written an actual explanation and/or proof somewhere, that would be great.

Comment by johnswentworth on [deleted post] 2024-07-27T17:16:12.807Z

(I endorse sunwillrise's comment as a general response to this post; it's an unusually excellent comment. This comment is just me harping on a pet peeve of mine.)

So, within the ratosphere, it's well-known that every physical object or set of objects is mathematically equivalent to some expected utility maximizer

This is a wildly misleading idea which refuses to die.

As a meme within the ratosphere, the usual source cited is this old post by Rohin, which has a section titled "All behavior can be rationalized as EU maximization". When I complained to Rohin that "All behavior can be rationalized as EU maximization" was wildly misleading, he replied:

I tried to be clear that my argument was "you need more assumptions beyond just coherence arguments on universe-histories; if you have literally no other assumptions then all behavior can be rationalized as EU maximization". I think the phrase "all behavior can be rationalized as EU maximization" or something like it was basically necessary to get across the argument that I was making. I agree that taken in isolation it is misleading; I don't really see what I could have done differently to prevent there from being something that in isolation was misleading, while still being able to point out the-thing-that-I-believe-is-fallacious. Nuance is hard.

Point is: even the guy who's usually cited on this (at least on LW) agrees it's misleading.

Why is it misleading? Because coherence arguments do, in fact, involve a notion of "utility maximization" narrower than just a system's behavior maximizing some function of universe-trajectory. There are substantive notions of "utility maximizer", those notions are a decent match to our intuitions in many ways, and they involve more than just behavior maximizing some function of universe-trajectory. When we talk about "utility maximizers" in a substantive sense, we're talking about a phenomenon which is narrower than behavior maximizing some function of universe-trajectory.

If you want to see a notion of "utility maximizer" which is nontrivial, Coherence of Caches and Agents gives IMO a pretty illustrative and simple example.

Comment by johnswentworth on johnswentworth's Shortform · 2024-07-23T06:13:00.133Z · LW · GW

+1, and even for those who do buy extinction risk to some degree, financial/status incentives usually have more day-to-day influence on behavior.

Comment by johnswentworth on johnswentworth's Shortform · 2024-07-23T06:11:44.803Z · LW · GW

Good argument, I find this at least somewhat convincing. Though it depends on whether penalty (1), the one capped at 10%/30% of training compute cost, would be applied more than once on the same model if the violation isn't remedied.

Comment by johnswentworth on johnswentworth's Shortform · 2024-07-23T03:18:29.244Z · LW · GW

So I read SB1047.

My main takeaway: the bill is mostly a recipe for regulatory capture, and that's basically unavoidable using anything even remotely similar to the structure of this bill. (To be clear, regulatory capture is not necessarily a bad thing on net in this case.)

During the first few years after the bill goes into effect, companies affected are supposed to write and then implement a plan to address various risks. What happens if the company just writes and implements a plan which sounds vaguely good but will not, in fact, address the various risks? Probably nothing. Or, worse, those symbolic-gesture plans will become the new standard going forward.

In order to avoid this problem, someone at some point would need to (a) have the technical knowledge to evaluate how well the plans actually address the various risks, and (b) have the incentive to actually do so.

Which brings us to the real underlying problem here: there is basically no legible category of person who has the requisite technical knowledge and also the financial/status incentive to evaluate those plans for real.

(The same problem also applies to the board of the new regulatory body, once past the first few years.)

Having noticed that problem as a major bottleneck to useful legislation, I'm now a lot more interested in legal approaches to AI X-risk which focus on catastrophe insurance. That would create a group - the insurers - who are strongly incentivized to acquire the requisite technical skills and then make plans/requirements which actually address some risks.

Comment by johnswentworth on Natural Latents: The Math · 2024-07-18T17:32:26.809Z · LW · GW

So 'latents' are defined by their conditional distribution functions whose shape is implicit in the factorization that the latents need to satisfy, meaning they don't have to always look like , they can look like , etc, right?

The key idea here is that, when "choosing a latent", we're not allowed to choose P[X]; P[X] is fixed/known/given, and a latent is just a helpful tool for reasoning about or representing P[X]. So another way to phrase it is: we're choosing our whole model P[X,Λ], but with a constraint on the marginal P[X]. P[Λ|X] then captures all of the degrees of freedom we have in choosing a latent.

Now, we won't typically represent the latent explicitly as P[Λ|X]. Typically we'll choose latents such that P[X,Λ] satisfies some factorization(s), and those factorizations will provide a more compact representation of the distribution than two giant tables for P[X] and P[Λ|X]. For instance, insofar as P[X,Λ] factors as P[Λ] ∏ᵢ P[Xᵢ|Λ], we might want to represent the distribution as P[Λ] and P[Xᵢ|Λ] (both for analytic and computational purposes).

I don't get the 'standard form' business.

We've largely moved away from using the standard form anyway, I recommend ignoring it at this point.

Also is this post relevant to either of these statements, and if so, does that mean they only hold under strong redundancy?

Yup, that post proves the universal natural latent conjecture when strong redundancy holds (over 3 or more variables). Whether the conjecture does not hold when strong redundancy fails is an open question. But since the strong redundancy result, we've mostly shifted toward viewing strong redundancy as the usual condition to look for, and focused less on weak redundancy.

Resampling

Also, does all this imply that we're starting out assuming that Λ shares a probability space with all the other possible latents, e.g. Λ'? How does this square with a latent variable being defined by the CPD implicit in the factorization?

We conceptually start with the objects P[X], P[Λ|X], and P[Λ'|X]. (We're imagining here that two different agents measure the same distribution P[X], but then they each model it using their own latents.) Given only those objects, the joint distribution P[X,Λ,Λ'] is underdefined - indeed, it's unclear what such a joint distribution would even mean! Whose distribution is it?

One simple answer (unsure whether this will end up being the best way to think about it): one agent is trying to reason about the observables X, their own latent Λ, and the other agent's latent Λ' simultaneously, e.g. in order to predict whether the other agent's latent is isomorphic to Λ (as would be useful for communication).

Since Λ and Λ' are both latents, in order to define P[X,Λ,Λ'], the agent needs to specify P[Λ'|X,Λ]. That's where the underdefinition comes in: only P[Λ|X] and P[Λ'|X] were specified up-front. So, we sidestep the problem: we construct a new latent Λ'' such that P[Λ''|X] matches P[Λ'|X], but Λ'' is independent of Λ given X. Then we've specified the whole distribution P[X,Λ,Λ''].
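Here's a minimal sketch of that resampling construction for a toy discrete case (the particular value names and distributions are arbitrary placeholders made up for illustration, not anything from the post):

```python
import random

# Toy discrete setup (all of these distributions are arbitrary placeholders).
P_X = {"x0": 0.5, "x1": 0.5}
P_Lam_given_X  = {"x0": {"a": 0.9, "b": 0.1}, "x1": {"a": 0.2, "b": 0.8}}   # agent 1's latent
P_Lamp_given_X = {"x0": {"c": 0.7, "d": 0.3}, "x1": {"c": 0.4, "d": 0.6}}   # agent 2's latent

def sample(dist):
    return random.choices(list(dist), weights=list(dist.values()))[0]

def sample_joint():
    """Sample (x, lam, lam_resampled) from the resampled joint distribution.

    lam_resampled is drawn from P[Lam' | X = x] *independently* of lam, so by
    construction it has the same conditional distribution given X as agent 2's
    latent, but is independent of agent 1's latent given X.
    """
    x = sample(P_X)
    lam = sample(P_Lam_given_X[x])             # agent 1's latent Lam
    lam_resampled = sample(P_Lamp_given_X[x])  # the resampled stand-in for Lam'
    return x, lam, lam_resampled

print(sample_joint())
```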

Hopefully that clarifies what the math is, at least. It's still a bit fishy conceptually, and I'm not convinced it's the best way to handle the part it's trying to handle.

Comment by johnswentworth on AI #73: Openly Evil AI · 2024-07-18T17:08:33.304Z · LW · GW

Yeah, it's the "exchange" part which seems to be missing, not the "securities" part.

Comment by johnswentworth on AI #73: Openly Evil AI · 2024-07-18T15:34:58.595Z · LW · GW

Why does the SEC have any authority at all over OpenAI? They're not a publicly listed company, i.e. they're not listed on any securities exchange, so naively one would think a "securities exchange commission" doesn't have much to do with them.

I mean, obviously federal agencies always have scope creep, it's not actually surprising if they have some authority over OpenAI, but what specific legal foundation is the SEC on here? What is their actual scope?

Comment by johnswentworth on Natural Latents: The Math · 2024-07-17T20:43:21.651Z · LW · GW

Consider the exact version of the redundancy condition for latent Λ over X₁, X₂:

P[Λ | X₁, X₂] = P[Λ | X₁]

and

P[Λ | X₁, X₂] = P[Λ | X₂]

Combine these two and we get, for all x₁, x₂, λ:

P[X₁ = x₁, X₂ = x₂] = 0  OR  P[Λ = λ | X₁ = x₁] = P[Λ = λ | X₂ = x₂]

That's the foundation for a conceptually-simple method for finding the exact natural latent (if one exists) given a distribution P[X₁, X₂] (a rough code sketch follows the list):

  • Pick a pair of values (x₁, x₂) which has nonzero probability, and initialize a set S containing that pair. Then we must have P[Λ = λ | X₁ = x₁] = P[Λ = λ | X₂ = x₂] for all λ.
  • Loop: add to S a new pair (x₁', x₂) or (x₁, x₂') with nonzero probability, where the value x₂ or x₁ (respectively) already appears in one of the pairs in S. Then P[Λ | X₁ = x₁'] = P[Λ | X₂ = x₂] or P[Λ | X₂ = x₂'] = P[Λ | X₁ = x₁], respectively. Repeat until there are no more candidate values to add to S.
  • Pick a new pair and repeat with a new set, until all values of (X₁, X₂) have been added to a set.
  • Now take the latent to be the equivalence class in which (X₁, X₂) falls.
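Here's a rough code sketch of that procedure, assuming the joint distribution is given as a small dictionary of probabilities (the flood-fill grouping is just one way to implement the equivalence-classing; it doesn't verify that the conditionals P[Λ|X₁] and P[Λ|X₂] actually match within each class):

```python
from collections import defaultdict

def exact_natural_latent_classes(P):
    """Group values of (X1, X2) into equivalence classes connected through
    nonzero-probability pairs; each class is one value of the candidate latent.

    P: dict mapping (x1, x2) -> probability.
    Returns: dict mapping each nonzero-probability (x1, x2) pair -> class id.
    """
    # Connect X1-values and X2-values that co-occur with nonzero probability.
    adj = defaultdict(set)
    for (x1, x2), p in P.items():
        if p > 0:
            adj[("X1", x1)].add(("X2", x2))
            adj[("X2", x2)].add(("X1", x1))

    class_of, next_id = {}, 0
    for node in adj:
        if node in class_of:
            continue
        # Flood-fill one connected component = one equivalence class.
        stack = [node]
        while stack:
            n = stack.pop()
            if n in class_of:
                continue
            class_of[n] = next_id
            stack.extend(adj[n])
        next_id += 1

    return {(x1, x2): class_of[("X1", x1)]
            for (x1, x2), p in P.items() if p > 0}

# Example: two disconnected blocks -> a binary latent.
P = {(0, 0): 0.25, (0, 1): 0.25, (1, 2): 0.25, (1, 3): 0.25}
print(exact_natural_latent_classes(P))  # {(0, 0): 0, (0, 1): 0, (1, 2): 1, (1, 3): 1}
```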

Does that make sense?

Comment by johnswentworth on Dialogue on What It Means For Something to Have A Function/Purpose · 2024-07-16T01:36:16.650Z · LW · GW

Was this intended to respond to any particular point, or just a general observation?

Comment by johnswentworth on Corrigibility = Tool-ness? · 2024-07-15T19:02:11.834Z · LW · GW

My current starting point would be standard methods for decomposing optimization problems, like e.g. the sort covered in this course.

Comment by johnswentworth on Alignment: "Do what I would have wanted you to do" · 2024-07-13T16:26:30.912Z · LW · GW

No, because we have tons of information about what specific kinds of information on the internet is/isn't usually fabricated. It's not like we have no idea at all which internet content is more/less likely to be fabricated.

Information about, say, how to prove that there are infinitely many primes is probably not usually fabricated. It's standard basic material, there's lots of presentations of it, it's not the sort of thing which people usually troll about. Yes, the distribution of internet text about the infinitude of primes contains more-than-zero trolling and mistakes and the like, but that's not the typical case, so low-temperature sampling from the LLM should usually work fine for that use-case.

On the other end of the spectrum, "fusion power plant blueprints" on the internet today will obviously be fictional and/or wrong, because nobody currently knows how to build a fusion power plant which works. This generalizes to most use-cases in which we try to get an LLM to do something (using only prompting on a base model) which nobody is currently able to do. Insofar as the LLM is able to do such things, that actually reflects suboptimal next-text prediction on its part.

Comment by johnswentworth on Alignment: "Do what I would have wanted you to do" · 2024-07-12T22:06:43.933Z · LW · GW

That is not how this works. Let's walk through it for both the "human as clumps of molecules following physics" and the "LLM as next-text-on-internet predictor".

Humans as clumps of molecules following physics

Picture a human attempting to achieve some goal - for concreteness, let's say the human is trying to pick an apple from a high-up branch on an apple tree. Picture what that human does: they maybe get a ladder, or climb the tree, or whatever. They manage to pluck the apple from the tree and drop it in a basket.

Now, imagine a detailed low-level simulation of the exact same situation: that same human trying to pick that same apple. Modulo quantum noise, what happens in that simulation? What do we see when we look at its outputs? Well, it looks like a human attempting to achieve some goal: the clump of molecules which is a human gets another clump which is a ladder, or climbs the clump which is the tree, or whatever.

LLM as next-text-on-internet predictor

Now imagine finding the text "Notes From a Prompt Factory" on the internet, today (because the LLM is trained on text from ~today). Imagine what text would follow that beginning on the internet today.

The text which follows that beginning on the internet today is not, in fact, notes from a prompt factory. Instead, it's fiction about a fictional prompt factory. So that's the sort of thing we should expect a highly capable LLM to output following the prompt "Notes From a Prompt Factory": fiction. The more capable it is, the more likely it is to correctly realize that that prompt precedes fiction.

It's not a question of whether the LLM is absorbing the explicit and tacit knowledge of internet authors; I'm perfectly happy to assume that it is. The issue is that the distribution of text on today's internet which follows the prompt "Notes From a Prompt Factory" is not the distribution of text which would result from actual notes on an actual prompt factory. The highly capable LLM absorbs all that knowledge from internet authors, and then uses that knowledge to correctly predict that what follows the text "Notes From a Prompt Factory" will not be actual notes from an actual prompt factory.

Comment by johnswentworth on Alignment: "Do what I would have wanted you to do" · 2024-07-12T19:20:26.092Z · LW · GW

Let's assume a base model (i.e. not RLHF'd), since you asserted a way to turn the LM into a goal-driven chatbot via prompt engineering alone. So you put in some prompt, and somewhere in the middle of that prompt is a part which says "Do what (pre-ASI) X, having considered this carefully for a while, would have wanted you to do", for some X.

The basic problem is that this hypothetical language model will not, in fact, do what X, having considered this carefully for a while, would have wanted it to do. What it will do is output text which statistically looks like it would come after that prompt, if the prompt appeared somewhere on the internet.

Comment by johnswentworth on My AI Model Delta Compared To Yudkowsky · 2024-07-11T02:23:58.334Z · LW · GW

A lot of the particulars of humans' values are heavily reflective. Two examples:

  • A large chunk of humans' terminal values involves their emotional/experience states - happy, sad, in pain, delighted, etc.
  • Humans typically want ~terminally to have some control over their own futures.

Contrast that to e.g. a blue-minimizing robot, which just tries to minimize the amount of blue stuff in the universe. That utility function involves reflection only insofar as the robot is (or isn't) blue.

Comment by johnswentworth on On passing Complete and Honest Ideological Turing Tests (CHITTs) · 2024-07-10T15:58:42.774Z · LW · GW

You have a decent argument for the claim as literally stated here, but I think there's some wrongheaded subtext. To try to highlight it, I'll argue for another claim about the "Complete and Honest Ideological Turing Test" as you've defined it.

Suppose that an advocate of some position would in fact abandon that position if they knew all the evidence or arguments or counterarguments which I might use to argue against it (and observers correctly know this). Then, by your definition, I cannot pass their CHITT - but it's not because I've failed to understand their position, it's because they don't know the things I know.

Suppose that, for some class of evidence or argument or counterargument which I would use to argue against the position, an advocate of that position simply has no response which would not make them look irrational (and observers correctly know this). Then again, by your definition, I cannot pass their CHITT - but again, it's not because I've failed to understand their position, it's because they in fact do not have responses which don't make them look irrational.

The claim these two point toward: as defined, sometimes it is impossible to pass someone's CHITT not because I don't understand their position, but because... well, they're some combination of ignorant and an idiot, and I know where the gaps in their argument are. This is in contrast to the original ITT, which was intended to just check whether I've actually understood somebody else's position.

Making the subtext explicit: it seems like this post is trying to push a worldview in which nobody is ever Just Plain Wrong or Clearly Being An Idiot. And that's not how the real world actually works - most unambiguously, it is sometimes the case that a person would update away from their current position if they knew the evidence/arguments/counterarguments which I would present.

Comment by johnswentworth on Integrating Hidden Variables Improves Approximation · 2024-07-09T23:39:16.417Z · LW · GW

This is particularly interesting if we take  and  to be two different models, and take the indices 1, 2 to be different values of another random variable  with distribution  given by . In that case, the above inequality becomes:

Note to self: this assumes P[Y] = Q[Y].

Comment by johnswentworth on Scalable oversight as a quantitative rather than qualitative problem · 2024-07-06T21:35:20.913Z · LW · GW

I wasn't imagining that the human knew the best answer to any given subproblem, but nonetheless that did flesh out a lot more of what it means (under your mental model) for a human to "understand a subproblem", so that was useful.

I'll try again:

I think that this is indeed part of the value proposition for scalable oversight. But in my opinion, it's missing the more central application of these techniques: situations where the AIs are ~~taking many actions~~ solving many subproblems, where humans would eventually understand ~~any particular action~~ how well the AI's plan/action solves any particular subproblem if they spent a whole lot of time investigating it, but where that amount of time taken to oversee any ~~action~~ subproblem is prohibitively large. In such cases, the point of scalable oversight is to allow them to oversee ~~actions~~ subproblems at a much lower cost in terms of human time--to push out the Pareto frontier of oversight quality vs cost.

(... and presumably an unstated piece here is that "understanding how well the AI's plan/action solves a particular subproblem" might include recursive steps like "here's a sub-sub-problem, assume the AI's actions do a decent job solving that one", where the human might not actually check the sub-sub-problem.)

Does that accurately express the intended message?

Comment by johnswentworth on Scalable oversight as a quantitative rather than qualitative problem · 2024-07-06T20:13:19.767Z · LW · GW

Based on this example and your other comment, it sounds like the intended claim of the post could be expressed as:

I think that this is indeed part of the value proposition for scalable oversight. But in my opinion, it's missing the more central application of these techniques: situations where the AIs are ~~taking many actions~~ solving many subproblems, where humans would eventually understand ~~any particular action~~ any particular subproblem and its solution if they spent a whole lot of time investigating it, but where that amount of time taken to oversee any ~~action~~ subproblem is prohibitively large. In such cases, the point of scalable oversight is to allow them to oversee ~~actions~~ subproblems at a much lower cost in terms of human time--to push out the Pareto frontier of oversight quality vs cost.

Does that accurately express the intended message?
 

Comment by johnswentworth on Scalable oversight as a quantitative rather than qualitative problem · 2024-07-06T19:00:20.884Z · LW · GW

... situations where the AIs are taking many actions, where humans would eventually understand any particular action if they spent a whole lot of time investigating it...

Can you give an example (toy example is fine) of:

  • an action one might want to understand
  • what plan/strategy/other context that action is a part of
  • what it would look like for a human to understand the action

?

Mostly I'm confused what it would even mean to understand an action. Like, if I imagine a maze-solving AI, and I see it turn left at a particular spot (or plan to turn left), I'm not sure what it would even look like to "understand" that left-turn separate from understanding its whole maze-plan.

Comment by johnswentworth on A "Bitter Lesson" Approach to Aligning AGI and ASI · 2024-07-06T15:16:58.761Z · LW · GW

I'd love to see more discussion by more people of the convergence ideas presented in Requirements for a Basin of Attraction to Alignment (and its prequel Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis).

+1, that was an underrated post.