Posts

A Solomonoff Inductor Walks Into a Bar: Schelling Points for Communication 2024-07-26T00:33:42.000Z
(Approximately) Deterministic Natural Latents 2024-07-19T23:02:12.306Z
Dialogue on What It Means For Something to Have A Function/Purpose 2024-07-15T16:28:56.609Z
3C's: A Recipe For Mathing Concepts 2024-07-03T01:06:11.944Z
Corrigibility = Tool-ness? 2024-06-28T01:19:48.883Z
What is a Tool? 2024-06-25T23:40:07.483Z
Towards a Less Bullshit Model of Semantics 2024-06-17T15:51:06.060Z
My AI Model Delta Compared To Christiano 2024-06-12T18:19:44.768Z
My AI Model Delta Compared To Yudkowsky 2024-06-10T16:12:53.179Z
Natural Latents Are Not Robust To Tiny Mixtures 2024-06-07T18:53:36.643Z
Calculating Natural Latents via Resampling 2024-06-06T00:37:42.127Z
Value Claims (In Particular) Are Usually Bullshit 2024-05-30T06:26:21.151Z
When Are Circular Definitions A Problem? 2024-05-28T20:00:23.408Z
Why Care About Natural Latents? 2024-05-09T23:14:30.626Z
Some Experiments I'd Like Someone To Try With An Amnestic 2024-05-04T22:04:19.692Z
Examples of Highly Counterfactual Discoveries? 2024-04-23T22:19:19.399Z
Why Would Belief-States Have A Fractal Structure, And Why Would That Matter For Interpretability? An Explainer 2024-04-18T00:27:43.451Z
Generalized Stat Mech: The Boltzmann Approach 2024-04-12T17:47:31.880Z
How We Picture Bayesian Agents 2024-04-08T18:12:48.595Z
Coherence of Caches and Agents 2024-04-01T23:04:31.320Z
Natural Latents: The Concepts 2024-03-20T18:21:19.878Z
The Worst Form Of Government (Except For Everything Else We've Tried) 2024-03-17T18:11:38.374Z
The Parable Of The Fallen Pendulum - Part 2 2024-03-12T21:41:30.180Z
The Parable Of The Fallen Pendulum - Part 1 2024-03-01T00:25:00.111Z
Leading The Parade 2024-01-31T22:39:56.499Z
A Shutdown Problem Proposal 2024-01-21T18:12:48.664Z
Some Vacation Photos 2024-01-04T17:15:01.187Z
Apologizing is a Core Rationalist Skill 2024-01-02T17:47:35.950Z
The Plan - 2023 Version 2023-12-29T23:34:19.651Z
Natural Latents: The Math 2023-12-27T19:03:01.923Z
Talk: "AI Would Be A Lot Less Alarming If We Understood Agents" 2023-12-17T23:46:32.814Z
Principles For Product Liability (With Application To AI) 2023-12-10T21:27:41.403Z
What I Would Do If I Were Working On AI Governance 2023-12-08T06:43:42.565Z
On Trust 2023-12-06T19:19:07.680Z
Dialogue on the Claim: "OpenAI's Firing of Sam Altman (And Shortly-Subsequent Events) On Net Reduced Existential Risk From AGI" 2023-11-21T17:39:17.828Z
On the lethality of biased human reward ratings 2023-11-17T18:59:02.303Z
Some Rules for an Algebra of Bayes Nets 2023-11-16T23:53:11.650Z
Symbol/Referent Confusions in Language Model Alignment Experiments 2023-10-26T19:49:00.718Z
What's Hard About The Shutdown Problem 2023-10-20T21:13:27.624Z
Trying to understand John Wentworth's research agenda 2023-10-20T00:05:40.929Z
Bids To Defer On Value Judgements 2023-09-29T17:07:25.834Z
Inside Views, Impostor Syndrome, and the Great LARP 2023-09-25T16:08:17.040Z
Atoms to Agents Proto-Lectures 2023-09-22T06:22:05.456Z
What's A "Market"? 2023-08-08T23:29:24.722Z
Yes, It's Subjective, But Why All The Crabs? 2023-07-28T19:35:36.741Z
Alignment Grantmaking is Funding-Limited Right Now 2023-07-19T16:49:08.811Z
Why Not Subagents? 2023-06-22T22:16:55.249Z
Lessons On How To Get Things Right On The First Try 2023-06-19T23:58:09.605Z
Algorithmic Improvement Is Probably Faster Than Scaling Now 2023-06-06T02:57:33.700Z
$500 Bounty/Prize Problem: Channel Capacity Using "Insensitive" Functions 2023-05-16T21:31:35.490Z

Comments

Comment by johnswentworth on johnswentworth's Shortform · 2024-07-23T06:13:00.133Z · LW · GW

+1, and even for those who do buy extinction risk to some degree, financial/status incentives usually have more day-to-day influence on behavior.

Comment by johnswentworth on johnswentworth's Shortform · 2024-07-23T06:11:44.803Z · LW · GW

Good argument, I find this at least somewhat convincing. Though it depends on whether penalty (1), the one capped at 10%/30% of training compute cost, would be applied more than once on the same model if the violation isn't remedied.

Comment by johnswentworth on johnswentworth's Shortform · 2024-07-23T03:18:29.244Z · LW · GW

So I read SB1047.

My main takeaway: the bill is mostly a recipe for regulatory capture, and that's basically unavoidable using anything even remotely similar to the structure of this bill. (To be clear, regulatory capture is not necessarily a bad thing on net in this case.)

During the first few years after the bill goes into effect, companies affected are supposed to write and then implement a plan to address various risks. What happens if the company just writes and implements a plan which sounds vaguely good but will not, in fact, address the various risks? Probably nothing. Or, worse, those symbolic-gesture plans will become the new standard going forward.

In order to avoid this problem, someone at some point would need to (a) have the technical knowledge to evaluate how well the plans actually address the various risks, and (b) have the incentive to actually do so.

Which brings us to the real underlying problem here: there is basically no legible category of person who has the requisite technical knowledge and also the financial/status incentive to evaluate those plans for real.

(The same problem also applies to the board of the new regulatory body, once past the first few years.)

Having noticed that problem as a major bottleneck to useful legislation, I'm now a lot more interested in legal approaches to AI X-risk which focus on catastrophe insurance. That would create a group - the insurers - who are strongly incentivized to acquire the requisite technical skills and then make plans/requirements which actually address some risks.

Comment by johnswentworth on Natural Latents: The Math · 2024-07-18T17:32:26.809Z · LW · GW

So 'latents' are defined by their conditional distribution functions whose shape is implicit in the factorization that the latents need to satisfy, meaning they don't have to always look like P[Λ|X], they can look like P[Λ|X₁], etc, right?

The key idea here is that, when "choosing a latent", we're not allowed to choose P[X]. P[X] is fixed/known/given; a latent is just a helpful tool for reasoning about or representing P[X]. So another way to phrase it is: we're choosing our whole model P[X,Λ], but with a constraint on the marginal P[X]. P[Λ|X] then captures all of the degrees of freedom we have in choosing a latent.

Now, we won't typically represent the latent explicitly as P[Λ|X]. Typically we'll choose latents such that P[X,Λ] satisfies some factorization(s), and those factorizations will provide a more compact representation of the distribution than two giant tables for P[X] and P[Λ|X]. For instance, insofar as P[X,Λ] factors as P[Λ] ∏ᵢ P[Xᵢ|Λ], we might want to represent the distribution as P[Λ] and P[Xᵢ|Λ] (both for analytic and computational purposes).
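
To make that concrete, here's a tiny numerical sketch (toy numbers of my own, not anything from the post): P[X] over two binary observables is held fixed, choosing a latent just means choosing P[Λ|X], and the joint is then pinned down as P[X,Λ] = P[X] P[Λ|X]:

```python
import itertools

# Toy example: X = (X1, X2) with X1, X2 binary. P[X] is fixed/known/given.
P_X = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# "Choosing a latent" = choosing P[Lambda | X] for each value x of X.
# Here Lambda is a noisy copy of X1; any other conditional would also be a valid latent.
def P_L_given_X(lam, x):
    x1, _ = x
    return 0.9 if lam == x1 else 0.1

# The joint is pinned down by the constraint on the marginal: P[X, Lambda] = P[X] * P[Lambda | X].
P_XL = {(x, lam): P_X[x] * P_L_given_X(lam, x)
        for x, lam in itertools.product(P_X, [0, 1])}

# Sanity check: marginalizing out Lambda recovers the fixed P[X].
for x in P_X:
    assert abs(sum(P_XL[(x, lam)] for lam in [0, 1]) - P_X[x]) < 1e-9
```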

I don't get the 'standard form' business.

We've largely moved away from using the standard form anyway, I recommend ignoring it at this point.

Also is this post relevant to either of these statements, and if so, does that mean they only hold under strong redundancy?

Yup, that post proves the universal natural latent conjecture when strong redundancy holds (over 3 or more variables). Whether the conjecture holds when strong redundancy fails is an open question. But since the strong redundancy result, we've mostly shifted toward viewing strong redundancy as the usual condition to look for, and focused less on weak redundancy.

Resampling

Also does all this imply that we're starting out assuming that Λ shares a probability space with all the other possible latents, e.g. Λ′? How does this square with a latent variable being defined by the CPD implicit in the factorization?

We conceptually start with the objects P[X], P[Λ|X], and P[Λ′|X]. (We're imagining here that two different agents measure the same distribution P[X], but then they each model it using their own latents.) Given only those objects, the joint distribution P[X,Λ,Λ′] is underdefined - indeed, it's unclear what such a joint distribution would even mean! Whose distribution is it?

One simple answer (unsure whether this will end up being the best way to think about it): one agent is trying to reason about the observables X, their own latent Λ, and the other agent's latent Λ′ simultaneously, e.g. in order to predict whether the other agent's latent is isomorphic to Λ (as would be useful for communication).

Since Λ and Λ′ are both latents, in order to define P[X,Λ,Λ′], the agent needs to specify P[Λ,Λ′|X]. That's where the underdefinition comes in: only P[Λ|X] and P[Λ′|X] were specified up-front. So, we sidestep the problem: we construct a new latent Λ″ such that P[Λ″|X] matches P[Λ|X], but Λ″ is independent of Λ′ given X. Then we've specified the whole distribution P[X,Λ′,Λ″].
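
Here's a minimal sketch of that construction with toy numbers (the specific distributions are made up for illustration): the resampled latent Λ″ copies Λ's conditional but is independent of Λ′ given X, which pins down the full joint.

```python
import itertools

# Toy setup: X binary, two agents' latents Lambda and Lambda' each defined by a conditional on X.
P_X = {0: 0.3, 1: 0.7}
P_L_given_X  = {(0, 'a'): 0.8, (0, 'b'): 0.2, (1, 'a'): 0.1, (1, 'b'): 0.9}   # agent 1's latent
P_Lp_given_X = {(0, 'c'): 0.5, (0, 'd'): 0.5, (1, 'c'): 0.9, (1, 'd'): 0.1}   # agent 2's latent

# Resampling construction: Lambda'' has the same conditional as Lambda,
# but is independent of Lambda' given X. That pins down the full joint:
# P[X, Lambda', Lambda''] = P[X] * P[Lambda'|X] * P[Lambda''|X].
P_joint = {
    (x, lp, lpp): P_X[x] * P_Lp_given_X[(x, lp)] * P_L_given_X[(x, lpp)]
    for x, lp, lpp in itertools.product([0, 1], ['c', 'd'], ['a', 'b'])
}

# Check: the marginal over (X, Lambda'') matches the original P[X] * P[Lambda|X].
for x, lpp in itertools.product([0, 1], ['a', 'b']):
    marg = sum(P_joint[(x, lp, lpp)] for lp in ['c', 'd'])
    assert abs(marg - P_X[x] * P_L_given_X[(x, lpp)]) < 1e-9
```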

Hopefully that clarifies what the math is, at least. It's still a bit fishy conceptually, and I'm not convinced it's the best way to handle the part it's trying to handle.

Comment by johnswentworth on AI #73: Openly Evil AI · 2024-07-18T17:08:33.304Z · LW · GW

Yeah, it's the "exchange" part which seems to be missing, not the "securities" part.

Comment by johnswentworth on AI #73: Openly Evil AI · 2024-07-18T15:34:58.595Z · LW · GW

Why does the SEC have any authority at all over OpenAI? They're not a publicly listed company, i.e. they're not listed on any securities exchange, so naively one would think a "securities exchange commission" doesn't have much to do with them.

I mean, obviously federal agencies always have scope creep, it's not actually surprising if they have some authority over OpenAI, but what specific legal foundation is the SEC on here? What is their actual scope?

Comment by johnswentworth on Natural Latents: The Math · 2024-07-17T20:43:21.651Z · LW · GW

Consider the exact version of the redundancy condition for latent Λ over X₁, X₂:

P[Λ|X₁,X₂] = P[Λ|X₁]

and

P[Λ|X₁,X₂] = P[Λ|X₂]

Combine these two and we get, for all x₁, x₂:

P[X₁=x₁, X₂=x₂] = 0 OR P[Λ|X₁=x₁] = P[Λ|X₂=x₂]

That's the foundation for a conceptually-simple method for finding the exact natural latent (if one exists) given a distribution P[X₁,X₂]:

  • Pick a pair (x₁, x₂) which has nonzero probability, and initialize a set S containing that pair. Then we must have P[Λ|X₁=x₁] = P[Λ|X₂=x₂] for all pairs in S.
  • Loop: add to S a new nonzero-probability pair containing a new value x₁′ or x₂′, where the other value x₂ or x₁ (respectively) already appears in one of the pairs in S. Then P[Λ|X₁=x₁′] = P[Λ|X₂=x₂] or P[Λ|X₂=x₂′] = P[Λ|X₁=x₁], respectively. Repeat until there are no more candidate values to add to S.
  • Pick a new pair and repeat with a new set, until all values of (X₁, X₂) have been added to a set.
  • Now take the latent to be the equivalence class in which (X₁, X₂) falls.
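
Here's a rough sketch of that procedure in code, on a made-up toy distribution (assuming I've stated the procedure correctly above): the equivalence classes are just the connected components of the support graph of P[X₁,X₂].

```python
from collections import defaultdict

# Toy distribution over (X1, X2); zero-probability pairs are simply absent.
P = {('a', 1): 0.25, ('a', 2): 0.25, ('b', 2): 0.10,
     ('c', 3): 0.20, ('d', 3): 0.20}

# Build the bipartite support graph: x1 and x2 are linked iff P[x1, x2] > 0.
neighbors = defaultdict(set)
for (x1, x2), p in P.items():
    if p > 0:
        neighbors[('X1', x1)].add(('X2', x2))
        neighbors[('X2', x2)].add(('X1', x1))

# Flood-fill connected components; each component is one equivalence class,
# i.e. one value of the candidate exact natural latent.
component = {}
for start in neighbors:
    if start in component:
        continue
    label = len(set(component.values()))
    stack = [start]
    while stack:
        node = stack.pop()
        if node in component:
            continue
        component[node] = label
        stack.extend(neighbors[node])

def latent(x1, x2):
    # Redundancy in action: either coordinate alone pins down the equivalence class.
    assert component[('X1', x1)] == component[('X2', x2)]
    return component[('X1', x1)]

print({pair: latent(*pair) for pair in P})
# ('a',1), ('a',2), ('b',2) share one class; ('c',3), ('d',3) share another.
```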

Does that make sense?

Comment by johnswentworth on Dialogue on What It Means For Something to Have A Function/Purpose · 2024-07-16T01:36:16.650Z · LW · GW

Was this intended to respond to any particular point, or just a general observation?

Comment by johnswentworth on Corrigibility = Tool-ness? · 2024-07-15T19:02:11.834Z · LW · GW

My current starting point would be standard methods for decomposing optimization problems, like e.g. the sort covered in this course.

Comment by johnswentworth on Alignment: "Do what I would have wanted you to do" · 2024-07-13T16:26:30.912Z · LW · GW

No, because we have tons of information about what specific kinds of information on the internet is/isn't usually fabricated. It's not like we have no idea at all which internet content is more/less likely to be fabricated.

Information about, say, how to prove that there are infinitely many primes is probably not usually fabricated. It's standard basic material, there's lots of presentations of it, it's not the sort of thing which people usually troll about. Yes, the distribution of internet text about the infinitude of primes contains more-than-zero trolling and mistakes and the like, but that's not the typical case, so low-temperature sampling from the LLM should usually work fine for that use-case.

On the other end of the spectrum, "fusion power plant blueprints" on the internet today will obviously be fictional and/or wrong, because nobody currently knows how to build a fusion power plant which works. This generalizes to most use-cases in which we try to get an LLM to do something (using only prompting on a base model) which nobody is currently able to do. Insofar as the LLM is able to do such things, that actually reflects suboptimal next-text prediction on its part.

Comment by johnswentworth on Alignment: "Do what I would have wanted you to do" · 2024-07-12T22:06:43.933Z · LW · GW

That is not how this works. Let's walk through it for both the "human as clumps of molecules following physics" and the "LLM as next-text-on-internet predictor".

Humans as clumps of molecules following physics

Picture a human attempting to achieve some goal - for concreteness, let's say the human is trying to pick an apple from a high-up branch on an apple tree. Picture what that human does: they maybe get a ladder, or climb the tree, or whatever. They manage to pluck the apple from the tree and drop it in a basket.

Now, imagine a detailed low-level simulation of the exact same situation: that same human trying to pick that same apple. Modulo quantum noise, what happens in that simulation? What do we see when we look at its outputs? Well, it looks like a human attempting to achieve some goal: the clump of molecules which is a human gets another clump which is a ladder, or climbs the clump which is the tree, or whatever.

LLM as next-text-on-internet predictor

Now imagine finding the text "Notes From a Prompt Factory" on the internet, today (because the LLM is trained on text from ~today). Imagine what text would follow that beginning on the internet today.

The text which follows that beginning on the internet today is not, in fact, notes from a prompt factory. Instead, it's fiction about a fictional prompt factory. So that's the sort of thing we should expect a highly capable LLM to output following the prompt "Notes From a Prompt Factory": fiction. The more capable it is, the more likely it is to correctly realize that that prompt precedes fiction.

It's not a question of whether the LLM is absorbing the explicit and tacit knowledge of internet authors; I'm perfectly happy to assume that it is. The issue is that the distribution of text on today's internet which follows the prompt "Notes From a Prompt Factory" is not the distribution of text which would result from actual notes on an actual prompt factory. The highly capable LLM absorbs all that knowledge from internet authors, and then uses that knowledge to correctly predict that what follows the text "Notes From a Prompt Factory" will not be actual notes from an actual prompt factory.

Comment by johnswentworth on Alignment: "Do what I would have wanted you to do" · 2024-07-12T19:20:26.092Z · LW · GW

Let's assume a base model (i.e. not RLHF'd), since you asserted a way to turn the LM into a goal-driven chatbot via prompt engineering alone. So you put in some prompt, and somewhere in the middle of that prompt is a part which says "Do what (pre-ASI) X, having considered this carefully for a while, would have wanted you to do", for some X.

The basic problem is that this hypothetical language model will not, in fact, do what X, having considered this carefully for a while, would have wanted it to do. What it will do is output text which statistically looks like it would come after that prompt, if the prompt appeared somewhere on the internet.

Comment by johnswentworth on My AI Model Delta Compared To Yudkowsky · 2024-07-11T02:23:58.334Z · LW · GW

A lot of the particulars of humans' values are heavily reflective. Two examples:

  • A large chunk of humans' terminal values involves their emotional/experience states - happy, sad, in pain, delighted, etc.
  • Humans typically want ~terminally to have some control over their own futures.

Contrast that to e.g. a blue-minimizing robot, which just tries to minimize the amount of blue stuff in the universe. That utility function involves reflection only insofar as the robot is (or isn't) blue.

Comment by johnswentworth on On passing Complete and Honest Ideological Turing Tests (CHITTs) · 2024-07-10T15:58:42.774Z · LW · GW

You have a decent argument for the claim as literally stated here, but I think there's some wrongheaded subtext. To try to highlight it, I'll argue for another claim about the "Complete and Honest Ideological Turing Test" as you've defined it.

Suppose that an advocate of some position would in fact abandon that position if they knew all the evidence or arguments or counterarguments which I might use to argue against it (and observers correctly know this). Then, by your definition, I cannot pass their CHITT - but it's not because I've failed to understand their position, it's because they don't know the things I know.

Suppose that an advocate of some position simply does not have any response which would not make them look irrational to some class of evidence or argument or counterargument which I would use to argue against the position (and observers correctly know this). Then again, by your definition, I cannot pass their CHITT - but again, it's not because I've failed to understand their position, it's because they in fact do not have responses which don't make them look irrational.

The claim these two point toward: as defined, sometimes it is impossible to pass someone's CHITT not because I don't understand their position, but because... well, they're some combination of ignorant and an idiot, and I know where the gaps in their argument are. This is in contrast to the original ITT, which was intended to just check whether I've actually understood somebody else's position.

Making the subtext explicit: it seems like this post is trying to push a worldview in which nobody is ever Just Plain Wrong or Clearly Being An Idiot. And that's not how the real world actually works - most unambiguously, it is sometimes the case that a person would update away from their current position if they knew the evidence/arguments/counterarguments which I would present.

Comment by johnswentworth on Integrating Hidden Variables Improves Approximation · 2024-07-09T23:39:16.417Z · LW · GW

This is particularly interesting if we take  and  to be two different models, and take the indices 1, 2 to be different values of another random variable  with distribution  given by . In that case, the above inequality becomes:

Note to self: this assumes P[Y] = Q[Y].

Comment by johnswentworth on Scalable oversight as a quantitative rather than qualitative problem · 2024-07-06T21:35:20.913Z · LW · GW

I wasn't imagining that the human knew the best answer to any given subproblem, but nonetheless that did flesh out a lot more of what it means (under your mental model) for a human to "understand a subproblem", so that was useful.

I'll try again:

I think that this is indeed part of the value proposition for scalable oversight. But in my opinion, it's missing the more central application of these techniques: situations where the AIs are ~~taking many actions~~ solving many subproblems, where humans would eventually understand ~~any particular action~~ how well the AI's plan/action solves any particular subproblem if they spent a whole lot of time investigating it, but where that amount of time taken to oversee any ~~action~~ subproblem is prohibitively large. In such cases, the point of scalable oversight is to allow them to oversee ~~actions~~ subproblems at a much lower cost in terms of human time--to push out the Pareto frontier of oversight quality vs cost.

(... and presumably an unstated piece here is that "understanding how well the AI's plan/action solves a particular subproblem" might include recursive steps like "here's a sub-sub-problem, assume the AI's actions do a decent job solving that one", where the human might not actually check the sub-sub-problem.)

Does that accurately express the intended message?

Comment by johnswentworth on Scalable oversight as a quantitative rather than qualitative problem · 2024-07-06T20:13:19.767Z · LW · GW

Based on this example and your other comment, it sounds like the intended claim of the post could be expressed as:

I think that this is indeed part of the value proposition for scalable oversight. But in my opinion, it's missing the more central application of these techniques: situations where the AIs are ~~taking many actions~~ solving many subproblems, where humans would eventually understand ~~any particular action~~ any particular subproblem and its solution if they spent a whole lot of time investigating it, but where that amount of time taken to oversee any ~~action~~ subproblem is prohibitively large. In such cases, the point of scalable oversight is to allow them to oversee ~~actions~~ subproblems at a much lower cost in terms of human time--to push out the Pareto frontier of oversight quality vs cost.

Does that accurately express the intended message?
 

Comment by johnswentworth on Scalable oversight as a quantitative rather than qualitative problem · 2024-07-06T19:00:20.884Z · LW · GW

... situations where the AIs are taking many actions, where humans would eventually understand any particular action if they spent a whole lot of time investigating it...

Can you give an example (toy example is fine) of:

  • an action one might want to understand
  • what plan/strategy/other context that action is a part of
  • what it would look like for a human to understand the action

?

Mostly I'm confused what it would even mean to understand an action. Like, if I imagine a maze-solving AI, and I see it turn left at a particular spot (or plan to turn left), I'm not sure what it would even look like to "understand" that left-turn separate from understanding its whole maze-plan.

Comment by johnswentworth on A "Bitter Lesson" Approach to Aligning AGI and ASI · 2024-07-06T15:16:58.761Z · LW · GW

I'd love to see more discussion by more people of the convergence ideas presented in Requirements for a Basin of Attraction to Alignment (and its prequel Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis).

+1, that was an underrated post.

Comment by johnswentworth on A "Bitter Lesson" Approach to Aligning AGI and ASI · 2024-07-06T06:18:55.612Z · LW · GW

I don't see an accurate mathematical description of Human Values (in less than at very least gigabytes of mathematics, the size of the most compact possible description of us, our genome) as possible...

In that case, the standard advice would be to aim for mathematical rigor in (having a basin of convergence), (the AI allowing the user to correct it), etc. The hypothesis is that it's much more viable to achieve mathematical perfection in those things than to achieve a perfect representation of human values in one shot. On the flip side, things like (having a basin of convergence) or (the AI allowing the user to correct it) are notorious for subtle failure modes, so they're places where humans just providing some data without rigorously understanding what they're aiming for seem particularly likely to make lots of major systematic errors.

And note that if an AI has been trained on "ground truth" containing a bunch of systematic errors, it's less-than-usually likely to be much help for finding those errors.

(Tangential meta note: you're doing quite a good job playing through the whole game tree here, well done.)

Comment by johnswentworth on A "Bitter Lesson" Approach to Aligning AGI and ASI · 2024-07-06T02:30:26.904Z · LW · GW

Eliezer's Lethality 22, while not worded to be about this proposal specifically, is in my head the standard first-stop objection to this sort of proposal:

Human operators are fallible, breakable, and manipulable.  Human raters make systematic errors - regular, compactly describable, predictable errors.  To faithfully learn a function from 'human feedback' is to learn (from our external standpoint) an unfaithful description of human preferences, with errors that are not random (from the outside standpoint of what we'd hoped to transfer).  If you perfectly learn and perfectly maximize the referent of rewards assigned by human operators, that kills them.  It's a fact about the territory, not the map - about the environment, not the optimizer - that the best predictive explanation for human answers is one that predicts the systematic errors in our responses, and therefore is a psychological concept that correctly predicts the higher scores that would be assigned to human-error-producing cases.

Generalizable point: garbage in, garbage out. If humans try to create the sort of dataset proposed in the post, they will make systematic predictable errors. If humans try to create the dataset with the assistance of LLMs, the combined efforts of the humans and LLMs will still contain systematic predictable errors (debatable whether the LLMs would even be net-beneficial in that regard). One could maybe hope/argue that with "good enough" data the LLM will learn the "more natural" concept which humans were trying to convey via the data, and ignore those systematic errors (even though they're predictable), but such an argument would need to lean heavily on the inductive bias of the AI rather than the data.

Also, a note on this part:

If we were able to construct models that were say, "angelic" in their motivation 99% of the time and human 1% of the time, then by setting up suitable incentives for several such models crosschecking each other's behavior and moral reasoning, as long as we can avoid group-think or correlated collusion where several models conspire to switch to human mode at the same time...

Insofar as problems stem from systematic predictable errors in the training data, they will be highly correlated across instances. Running a bunch of "independent" instances will not help much, because their failures would not actually be independent; their failures all reflect the same shared underlying problems in the data-generation process.

Comment by johnswentworth on 3C's: A Recipe For Mathing Concepts · 2024-07-04T16:49:24.066Z · LW · GW

(Meta note: the post itself is not really about teleology, that's just an example. That said, here's how the tentative answers we give for teleology in the post would apply to your question.)

Insofar as the beak was optimized to consume certain kinds of food, we should find that it's unusually well-suited to those foods in multiple ways - for instance, not just that it's big, but that it's also the right shape for those foods. The more distinct unusual features can be explained by optimization toward the same purpose, the more evidence we have that optimization was applied toward that purpose.

On the other hand, if a large beak resulted from genetic drift or a single mutation which isn't actually fit, then we would not expect other features of the beak (or the rest of the bird) to also look like they're optimized for the same foods. (Note that this also applies to "features" at the genetic rather than phenotypic level: genetic drift or a single mutation would not produce a set of mutations which mostly look like they've been selected for the same criterion.)

A similar story applies to sexual selection: if a large beak could have been selected for some foods or could have been selected for sexual purposes (or some combination of the two), then we should go look at other features of the beak/bird in order to see whether those other features look like they're optimized for the same foods.

Comment by johnswentworth on Why Can’t Sub-AGI Solve AI Alignment? Or: Why Would Sub-AGI AI Not be Aligned? · 2024-07-02T23:25:57.854Z · LW · GW

I think what it's highlighting is that there's a missing assumption. An analogy: Aristotle (with just the knowledge he historically had) might struggle to outsource the design of a quantum computer to a bunch of modern physics PhDs because (a) Aristotle lacks even the conceptual foundation to know what the objective is, (b) Aristotle has no idea what to ask for, (c) Aristotle has no ability to check their work because he has flatly wrong priors about which assumptions the physicists make are/aren't correct. The solution would be for Aristotle to go learn a bunch of quantum mechanics (possibly with some help from the physics PhDs) before even attempting to outsource the actual task. (And likely Aristotle himself would struggle with even the problem of learning quantum mechanics; he would likely give philosophical objections all over the place and then be wrong.)

Comment by johnswentworth on Why Can’t Sub-AGI Solve AI Alignment? Or: Why Would Sub-AGI AI Not be Aligned? · 2024-07-02T21:34:35.549Z · LW · GW

My stock answer: Why Not Just Outsource Alignment Research To An AI?

Comment by johnswentworth on Towards a Less Bullshit Model of Semantics · 2024-07-02T20:20:13.333Z · LW · GW

One example: you know that thing where I point at a cow and say "cow", and then the toddler next to me points at another cow and is like "cow?", and I nod and smile? That's the thing we want to understand. How the heck does the toddler manage to correctly point at a second cow, on their first try, with only one example of me saying "cow"? (Note that the same question still applies if they take a few tries, or have heard me use the word a few times.)

The post basically says that the toddler does a bunch of unsupervised structure learning, and then has a relatively small set of candidate targets, so when they hear the word once they can assign the word to the appropriate structure. And then we're interested in questions like "what are those structures?", and interoperability helps narrow down the possibilities for what those structures could be.

... and I don't think I've yet fully articulated the general version of the problem here, but the cow example is at least one case where "just take the magic box to be the identity function" fails to answer our question.
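
As a cartoon of that story in code (a toy 2-D clustering setup using scikit-learn, not a claim about what toddlers literally run): unsupervised structure learning happens first, and then a single labeled example suffices to attach the word to a whole cluster.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stage 1: unsupervised structure learning on unlabeled "sensory" data.
# Two blobs stand in for cows and dogs; the learner doesn't know the words yet.
cows = rng.normal(loc=[0, 0], scale=0.5, size=(200, 2))
dogs = rng.normal(loc=[5, 5], scale=0.5, size=(200, 2))
data = np.vstack([cows, dogs])
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

# Stage 2: one labeled example ("that thing is a 'cow'") attaches the word to a cluster...
word_for_cluster = {clusters.predict([[0.2, -0.1]])[0]: "cow"}

# ...and now a *different* cow-like observation gets the right word on the first try.
new_obs = [[-0.3, 0.4]]
print(word_for_cluster.get(clusters.predict(new_obs)[0], "???"))  # -> "cow"
```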

Comment by johnswentworth on Towards a Less Bullshit Model of Semantics · 2024-07-02T19:59:47.835Z · LW · GW

But it seems pretty plausible that a major reason why humans arrive at these 'objective' 3rd-person world-models is because humans have a strong incentive to think about the world in ways that make communication possible.

This is an interesting point which I had not thought about before, thank you. Insofar as I have a response already, it's basically the same as this thread: it seems like understanding of interoperable concepts falls upstream of understanding non-interoperable concepts on the tech tree, and also there's nontrivial probability that non-interoperable concepts just aren't used much even by Solomonoff inductors (in a realistic environment).

Comment by johnswentworth on Towards a Less Bullshit Model of Semantics · 2024-07-02T19:26:14.842Z · LW · GW

I definitely have substantial probability on the possibility that AIs will use a bunch of alien (i.e. non-interoperable or hard-to-interoperate) concepts. And in worlds where that's true, I largely agree that those are the most important (i.e. hardest/rate-limiting) part of the technical problems of AI safety.

That said:

  • I have substantial probability that AIs basically don't use a bunch of non-interoperable concepts (or converge to more interoperable concepts as capabilities grow, or ...). In those worlds, I expect that "how to understand human concepts" is the rate-limiting part of the problem.
  • Even in worlds where AIs do use lots of alien concepts, it feels like understanding human concepts is "earlier on the tech tree" than figuring out what to do with those alien concepts. Like, it is a hell of a lot easier to understand those alien concepts by first understanding human concepts and then building on that understanding, than by trying to jump straight to alien concepts.

Comment by johnswentworth on Towards a Less Bullshit Model of Semantics · 2024-07-02T18:27:49.530Z · LW · GW

That particular paragraph was intended to be about two humans. The application to AI safety is less direct than "take Alice to be a human, and Bob to be an AI" or something like that.

Comment by johnswentworth on Towards a Less Bullshit Model of Semantics · 2024-07-02T18:24:25.161Z · LW · GW

First: if the random variables include latents which extend some distribution, then values of those latents are not necessarily representable as events over the underlying distribution. Events are less general. (Related: updates allowed under radical probabilism can be represented by assignments of values to latents.)

Second: I want formulations which feel like they track what's actually going on in my head (or other peoples' heads) relatively well. Insofar as a Bayesian model makes sense for the stuff going on in my head at all, it feels like there's a whole structure of latent variables, and semantics involves assignments of values to those variables. Events don't seem to match my mental structure as well. (See How We Picture Bayesian Agents for the picture in my head here.)

Comment by johnswentworth on Towards a Less Bullshit Model of Semantics · 2024-07-02T18:13:52.339Z · LW · GW

So really, rather than "the set of semantic targets is small", I should say something like "the set of semantic targets with significant prior probability is small", or something like that. Unclear exactly what the right operationalization is there, but I think I buy the basic point.

Comment by johnswentworth on Towards a Less Bullshit Model of Semantics · 2024-07-02T16:25:31.878Z · LW · GW

I don't think this is quite right? Most of the complexity of the box is supposed to be learned in an unsupervised way from non-language data (like e.g. visual data). If someone hasn't already done all that unsupervised learning, then they don't "know what's in the box", so they don't know how to extract semantics from words.

Comment by johnswentworth on Childhood and Education Roundup #6: College Edition · 2024-06-26T15:53:16.394Z · LW · GW

What’s totally crazy is doing the math on this.

  1. About 50% of college applications are from white students.
  2. White students report they lied 34% of the time.
  3. Of those students, 48% pretended to be Native American.
  4. That means that 5.8% of applications are falsely claiming to be Native American.

0.5 * 0.34 * 0.48 = .0816, not .058

Comment by johnswentworth on What is a Tool? · 2024-06-26T01:28:02.380Z · LW · GW

Perhaps it's just the terminology of the story, but are tools modeled as physical objects or are they more general than that?

I was trying to be agnostic about that. I think the cognitive characterization works well for a fairly broad notion of tool, including thought-tools, mathematical proof techniques, or your examples of governments and markets.

With footnote 4, is the point that...

There may be subproblems which don't fit any cluster well, there may be clusters which don't have any standard tool, some of those subproblem-clusters may be "solved" by a tool later but maybe some never will be... there's a lot of possibilities.

Comment by johnswentworth on johnswentworth's Shortform · 2024-06-22T18:25:58.551Z · LW · GW

The easiest answer is to look at the specs. Of course specs are not super reliable, so take it all with many grains of salt. I'll go through the AMD/Nvidia comparison here, because it's a comparison I looked into a few months back.

MI300x vs H100

Techpowerup is a third-party site with specs for the MI300x and the H100, so we can do a pretty direct comparison between those two pages. (I don't know if the site independently tested the two chips, but they're at least trying to report comparable numbers.) The H200 would arguably be more of a "fair comparison" since the MI300x came out much later than the H100; we'll get to that comparison next. I'm starting with MI300x vs H100 comparison because techpowerup has specs for both of them, so we don't have to rely on either company's bullshit-heavy marketing materials as a source of information. Also, even the H100 is priced 2-4x more expensive than the MI300x (~$30-45k vs ~$10-15k), so it's not unfair to compare the two.

Key numbers (MI300x vs H100):

  • float32 TFLOPs: ~80 vs ~50
  • float16 TFLOPs: ~650 vs ~200
  • memory: 192 GB vs 80 GB (note that this is the main place where the H200 improves on the H100)
  • bandwidth: ~10 TB/s vs ~2 TB/s

... so the comparison isn't even remotely close. The H100 is priced 2-4x higher but is utterly inferior in terms of hardware.

MI300x vs H200

I don't know of a good third-party spec sheet for the H200, so we'll rely on Nvidia's page. Note that they report some numbers "with sparsity" which, to make a long story short, means those numbers are blatant marketing bullshit. Other than those numbers, I'll take their claimed specs at face value.

Key numbers (MI300x vs H200):

  • float32 TFLOPs: ~80 vs ~70
  • float16 TFLOPs: don't know, Nvidia conspicuously avoided reporting that number
  • memory: 192 GB vs 141 GB
  • bandwidth: ~10 TB/s vs ~5 TB/s

So they're closer than the MI300x vs H100, but the MI300x still wins across the board. And pricewise, the H200 is probably around $40k, so 3-4x more expensive than the MI300x.
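
For a rough sense of the implied price-performance, using the spec numbers above and the midpoints of those ballpark price ranges (my own rough figures, not official pricing):

```python
# Rough perf-per-dollar from the specs quoted above, using midpoint ballpark prices
# (~$12.5k MI300x, ~$37.5k H100, ~$40k H200 -- rough figures, not official).
chips = {
    #         fp32 TFLOPs, memory GB, bandwidth TB/s, ~price $k
    "MI300x": (80, 192, 10, 12.5),
    "H100":   (50,  80,  2, 37.5),
    "H200":   (70, 141,  5, 40.0),
}

for name, (fp32, mem, bw, price) in chips.items():
    print(f"{name}: {fp32/price:4.1f} fp32 TFLOPs per $k, "
          f"{mem/price:5.1f} GB per $k, {bw/price:5.2f} TB/s per $k")
```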

Comment by johnswentworth on "... than average" is (almost) meaningless · 2024-06-21T17:18:33.315Z · LW · GW

so you have no reason to conclude that their X-skill is normal for any arbitrary X

I mean, you have some priors, and priors are a totally valid thing to reason from even if you're not taking a uniform random sample of all people.

In general, "I have a few samples of X, and I don't have any particular reason to think they're unusual samples, so on priors they're probably typical" is a totally valid way to reason even if your samples aren't uniform random from the population.

Comment by johnswentworth on johnswentworth's Shortform · 2024-06-21T16:50:23.675Z · LW · GW

NVIDIA Is A Terrible AI Bet

Short version: Nvidia's only moat is in software; AMD already makes flatly superior hardware priced far lower, and Google probably does too but doesn't publicly sell it. And if AI undergoes smooth takeoff on current trajectory, then ~all software moats will evaporate early.

Long version: Nvidia is pretty obviously in a hype-driven bubble right now. However, it is sometimes the case that (a) an asset is in a hype-driven bubble, and (b) it's still a good long-run bet at the current price, because the company will in fact be worth that much. Think Amazon during the dot-com bubble. I've heard people make that argument about Nvidia lately, on the basis that it will be ridiculously valuable if AI undergoes smooth takeoff on the current apparent trajectory.

My core claim here is that Nvidia will not actually be worth much, compared to other companies, if AI undergoes smooth takeoff on the current apparent trajectory.

Other companies already make ML hardware flatly superior to Nvidia's (in flops, memory, whatever), and priced much lower. AMD's MI300x is the most obvious direct comparison. Google's TPUs are probably another example, though they're not sold publicly so harder to know for sure.

So why is Nvidia still the market leader? No secret there: it's the CUDA libraries. Lots of (third-party) software is built on top of CUDA, and if you use non-Nvidia hardware then you can't use any of that software.

That's exactly the sort of moat which will disappear rapidly if AI automates most-or-all software engineering, and on current trajectory software engineering would be one of the earlier areas to see massive AI acceleration. In that world, it will be easy to move any application-level program to run on any lower-level stack, just by asking an LLM to port it over.

So in worlds where AI automates software engineering to a very large extent, Nvidia's moat is gone, and their competition has an already-better product at already-lower price.

Comment by johnswentworth on "... than average" is (almost) meaningless · 2024-06-21T05:35:47.424Z · LW · GW

Here's how I think I intuitively interpret such statements:

  • "I am a pretty average cook" -> Either
    • I have no particular evidence about my cooking ability and therefore on priors expect to be fairly typical, or
    • I fall roughly in the middle of the few people I know well, and have no particular reason to think those people are very unusual in terms of cooking-skill, and therefore on priors + a little data I expect my skills to be fairly typical
  • "I am an above average cook" -> I am pretty good compared to the few people I know well, and I have no particular reason to think they're unusually bad cooks, so on priors + a little data I expect my skills to be better than typical.

Comment by johnswentworth on Towards a Less Bullshit Model of Semantics · 2024-06-19T17:24:19.161Z · LW · GW

I would make a similar critique of basically-all the computational approaches I've seen to date. They generally try to back out "semantics" from a text corpus, which means their "semantics" grounds out in relations between words; neither the real world nor mental content make any appearance. They may use Bayes' rule and latents like this post does, but such models can't address the kinds of questions this post is asking at all.

(I suppose my complaints are more about structuralism than about model-theoretic foundations per se. Internally I'd been thinking of it more as an issue with model-theoretic foundations, since model theory is the main route through which structuralism has anything at all to say about the stuff which I would consider semantics.)

Of course you might have in mind some body of work on computational linguistics/semantics with which I am unfamiliar, in which case I would be quite grateful for my ignorance to be corrected!

Comment by johnswentworth on Towards a Less Bullshit Model of Semantics · 2024-06-19T16:12:02.246Z · LW · GW

Again, interesting work, hope this didn't come off too combative!

Not at all, you are correctly critiquing the downsides of a trade-off which we consciously made. 

There was a moment during writing when David suggested we soften the title/opening to avoid alienating classical semantics researchers. I replied that I expected useful work on the project to mostly come, not from people with a background in classical semantics, but from people who had bounced off of classical semantics because they intuited that it "wasn't addressing the real problem". Those are the people who've already felt, on a gut level, a need for the sort of models the post outlines. (Also, as one person who reviewed the post put it: "Although semantics would suggest that this post would be interesting to logicians, linguists and their brethren [...] I think they would not find it interesting because it is a seemingly nonsymbolic attempt to approach semantics. Symbolical methods are their bread and butter, without them they would be lost.")

To that end, the title and opening are optimized to be a costly signal to those people who bounced off classical semantics, signalling that they might be interested in this post even though they've been unsatisfied before with lots of work on semantics. The cost of that costly signal is alienating classical semantics researchers. And having made that trade-off upfront, naturally we didn't spend much time trying to express this project in terms more familiar to people in the field.

If we were writing a version more optimized for people already in the field, I might have started by saying that the core problem is the use of model theory as the primary foundation for semantics (i.e. Montague semantics and its descendants as the central example). That foundation explicitly ignores the real world, and is therefore only capable of answering questions which don't require any notion of the real world - e.g. Montague nominally focused on how truth values interact with syntax. Obviously that is a rather narrow and impoverished subset of the interesting questions about natural language semantics, and I would have then gone through some standard alternate approaches (and critiques) which do allow a role for the real world, before introducing our framework.

Comment by johnswentworth on My AI Model Delta Compared To Christiano · 2024-06-19T01:49:52.978Z · LW · GW

Ehh, yes and no. I maybe buy that a median human doing a deep dive into a random object wouldn't notice the many places where there is substantial room for improvement; hanging around with rationalists does make it easy to forget just how low the median-human bar is.

But I would guess that a median engineer is plenty smart enough to see the places where there is substantial room for improvement, at least within their specialty. Indeed, I would guess that the engineers designing these products often knew perfectly well that they were making tradeoffs which a fully-informed customer wouldn't make. The problem, I expect, is mostly organizational dysfunction (e.g. the committee of engineers is dumber than one engineer, and if there are any nontechnical managers involved then the collective intelligence nosedives real fast), and economic selection pressure.

For instance, I know plenty of software engineers who work at the big tech companies. The large majority of them (in my experience) know perfectly well that their software is a trash fire, and will tell you as much, and will happily expound in great detail the organizational structure and incentives which lead to the ongoing trash fire.

Comment by johnswentworth on Towards a Less Bullshit Model of Semantics · 2024-06-18T17:31:22.407Z · LW · GW

Great examples! I buy them to varying extents:

  • Features like "edge" or "dappled" were IIRC among the first discoveries when people first started doing interp on CNNs back around 2016 or so. So they might be specific to a data modality (i.e. vision), but they're not specific to the human brain's learning algorithm.
  • "Behind" seems similar to "edge" and "dappled", but at a higher level of abstraction; it's something which might require a specific data modality but probably isn't learning algorithm specific.
  • I buy your claim a lot more for value-loaded words, like "I'm feeling down", the connotations of "contaminate", and "much". (Note that an alien mind might still reify human-value-loaded concepts in order to model humans, but that still probably involves modeling a lot of the human learning algorithm, so your point stands.)
  • I buy that "salient" implies an attentional spotlight, but I would guess that an attentional spotlight can be characterized without modeling the bulk of the human learning algorithm.
  • I buy that the semantics of "and" or "but" are pretty specific to humans' language-structure, but I don't actually care that much about the semantics of connectives like that. What I care about is the semantics of e.g. sentences containing "and" or "but".
  • I definitely buy that analogies like "butterfingers" are a pretty large chunk of language in practice, and it sure seems hard to handle semantics of those without generally understanding analogy, and analogy sure seems like a big central piece of the human learning algorithm.

At the meta-level: I've been working on this natural abstraction business for four years now, and your list of examples in that comment is one of the most substantive and useful pieces of pushback I've gotten in that time. So the semantics frame is definitely proving useful!

One mini-project in this vein which would potentially be high-value would be for someone to go through a whole crapton of natural language examples and map out some guesses at which semantics would/wouldn't be convergent across minds in our environment.

Comment by johnswentworth on Towards a Less Bullshit Model of Semantics · 2024-06-18T03:51:48.490Z · LW · GW

That sentence was mostly just hedging. The intended point was that the criteria which we focused on in the post aren't necessarily the be-all end-all.

Comment by johnswentworth on Towards a Less Bullshit Model of Semantics · 2024-06-18T03:50:15.425Z · LW · GW

…where you seem to see just one stage, if I understand this post correctly.

Oh, I totally agree with your mental model here. It's implicit in the clustering toy model, for example: the agents fit the clusters to some data (stage 1), and only after that can they match words to clusters with just a handful of examples of each word (stage 2).

In that frame, the overarching idea of the post is:

  • We'd like to characterize what the (convergent/interoperable) latents are.
  • ... and because stage 2 exists, we can use language (and our use thereof) as one source of information about those latents, and work backwards. Working forwards through stage 1 isn't the only option.

Also, I think Stage 1 (i.e. sensory input → generative model) is basically the hard part of AGI capabilities. [...] So I have strong misgivings about a call-to-arms encouraging people to sort that out.

Note that understanding what the relevant latents are does not necessarily imply the ability to learn them efficiently. Indeed, the toy models in the post are good examples: both involve recognizing "naturality" conditions over some stuff, but they're pretty agnostic about the learning stage.

I admit that there's potential for capability externalities here, but insofar as results look like more sophisticated versions of the toy models in the post, I expect this work to be multiple large steps away from application to capabilities.

Comment by johnswentworth on Generalizing Koopman-Pitman-Darmois · 2024-06-14T16:30:11.537Z · LW · GW

The nice thing about using odds (or log odds) is that the normalizer cancels out when using Bayes' rule. For a boolean query Q and data D, it looks like this:

  • Bayes' rule, usual form: P[Q|D] = (1/Z) P[D|Q] P[Q], where Z = P[D] is the normalizer
  • Bayes' rule, odds form: P[Q|D]/P[¬Q|D] = (P[D|Q]/P[D|¬Q]) (P[Q]/P[¬Q]), where ¬Q is not-Q.

That's the usual presentation. But note that it assumes Q is boolean. How do we get the same benefit - i.e. a form in which the normalizer cancels out in Bayes' rule - for non-boolean Q?

The trick is to choose a reference value of Q, and compute probabilities relative to the probability of that reference value. For instance, if Q is a six-sided die roll, I could choose Q = 1 as my reference value, and then I'd represent the distribution as (P[Q=1]/P[Q=1], P[Q=2]/P[Q=1], ..., P[Q=6]/P[Q=1]). You can check that, when I update this distribution to P[Q|D], and represent the updated distribution as (P[Q=1|D]/P[Q=1|D], ..., P[Q=6|D]/P[Q=1|D]), the normalizer cancels out just like it does for the odds form on a boolean variable.

That's the trick used in the post. Applying the trick requires picking an arbitrary value of the variable to use as the reference value (like Q = 1 above).
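
A quick numerical check of the reference-value trick, with made-up die-roll numbers (nothing here is from the post):

```python
import numpy as np

# Toy six-sided die: prior P[Q], and a likelihood P[D|Q] for some observed data D.
prior = np.array([0.25, 0.15, 0.15, 0.15, 0.15, 0.15])
likelihood = np.array([0.10, 0.20, 0.30, 0.10, 0.20, 0.10])   # P[D | Q=q]

# Usual form: the posterior requires computing the normalizer Z = P[D].
posterior = prior * likelihood / np.sum(prior * likelihood)

# Reference-value form: track everything relative to Q=1 (index 0); no normalizer needed.
rel_prior = prior / prior[0]
rel_posterior = rel_prior * likelihood / likelihood[0]

# The two agree once the true posterior is also expressed relative to its reference value.
assert np.allclose(rel_posterior, posterior / posterior[0])
```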

Comment by johnswentworth on My AI Model Delta Compared To Christiano · 2024-06-13T21:10:08.696Z · LW · GW

Most real-world problems are outside of NP. Let's go through some examples...

Suppose I am shopping for a new fridge, and I want to know which option is best for me (according to my own long-term values). Can I easily write down a boolean circuit (possibly with some inputs from data on fridges) which is satisfiable if-and-only-if this fridge in particular is in fact the best option for me according to my own long-term values? No, I have no idea how to write such a boolean circuit at all. Heck, even if my boolean circuit could internally use a quantum-level simulation of me, I'd still have no idea how to do it, because neither my stated values nor my revealed preferences are identical to my own long-term values. So that problem is decidedly not in NP.

(Variant of that problem: suppose an AI hands me a purported mathematical proof that this fridge in particular is the best option for me according to my own long-term values. Can I verify the proof's correctness? Again, no, I have no idea how to do that, I don't understand my own values well enough to distinguish a proof which makes correct assumptions about my values from one which makes incorrect assumptions.)

A quite different example from Hindsight Isn't 20/20: suppose our company has 100 workers, all working to produce a product. In order for the product to work, all 100 workers have to do their part correctly; if even just one of them messes up, then the whole product fails. And it's an expensive one-shot sort of project; we don't get to do end-to-end tests a billion times. I have been assigned to build the backup yellow connector widget, and I do my best. The product launches. It fails. Did I do my job correctly? No idea, even in hindsight; isolating which parts failed would itself be a large and expensive project. Forget writing down a boolean circuit in advance which is satisfiable if-and-only-if I did my job correctly; I can't even write down a boolean circuit in hindsight which is satisfiable if-and-only-if I did my job correctly. I simply don't have enough information to know.

Another kind of example: I read a paper which claims that FoxO mediates the inflammatory response during cranial vault remodelling surgery. Can I easily write down a boolean circuit (possibly with some inputs from the paper) which is satisfiable if-and-only-if the paper's result is basically correct? Sure, it could do some quick checks (look for p-hacking or incompetently made-up data, for example), but from the one paper I just don't have enough information to reliably tell whether the result is basically correct.

Another kind of example: suppose I'm building an app, and I outsource one part of it. The contractor sends me back a big chunk of C code. Can I verify that (a) the C code does what I want, and (b) the C code has no security holes? In principle, formal verification tools advertise both of those. In practice, expressing what I want in a formal verification language is as-much-or-more-work as writing the code would be (assuming that I understand what I want well enough to formally express it at all, which I often don't). And even then, I'd expect someone who's actually good at software security to be very suspicious of the assumptions made by the formal verifier.

Comment by johnswentworth on My AI Model Delta Compared To Christiano · 2024-06-13T16:46:55.066Z · LW · GW

Yes, though I would guess my probability on P = NP is relatively high compared to most people reading this. I'm around 10-15% on P = NP.

Notably relevant:

People who’ve spent a lot of time thinking about P vs NP often have the intuition that “verification is easier than generation”. [...]

The problem is, this intuition comes from thinking about problems which are in NP. NP is, roughly speaking, the class of algorithmic problems for which solutions are easy to verify. [...]

I think a more accurate takeaway would be that among problems in NP, verification is easier than generation. In other words, among problems for which verification is easy, verification is easier than generation. Rather a less impressive claim, when you put it like that.

Comment by johnswentworth on My AI Model Delta Compared To Christiano · 2024-06-13T03:56:53.448Z · LW · GW

As you're doing these delta posts, do you feel like it's changing your own positions at all?

Mostly not, because (at least for Yudkowsky and Christiano) these are deltas I've been aware of for at least a couple years. So the writing process is mostly just me explaining stuff I've long since updated on, not so much figuring out new stuff.

Comment by johnswentworth on My AI Model Delta Compared To Yudkowsky · 2024-06-13T00:15:41.243Z · LW · GW

I do not think SAE results to date contribute very strong evidence in either direction. "Extract all the abstractions from a layer" is not obviously an accurate statement of what they do, and the features they do find do not obviously faithfully and robustly map to human concepts, and even if they did it's not clear that they compose in human-like ways. They are some evidence, but weak.

Comment by johnswentworth on My AI Model Delta Compared To Yudkowsky · 2024-06-12T22:57:59.070Z · LW · GW

My current best guess is that spacetime locality of physics is the big factor - i.e. we'd get a lot of similar high-level abstractions (including e.g. minds/reasoning/sensing) in other universes with very different physics but similar embedding of causal structure into 4 dimensional spacetime.

Comment by johnswentworth on My AI Model Delta Compared To Christiano · 2024-06-12T21:50:44.578Z · LW · GW

Yeah, I think this is very testable, it's just very costly to test - partly because it requires doing deep dives on a lot of different stuff, and partly because it's the sort of model which makes weak claims about lots of things rather than very precise claims about a few things.