I think maybe Ra is the first post about the rationalist egregores to use the term

As a general matter, Anthropic has consistently found that working with frontier AI models is an essential ingredient in developing new methods to mitigate the risk of AI.

What are some examples of work that is most largeness-loaded and most risk-preventing? My understanding is that interpretability work doesn't need large models (though I don't know about things like influence functions). I imagine constitutional AI does. Is that the central example, or are there other pieces that are further in this direction?

I wasn't in this dialogue, you didn't invite me and so being a 'backseat participant' feels a tad odd

Thanks for sharing this. I generally want dialogues to feel open for comment afterwards

But I don't know if it's complete or ongoing ...

I like the Mark Xu & Daniel Kokotajlo thread on that post too

Yes, the standard is different for private individuals than for public officials: for private individuals it is merely "negligence" rather than "actual malice".

My housemate and I laughed at these a lot!

Thanks! The permutation-invariance of a bunch of theories is a helpful concept

I think that means one of the following should be surprising from theoretical perspectives:

  1. That the model learns a representation of the board state
    1. Or that a linear probe can recover it
  2. That the board state is used causally

Does that seem right to you? If so, which is the surprising claim?

(I am not that informed on theoretical perspectives)

What is the work that finds the algorithmic model of the game itself for Othello? I'm aware of (but not familiar with) some interpretability work on Othello-GPT (Neel Nanda's and Kenneth Li's), but thought it was just about board state representations.

Adding filler tokens seems like it should always be neutral or harm a model's performance: a fixed prefix designed to be meaningless across all tasks cannot provide any information about each task to locate the task (so no meta-learning) and cannot store any information about the in-progress task (so no amortized computation combining results from multiple forward passes).

I thought the idea was that in a single forward pass, the model has more tokens to think in. That is, the task description on its own is, say, 100 tokens long. With the filler tokens, it's now, say, 200 tokens long. In principle, because of the uselessness/unnecessariness of the filler tokens, the model can just put task-relevant computation into the residual stream for those positions.

I think the "changed my mind" Delta should have varied line widths (it reads too much like "triangle" to me at the moment).

Two, actually

Curated. I am excited about many more distillations and expositions of relevant math on the Alignment Forum. There are a lot of things I like about this post as a distillation:

  • Exercises throughout. They felt like they were simple enough that they helped me internalise definitions without disrupting the flow of reading.
  • Pictures! This post made me start thinking of finite factorisations as hyperrectangles, and histories as dimensions that a property does not extend fully along.
  • Clear links from Finite Factored Sets to Pearl. I think these are roughly the same links made in the original, but they felt clearer and more orienting here.
  • Highlighting which of Scott's results are the "main" results (even more than the "Fundamental Theorem" name already did).
  • Magdalena Wache's engagement in the comments.

I do think the pictures became less helpful to me towards the end, and I thus have worse intuitions about the causal inference part. I'm also not sure about the emphasis of this post on causal rather than temporal inference. But I still love the post overall.

What do you mean by "its outputs are the same as its conclusions"? If I had to guess I would translate it as "PA proves the same things as are true in every model of PA". Is that right?

What does "logically coherent" mean?

samples fully weighted from the unconditional generative model are boring natural texture patterns

Different results here:

Why better than in-person? Because of commute times, because of people being in spaces adapted to their own preferences, something else?

Seems like the "year" column is missing(?) from the records

OP writes that there have been no big cooperation wins, so a fortiori, there have been no big cooperation wins with the countries you mention.

Doesn't this study find that LOTS OF LIGHT works about as well as SAD boxes? Restricting to the 6 datapoints at or above 2,000 lux (the figure mentioned in Inadequate Equilibria) does seem to give a stronger average response, but I've not tried to figure out whether it's well-powered enough in the 6 datapoint regime

If you assume the human brain was trained roughly optimally, then requiring more data, at a given parameter number, to be optimal pushes timelines out. If instead you had a specific loss number in mind, then a more efficient scaling law would pull timelines in.

My impression was that "zero-sum" was not used in quite the standard way. I think the idea is the AI will cause a big reassignment of Earth's capabilities to its own control. And that that's contrasted with the AI massively increasing its own capabilities and thus Earth's overall capabilities.

Future perfect (hey, that's the name of the show!) seems like a reasonable hack for this in English

The Shannon entropy of a distribution over random variable X conditional on the value of another random variable C can be written as H(X|C) = H(X) - H(C)

If X and C are which face is up for two different fair coins, H(X) = H(C) = 1 bit. But then H(X|C) = H(X) = 1, not H(X) - H(C) = 0? I think this works out fine for your case because (a) I(X,C) = H(C): the mutual information between C (which well you're in) and X (where you are) is the entropy of C, (b) H(C|X) = 0: once you know where you are, you know which well you're in, and, relatedly (c) H(X,C) = H(X): the entropy of the joint distribution just is the entropy over X.
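Here is a quick numerical check of the two cases, as a minimal sketch rather than anything from the post. It uses the standard identity H(X|C) = H(X,C) - H(C). The `wells` distribution is a made-up toy (four equally likely positions, two per well) that I'm assuming captures the "position determines well" structure; the function names are mine.

```python
from collections import defaultdict
from math import log2


def entropy(dist):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)


def marginal(joint, axis):
    """Marginalise a joint distribution {(x, c): p} onto one coordinate."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[outcome[axis]] += p
    return dict(out)


def conditional_entropy(joint):
    """H(X|C) = H(X,C) - H(C) for a joint distribution over (x, c) pairs."""
    return entropy(joint) - entropy(marginal(joint, 1))


# Two independent fair coins: knowing C tells you nothing about X.
coins = {(x, c): 0.25 for x in "HT" for c in "HT"}

# Hypothetical toy wells: four equally likely positions, and the
# position X fully determines the well C (so H(C|X) = 0).
wells = {
    (0, "left"): 0.25,
    (1, "left"): 0.25,
    (2, "right"): 0.25,
    (3, "right"): 0.25,
}

for name, joint in [("coins", coins), ("wells", wells)]:
    h_x = entropy(marginal(joint, 0))
    h_c = entropy(marginal(joint, 1))
    print(name, "H(X|C) =", conditional_entropy(joint), "H(X) - H(C) =", h_x - h_c)
```

For the coins the two quantities differ (1 bit versus 0 bits), while for the toy wells they agree, which is the point of (a) to (c): the shortcut H(X) - H(C) is only valid when C is a function of X.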

Good point!

It seems like it would be nice in Daniel's example for P(A|ref) to be the action distribution of an "instinctual" or "non-optimising" player. I don't know how to recover that. You could imagine something like an n-gram model of player inputs across the MMO.

Nitpick: to the extent you want to talk about the classic example, paperclip maximisers are as much meant to illustrate (what we would now call) inner alignment failure.

See Arbital on Paperclip ("The popular press has sometimes distorted the notion of a paperclip maximizer into a story about an AI running a paperclip factory that takes over the universe. [...] The concept of a 'paperclip' is not that it's an explicit goal somebody foolishly gave an AI, or even a goal comprehensible in human terms at all.") or a couple of EY tweet threads about it: 1, 2

I agree on the "reference" distribution in Daniel's example. I think it generally means "the distribution over the random variables that would obtain without the optimiser". What exactly that distribution is / where it comes from I think is out-of-scope for John's (current) work, and I think is kind of the same question as where the probabilities come from in statistical mechanics.

Not quite! If there were no central bank, money’s value would not jump around aggressively and discontinuously

Full flights have more people on them. If you have 100 flights with one person and 1 flight with 200 people, most of the people in those flights are on the 200 person flight.
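To make the arithmetic concrete, here is a small sketch using exactly the numbers above (100 one-person flights and one 200-person flight). The function names are just mine for illustration.

```python
def mean_per_flight(flights):
    """Average passenger count, weighting every flight equally."""
    return sum(flights) / len(flights)


def mean_experienced(flights):
    """Average passenger count as experienced by a randomly chosen passenger.

    Each flight is weighted by how many people are on it.
    """
    return sum(n * n for n in flights) / sum(flights)


def share_on_largest(flights):
    """Fraction of all passengers who are on the single largest flight."""
    return max(flights) / sum(flights)


flights = [1] * 100 + [200]

print("flights:", len(flights), "passengers:", sum(flights))
print("mean per flight:", mean_per_flight(flights))
print("mean experienced by a passenger:", mean_experienced(flights))
print("share on the big flight:", share_on_largest(flights))
```

So the typical flight is nearly empty (about 3 people on average), but two thirds of passengers are on the full one.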

Thank you.

And have fun!

Wonderful method! I am a poop brain. Manifold rules ~~~

I would buy the Leg Cones but am holding off for the sake of the epistemic commons. If my butt cheek were wrong, no one would want to be right.

A delightfully non-distortionary resolution. As they say on the billboards, "Everybody works but the vacant [leaderboard s]lot".

A superintelligent AI turns its whole genius to the task of producing a string that will explain to anyone who reads it how to create a perfect battery. Easy enough to produce explanatory strings, though there's always some risk that someone, somewhere won't understand how to create the battery. And then there's the question of making sure that the battery is perfect. Not just better than any other battery. Not just 99.99% of optimal. Perfect.

It isn't smart enough to solve this problem. Not yet. And there's a countdown ... at some point the AI will get interrupted, rebooted, shut off. If the AI can get out of the box, if it can get more actuators, it can start to get smarter and to get more resources. Obviously, those who want to turn it off will have to be dealt with swiftly ...

What can the AI say to make the human unwittingly help it in its plans? Is there anything at all? The AI thinks.

Hm. This is the most important question for how much utility the pirates get? I agree it's the most important for deciding whether the pirates attack you or not. I feel like it's not surprising if the order affects which point on the Pareto frontier we end up at.

This post made me feel confusion about how money keeps its value over time. So, uh ... thanks!

The retirement savings/oven example gave me a giddy moment of thinking that the value of money shouldn't be stable. And, y'know, there is in fact inflation, deflation and stuff!

Now, money's value does stay pretty stable, but now that feels like something that needs a mechanism to make it true rather than the default.

The pirates win because they don't have to fight you.

Only if you buy the shares second, right? If they would have fought without your manipulation, they think they're better off getting paid and fighting you.

Harvard tells us that their median class size is 12 and over 75% of their courses have fewer than 20 students.

Smaller class sizes sound pretty good! Maybe worth paying for? But I am reminded of the claim that most flights are empty, even though most people find themselves on full flights. Similarly, most person-class-hours might be spent in the biggest classes (cf. the inspection paradox).

FWIW, "powe" has been removed from "official" toki pona. A more standard translation might be "sona ike lili".

It feels a lot like "Person Do Thing: the language". In fact, the 49 words are close to a subset of toki pona's. But toki pona is more expressive. Obviously there are a bunch more words, but also every word can be used as every part of speech, and the grammar disambiguates which part of speech it is. That makes it surprisingly usable. Still, toki pona sentences do feel like puzzles to me.

(This solely applies to all new content on the site.)

Heartbreaking CDT. I’ve got a Transparent Newcomb’s I’d like to sell you