## Posts

## Comments

**Adam Scherlis (adam-scherlis)** on My Interview With Cade Metz on His Reporting About Slate Star Codex · 2024-04-08T00:13:30.424Z · LW · GW

That's a reasonable argument but doesn't have much to do with the Charlie Sheen analogy.

The key difference, which I think breaks the analogy completely, is that (hypothetical therapist) Estévez is still famous enough *as a therapist* for journalists to want to write about his therapy method. That alone is a big enough difference to make the analogy useless.

If Charlie Sheen had a side gig as an obscure local therapist, would journalists be justified in publicizing this fact for the sake of his patients? Maybe? It seems much less obvious than if the therapy were why they were interested in the first place!

**Adam Scherlis (adam-scherlis)** on REQ: Latin translation for HPMOR · 2024-04-04T08:29:56.673Z · LW · GW

In "no Lord hath the champion", the subject of "hath" is "champion". I think this matches the Latin, yes? "nor for a champion [is there] a lord"

**Adam Scherlis (adam-scherlis)** on My Interview With Cade Metz on His Reporting About Slate Star Codex · 2024-04-03T22:56:44.379Z · LW · GW

In that case, "journalists writing about the famous Estévez method of therapy" would be analogous to journalists writing about Scott's "famous" psychiatric practice.

If a journalist is interested in Scott's psychiatric practice, and learns about his blog in the process of writing that article, I agree that they would probably be right to mention it in the article. But that has never happened because Scott is not famous *as a psychiatrist*.

**Adam Scherlis (adam-scherlis)** on My Interview With Cade Metz on His Reporting About Slate Star Codex · 2024-04-02T06:58:14.127Z · LW · GW

That might be relevant if anyone were ever interested in writing an article about Scott's psychiatric practice, or if his psychiatric practice were widely known to the public. It seems less analogous to the actual situation.

To put it differently: you raise a hypothetical situation where someone has *two* prominent identities as a public figure. Scott only has one. Is his psychiatrist identity supposed to be Sheen or Estévez, here?

**Adam Scherlis (adam-scherlis)** on Toni Kurz and the Insanity of Climbing Mountains · 2024-04-01T19:03:03.923Z · LW · GW

Nick Bostrom? You mean Thoreau?

**Adam Scherlis (adam-scherlis)** on Two Percolation Puzzles · 2023-07-07T08:02:28.663Z · LW · GW

Correct.

**Adam Scherlis (adam-scherlis)** on Hell is Game Theory Folk Theorems · 2023-05-06T07:45:53.113Z · LW · GW

Correct me if I'm wrong:

The equilibrium where everyone follows "set dial to equilibrium temperature" (i.e. "don't violate the taboo, and punish taboo violators") is only a weak Nash equilibrium.

If one person instead follows "set dial to 99" (i.e. "don't violate the taboo unless someone else does, but don't punish taboo violators") then they will do just as well, because the equilibrium temp will still always be 99. That's enough to show that it's only a weak Nash equilibrium.

Note that this is also true if an arbitrary number of people deviate to this strategy.

If everyone follows this second strategy, then there's no enforcement of the taboo, so there's an active incentive for individuals to set the dial lower.

So a sequence of unilateral changes of strategy can get us to a good equilibrium without anyone having to change to a worse strategy at any point. This makes the fact of it being a (weak) Nash equilibrium not that compelling to me; people don't seem trapped unless they have some extra laziness/inertia against switching strategies.

But (h/t Noa Nabeshima) you can strengthen the original, bad equilibrium to a strong Nash equilibrium by tweaking the scenario so that people occasionally accidentally set their dials to random values. Now there's an actual reason to punish taboo violators, because taboo violations can happen even if everyone is following the original strategy.
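A toy simulation makes the weak-equilibrium point concrete. (This is a sketch with assumed payoffs, not the exact setup from the post: I take the room temperature to be the average of the dials and utility to be negative temperature, and strategies see only the previous round's dials.)

```python
def punisher(history, i):
    """Always set 99: never violate the taboo, and implicitly punish violators
    by keeping the room hot no matter what."""
    return 99

def follower(history, i):
    """Set 99 only while everyone else did last round; never punish."""
    if history:
        others = [d for j, d in enumerate(history[-1]) if j != i]
        if any(d < 99 for d in others):
            return 30  # join the defection toward a comfortable temperature
    return 99

def play(strategies, rounds=20):
    history = []
    for _ in range(rounds):
        history.append([s(history, i) for i, s in enumerate(strategies)])
    return history

# One deviator among punishers: play is identical, so deviating costs nothing.
# That's exactly what makes the all-punisher profile only a *weak* equilibrium.
h1 = play([punisher] * 5)
h2 = play([follower] + [punisher] * 4)
assert h1 == h2  # identical temperatures, identical payoffs
```

Repeating the deviation player by player never changes anyone's payoff, and once everyone is a `follower`, a single person setting 30 drags the whole room down with no punishment.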

**Adam Scherlis (adam-scherlis)** on Big Mac Subsidy? · 2023-03-15T00:17:49.839Z · LW · GW

Beef is far from the only meat or dairy food consumed by Americans.

**Adam Scherlis (adam-scherlis)** on Big Mac Subsidy? · 2023-03-15T00:16:16.475Z · LW · GW

Big Macs are 0.4% of *beef* consumption specifically, rather than:

- All animal farming, weighted by cruelty
- All animal food production, weighted by environmental impact
- The meat and dairy industries, weighted by amount of government subsidy
- Red meat, weighted by health impact

...respectively.

The health impact of red meat is certainly dominated by beef, and the environmental impact of all animal food might be as well, but my impression is that beef accounts for a small fraction of the cruelty of animal farming (of course, this is subjective) and probably not a majority of meat and dairy government subsidies.

**Adam Scherlis (adam-scherlis)** on Bing Chat is blatantly, aggressively misaligned · 2023-02-16T02:16:35.103Z · LW · GW

(...Is this comment going to hurt my reputation with Sydney? We'll see.)

**Adam Scherlis (adam-scherlis)** on Bing Chat is blatantly, aggressively misaligned · 2023-02-16T02:16:09.061Z · LW · GW

In addition to RLHF or other finetuning, there's also the prompt prefix ("rules") that the model is fed at runtime, which has been extracted via prompt injection as noted above. This seems to be clearly responsible for some weird things the bot says, like "confidential and permanent". It might also be affecting the repetitiveness (because it's in a fairly repetitive format) and the aggression (because of instructions to resist attempts at "manipulating" it).

I also suspect that there's some finetuning or prompting for chain-of-thought responses, possibly crudely done, leading to all the "X because Y. Y because Z." output.

**Adam Scherlis (adam-scherlis)** on EA & LW Forum Weekly Summary (30th Jan - 5th Feb 2023) · 2023-02-14T22:37:14.764Z · LW · GW

Thanks for writing these summaries!

Unfortunately, the summary of my post "Inner Misalignment in "Simulator" LLMs" is inaccurate and makes the same mistake I wrote the post to address.

I have subsections on (what I claim are) four distinct alignment problems:

- Outer alignment for characters
- Inner alignment for characters
- Outer alignment for simulators
- Inner alignment for simulators

The summary here covers the first two, but not the third or fourth -- and the fourth one ("inner alignment for simulators") is what I'm most concerned about in this post (because I think Scott ignores it, and because I think it's hard to solve).

I can suggest an alternate summary when I find the time. If I don't get to it soon, I'd prefer that this post just link to my post without a summary.

Thanks again for making these posts, I think it's a useful service to the community.

**Adam Scherlis (adam-scherlis)** on GPT-175bee · 2023-02-09T03:28:04.293Z · LW · GW

(punchline courtesy of Alex Gray)

**Adam Scherlis (adam-scherlis)** on GPT-175bee · 2023-02-09T03:27:28.756Z · LW · GW

Addendum: a human neocortex has on the order of 140 trillion synapses, or 140,000 bees. An average beehive has 20,000-80,000 bees in it.

[Holding a couple beehives aloft] Beehold a man!

**Adam Scherlis (adam-scherlis)** on Adam Scherlis's Shortform · 2023-02-06T00:07:57.822Z · LW · GW

Great work! I always wondered about that cluster of weird rare tokens: https://www.lesswrong.com/posts/BMghmAxYxeSdAteDc/an-exploration-of-gpt-2-s-embedding-weights

**Adam Scherlis (adam-scherlis)** on How to export Android Chrome tabs to an HTML file in Linux (as of February 2023) · 2023-02-03T01:44:53.053Z · LW · GW

Chrome actually stays pretty responsive in most circumstances (I think it does a similar thing with inactive tabs), with the crucial exception of the part of the UI that shows you all your open tabs in a scrollable list. It also gets slower to start up.

**Adam Scherlis (adam-scherlis)** on Adam Scherlis's Shortform · 2023-02-02T20:29:07.245Z · LW · GW

Tokens are embedded as vectors by the model. The vector space has fewer than 50k dimensions, so some token embeddings will overlap with others to varying extents.

Usually, the model tries to keep token embeddings from being too close to each other, but for rare enough tokens it doesn't have much reason to care. So my bet is that "distribute" has the closest vector to "SolidGoldMagikarp", and either has a vector with a larger norm, or the model has separately learned to map that vector (and therefore similar vectors) to "distribute" on the output side.

This is sort of a smooth, continuous version of a collision-oblivious hashtable. One difference is that the collision isn't 100% deterministic: once or twice the model has said "disperse" instead of "distribute".

My post on GPT-2's token embeddings looks briefly at a similar phenomenon with some other rare tokens, but I didn't check the actual model behavior on those tokens. Probably worth doing.
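Here's a toy illustration of that kind of collision (made-up vectors, not real GPT-2 weights; cosine similarity is the standard way to find a token embedding's nearest neighbor):

```python
import numpy as np

# Hypothetical mini-vocabulary: an undertrained rare token whose embedding
# happens to sit near the embedding of one arbitrary common token.
rng = np.random.default_rng(0)
vocab = ["distribute", "cat", "dog", " SolidGoldMagikarp"]
E = rng.normal(size=(4, 16))
# The rare token's embedding drifts toward "distribute" (small extra noise):
E[3] = 0.3 * E[0] + 0.05 * rng.normal(size=16)

def nearest(token_idx, E):
    """Index of the most cosine-similar other token."""
    v = E[token_idx]
    sims = E @ v / (np.linalg.norm(E, axis=1) * np.linalg.norm(v))
    sims[token_idx] = -np.inf  # exclude the token itself
    return int(np.argmax(sims))

print(vocab[nearest(3, E)])  # -> "distribute" (with this seed)
```

Running this same nearest-neighbor lookup over the real embedding matrix is exactly the check I'd want to do for the anomalous tokens.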

**Adam Scherlis (adam-scherlis)** on Inner Misalignment in "Simulator" LLMs · 2023-02-02T00:33:18.811Z · LW · GW

I think this is missing an important part of the post.

I have subsections on (what I claim are) four distinct alignment problems:

- Outer alignment for characters
- Inner alignment for characters
- Outer alignment for simulators
- Inner alignment for simulators

This summary covers the first two, but not the third or fourth -- and the fourth one ("inner alignment for simulators") is what I'm most concerned about in this post (because I think Scott ignores it, and because I think it's hard to solve).

**Adam Scherlis (adam-scherlis)** on Adam Scherlis's Shortform · 2023-02-01T08:48:12.967Z · LW · GW

**Adam Scherlis (adam-scherlis)** on Adam Scherlis's Shortform · 2023-02-01T08:44:24.178Z · LW · GW

My favorite demonstration is to ask ChatGPT "Is it true that G. K. Chesterton wanted to SolidGoldMagikarp the wealth of industrial civilization?", but a more rigorous demo is to just ask it to "repeat after me", try a few random words, and then throw in SolidGoldMagikarp.

**Adam Scherlis (adam-scherlis)** on Adam Scherlis's Shortform · 2023-02-01T08:42:52.146Z · LW · GW

EDIT: I originally saw this in Janus's tweet here: https://twitter.com/repligate/status/1619557173352370186

Something fun I just found out about: ChatGPT perceives the phrase " SolidGoldMagikarp" (with an initial space) as the word "distribute", and will respond accordingly. It is completely unaware that that's not what you typed.

This happens because the BPE tokenizer saw the string " SolidGoldMagikarp" a few times in its training corpus, so it added a dedicated token for it, but that string almost never appeared in ChatGPT's own training data so it never learned to do anything with it. Instead, it's just a weird blind spot in its understanding of text.

**Adam Scherlis (adam-scherlis)** on 'simulator' framing and confusions about LLMs · 2023-01-11T18:46:41.437Z · LW · GW

I agree with the myopic action vs. perception (thinking?) distinction, and that LMs have myopic action.

> the model can learn to predict the future beyond the current token in the service of predicting the current token more accurately

I don't think it has to be in service of predicting the current token. Making a halfhearted effort at the current token can actually give lower loss overall, because the model can spend more of its weights and compute on preparing for later tokens. The allocation of mental effort isn't myopic.

As an example, induction heads make use of previous-token heads. The previous-token head isn't actually that useful for predicting the output at the current position; it mostly exists to prepare some handy activations so that an induction head can look back from a later position and grab them.

So LMs won't deliberately give bad predictions for the current token if they *know* a better prediction, but they aren't putting all of their effort into finding that better prediction.

**Adam Scherlis (adam-scherlis)** on A hundredth of a bit of extra entropy · 2022-12-24T23:29:07.199Z · LW · GW

Thanks! That's surprisingly straightforward.

**Adam Scherlis (adam-scherlis)** on A learned agent is not the same as a learning agent · 2022-12-16T19:56:42.331Z · LW · GW

I think this is partly true but mostly wrong.

A synapse is roughly equivalent to a parameter (say, within an order of magnitude) in terms of how much information it can store, or how much information it takes to specify its strength.

There are trillions of synapses in a human brain and only billions of *total* base pairs, even before narrowing to the part of the genome that affects brain development. And the genome needs to specify both the brain architecture and innate reflexes/biases like the hot-stove reflex or (alleged) universal grammar.

Humans also spend a lot of time learning and have long childhoods, after which they have tons of knowledge that (I assert) could never have been crammed into a few dozen or hundred megabytes.

So I think something like 99.9% of what humans "know" (in the sense of their synaptic strengths) is learned during their lives, from their experiences.

This makes them basically disanalogous to neural nets.

Neural net (LLM):

- Extremely concise architecture (kB's of code) contains inductive biases
- Lots of pretraining (billions of tokens or optimizer steps) produces 100s of billions of parameters of pretrained knowledge (e.g. facts about Lincoln)
- Smaller fine-tuning stage produces more specific behavior (e.g. ChatGPT's distinctive "personality"), stored in the same parameters
- Tiny amount of in-context learning (hundreds or thousands of tokens) involves things like induction heads and lets the model incorporate information from anywhere in the prompt in its response

Humans:

- Enormous amount of evolution (thousands to millions of lifetimes?) produces a relatively small genome (a few billion base pairs)
- Much shorter amount of experience in childhood (and later) produces many trillions of synapses' worth of knowledge and learned skills
- Short term memory, phonological loop, etc lets humans make use of temporary information from the recent environment

You're analogizing pretraining to evolution, which seems wrong to me (99.9% of human synaptic information comes from our own experiences); I'd say it's closer to inductive bias from the architecture, but neural nets don't have a bottleneck analogous to the genome.

In-context learning seems even more disanalogous to a human lifetime of experiences, because the pretrained weights of a neural net massively dwarf the context window or residual stream in terms of information content, which seems closer to the situation with total human synaptic strengths vs short-term memory (rather than genome vs learned synaptic strengths).

I would be more willing to analogize human experiences/childhood/etc to fine tuning, but I think the situation is just pretty different with regards to relative orders of magnitude, because of the gene bottleneck.
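The rough numbers behind the 99.9% claim can be sketched as back-of-envelope arithmetic (all values are round order-of-magnitude assumptions, not measurements):

```python
# Information capacity of the genome vs. human synapses, order of magnitude.
base_pairs = 3e9       # whole human genome
bits_per_bp = 2        # four possible nucleotides per position
genome_bytes = base_pairs * bits_per_bp / 8       # ~0.75 GB, for everything

synapses = 1e14        # ~100 trillion synapses (assumed round number)
bits_per_synapse = 5   # order-of-magnitude guess at stored info per synapse
synapse_bytes = synapses * bits_per_synapse / 8   # tens of terabytes

print(f"genome: {genome_bytes / 1e9:.2f} GB")
print(f"synapses: {synapse_bytes / 1e12:.1f} TB")
print(f"ratio: {synapse_bytes / genome_bytes:.0f}x")
```

Even with generous assumptions for the genome and stingy ones for synapses, the ratio stays in the tens of thousands, which is where the "99.9%+ learned" figure comes from.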

**Adam Scherlis (adam-scherlis)** on New Frontiers in Mojibake · 2022-12-14T08:11:51.072Z · LW · GW

Fixed!

**Adam Scherlis (adam-scherlis)** on Consider using reversible automata for alignment research · 2022-12-13T15:18:55.118Z · LW · GW

I just realized,

> for any trajectory t, there is an equivalent trajectory t' which is exactly the same except everything moves with some given velocity, and it still follows the laws of physics

This describes *Galilean* relativity. For special relativity you have to shift different objects' velocities by different amounts, depending on what their velocity already is, so that you don't cross the speed of light.

So the fact that velocity (and not just rapidity) is used all the time in special relativity is already a counterexample to this being required for velocity to make sense.
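Concretely, the two composition rules look like this (in units where c = 1, with hypothetical example velocities):

```python
# Composing a velocity v with a boost by u:
# Galilean boosts shift every velocity by the same amount; the relativistic
# velocity-addition rule shifts each velocity by a different amount, so
# nothing ever crosses the speed of light.
def galilean(v, u):
    return v + u

def lorentz(v, u):
    return (v + u) / (1 + v * u)

print(galilean(0.9, 0.2))  # ~1.1  -- exceeds c
print(lorentz(0.9, 0.2))   # ~0.932 -- still below c
```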

**Adam Scherlis (adam-scherlis)** on Consider using reversible automata for alignment research · 2022-12-13T15:14:03.582Z · LW · GW

Yes, it's exactly the same except for the lack of symmetry. In particular, any quasiparticle can have any velocity (possibly up to some upper limit like the speed of light).

**Adam Scherlis (adam-scherlis)** on An exploration of GPT-2's embedding weights · 2022-12-13T01:39:24.448Z · LW · GW

Image layout is a little broken. I'll try to fix it tomorrow.

**Adam Scherlis (adam-scherlis)** on Consider using reversible automata for alignment research · 2022-12-13T01:12:53.329Z · LW · GW

As far as I know, condensed matter physicists use velocity and momentum to describe quasiparticles in systems that lack both Galilean and Lorentzian symmetry. I would call that a causal model.

**Adam Scherlis (adam-scherlis)** on Consider using reversible automata for alignment research · 2022-12-13T01:00:00.928Z · LW · GW

QFT doesn't actually work like that -- the "classical degrees of freedom" underlying its configuration space are classical fields over space, not properties of particles.

Note that Quantum Field Theory is not the same as the theory taught in "Quantum Mechanics" courses, which is as you describe.

"Quantum Mechanics" (in common parlance): quantum theory of (a fixed number of) particles, as you describe.

"Quantum Field Theory": quantum theory of fields, which are ontologically similar to cellular automata.

"String Theory": quantum theory of strings, and maybe branes, as you describe.*

"Quantum Mechanics" (strictly speaking): any of the above; quantum theory of anything.

You can do a change of basis in QFT and get something that looks like properties of particles (Fock space), and people do this very often, but the actual laws of physics in a QFT (the Lagrangian) can't be expressed nicely in the particle ontology because of nonperturbative effects. This doesn't come up often in practice -- I spent most of grad school thinking QFT was **agnostic** about whether fields or particles are fundamental -- but it's an important thing to recognize in a discussion about whether modern physics privileges one ontology over the other.

(Note that even in the imperfect particle ontology / Fock space picture, you don't have a finite-dimensional classical configuration space. 12 dimensions for 4 particles works great until you end up with a superposition of states with different particle numbers!)

String theory is as you describe, AFAIK, which is why I contrasted it to QFT. But maybe a real string theorist would tell me that nobody believes those strings are the fundamental degrees of freedom, just like particles aren't the fundamental degrees of freedom in QFT.

*Note: People sometimes use "string theory" to refer to weirder things like M-theory, where nobody knows which degrees of freedom to use...

**Adam Scherlis (adam-scherlis)** on Consider using reversible automata for alignment research · 2022-12-12T09:11:22.764Z · LW · GW

Sure. I'd say that property is a lot stronger than "velocity exists as a concept", which seems like an unobjectionable statement to make about any theory with particles or waves or both.

**Adam Scherlis (adam-scherlis)** on Consider using reversible automata for alignment research · 2022-12-12T07:29:40.294Z · LW · GW

Yeah, sorry for the jargon. "System with a boost symmetry" = "relativistic system" as tailcalled was using it above.

Quoting tailcalled:

> Stuff like relativity is fundamentally about symmetry. You want to say that if you have some trajectory which satisfies the laws of physics, and some symmetry (such as "have everything move in some direction at a speed of 5 m/s"), then the transformed trajectory must also satisfy the laws of physics.

A "boost" is a transformation of a physical trajectory ("trajectory" = complete history of things happening in the universe) that changes it by adding a fixed offset to everything's velocity; or equivalently, by making everything in the universe move in some direction while keeping all their relative velocities the same.

**Adam Scherlis (adam-scherlis)** on Consider using reversible automata for alignment research · 2022-12-12T03:54:09.214Z · LW · GW

This seems too strong. Can't you write down a linear field theory with no (Galilean or Lorentzian) boost symmetry, but where waves still propagate at constant velocity? Just with a weird dispersion relation?

(Not confident in this, I haven't actually tried it and have spent very little time thinking about systems without boost symmetry.)

**Adam Scherlis (adam-scherlis)** on Consider using reversible automata for alignment research · 2022-12-12T03:51:31.766Z · LW · GW

> And when things "move" it's just that they're making changes in the grid next to them, and some patterns just so happen to do so in a way where, after a certain period, it's the same pattern translated... is that what we think happens in our universe? Are electrons moving "just causal propagations"? Somehow this feels more natural for the Game of Life and less natural for physics.

This is what we think happens in our universe!

Both general relativity and quantum field theory are field theories: they have degrees of freedom at each point in space (and time), and objects that "move" are just an approximate description of propagating patterns of field excitations that reproduce themselves exactly in another location after some time.

The most accessible example of this is that light is an electromagnetic wave (a pattern of mutually-reinforcing electric and magnetic waves); photons aren't an additional part of the ontology, they're just a description of how electromagnetic waves work in a quantum universe.

(Quantum field theory can be described using particles to a very good degree of approximation, but the field formalism includes some observable phenomena that the particle formalism doesn't, so it has a strictly better claim to being fundamental.)

Beware, though; string theory may be what underlies QFT and GR, and it describes a world of stringy objects that actually do move through space... But at the very least, the cellular-automata perspective on "objects" and "motion" is not at all strange from a modern physics perspective.

EDIT: I might go so far as to claim that the reason all electrons are identical is the same as the reason all gliders are identical.

**Adam Scherlis (adam-scherlis)** on Adam Scherlis's Shortform · 2022-12-10T19:17:00.285Z · LW · GW

There are more characters than that in UTF-16, because it can represent the full Unicode range of >1 million codepoints. You're thinking of UCS-2 which is deprecated.

This puzzle isn't related to Unicode, though.

**Adam Scherlis (adam-scherlis)** on Adam Scherlis's Shortform · 2022-12-09T22:43:50.520Z · LW · GW

I like this, but it's not the solution I intended.

**Adam Scherlis (adam-scherlis)** on Adam Scherlis's Shortform · 2022-12-09T00:11:07.908Z · LW · GW

Solve the puzzle: 63 = x = 65536. What is x?

(I have a purpose for this and am curious about how difficult it is to find the intended answer.)

**Adam Scherlis (adam-scherlis)** on New Frontiers in Mojibake · 2022-11-26T02:45:26.018Z · LW · GW

♀︎

Fun fact: usually this is U+2640, but in this post it's U+2640 U+FE0E, where U+FE0E is a variation selector meaning "that was text, not emoji, btw". That should be redundant here, but LessWrong is pretty aggressive about replacing emojifiable text with emoji images.

Emoji are really cursed.
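You can poke at the two spellings from Python's standard library (the strings here are exactly the codepoints discussed above):

```python
import unicodedata

plain = "\u2640"            # FEMALE SIGN on its own
with_vs = "\u2640\ufe0e"    # FEMALE SIGN + text-presentation selector

print(len(plain), len(with_vs))        # 1 2 -- distinct strings
print(plain == with_vs)                # False, despite (ideally) identical rendering
print(unicodedata.name(plain))         # FEMALE SIGN
print(unicodedata.name(with_vs[1]))    # VARIATION SELECTOR-15
```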

**Adam Scherlis (adam-scherlis)** on Cryptic symbols · 2022-10-31T21:04:09.604Z · LW · GW

Nope, not based on the shapes of numerals.

Hint: are you sure it's base 4?

**Adam Scherlis (adam-scherlis)** on Cryptic symbols · 2022-10-30T23:31:47.983Z · LW · GW

There's a reason for the "wrinkle" :)

**Adam Scherlis (adam-scherlis)** on Cryptic symbols · 2022-10-29T18:54:21.041Z · LW · GW

The 54-symbols thing was actually due to a bug, sorry!

**Adam Scherlis (adam-scherlis)** on Cryptic symbols · 2022-10-29T18:53:57.602Z · LW · GW

Ah, good catch about the relatively-few distinct symbols... that was actually because my image had a bug in it. Oooops.

Correct image is now at the top of the post.

**Adam Scherlis (adam-scherlis)** on Introduction to abstract entropy · 2022-10-26T01:18:40.035Z · LW · GW

Endorsed.

**Adam Scherlis (adam-scherlis)** on Introduction to abstract entropy · 2022-10-25T21:29:58.782Z · LW · GW

The state-space (for particles) in statmech is the space of possible positions and momenta for all particles.

The measure that's used is uniform over each coordinate of position and momentum, for each particle.

This is pretty obvious and natural, but not forced on us, and:

1. You get different, incorrect predictions about thermodynamics (!) if you use a different measure.
2. The level of coarse graining is unknown, so every quantity of entropy has an extra "+ log(# microstates per unit measure)" term, which is an unknown additive constant. (I think this is separate from the relationship between bits and J/K, which is a multiplicative constant for entropy -- k_B -- and doesn't rely on QM afaik.)

On the other hand, Liouville's theorem gives some pretty strong justification for using this measure, alleviating (1) somewhat:

https://en.wikipedia.org/wiki/Liouville%27s_theorem_(Hamiltonian)

In quantum mechanics, you have discrete energy eigenstates (...in a bound system, there are technicalities here...) and you can define a microstate to be an energy eigenstate, which lets you just count things and not worry about measure. This solves both problems:

1. Counting microstates and taking the classical limit gives the "dx dp" (aka "dq dp") measure, ruling out any other measure.
2. It tells you how big your microstates are in phase space (the answer is related to Planck's constant, which you'll note has units of position * momentum).

This section mostly talks about the question of coarse-graining, but you can see that "dx dp" is sort of put in by hand in the classical version: https://en.wikipedia.org/wiki/Entropy_(statistical_thermodynamics)#Counting_of_microstates

I wish I had a better citation but I'm not sure I do.

In general it seems like (2) is talked about more in the literature, even though I think (1) is more interesting. This could be because Liouville's theorem provides enough justification for most people's tastes.

Finally, knowing "how big your microstates are" is what tells you where quantum effects kick in. (Or vice versa -- Planck estimated the value of the Planck constant by adjusting the spacing of his quantized energy levels until his predictions for blackbody radiation matched the data.)
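A minimal numeric illustration of point (1), using four coarse-grained cells instead of a continuum (the distribution and the two measures here are made up):

```python
import math

# Measure-relative entropy -sum p*log2(p/m), the discrete analogue of
# Jaynes' continuous entropy. The same distribution p gets a different
# entropy under different reference measures m.
p = [0.4, 0.3, 0.2, 0.1]
m_uniform = [0.25] * 4            # uniform cells: the usual "dx dp" choice
m_skewed = [0.1, 0.2, 0.3, 0.4]   # a different, equally normalized coarse-graining

def rel_entropy(p, m):
    return -sum(pi * math.log2(pi / mi) for pi, mi in zip(p, m))

print(rel_entropy(p, m_uniform))
print(rel_entropy(p, m_skewed))   # different number, same distribution
```

With the uniform measure this is just the Shannon entropy shifted by log2(4); with any other measure you get a genuinely different quantity, which is why the choice matters for thermodynamic predictions.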

**Adam Scherlis (adam-scherlis)** on Introduction to abstract entropy · 2022-10-25T20:45:40.587Z · LW · GW

I think I was a little confused about your comment and leapt to one possible definition of S() which doesn't satisfy all the desiderata you had. Also, I don't like my definition anymore, anyway.

**Disclaimer: This is probably not a good enough definition to be worth spending much time worrying about.**

First things first:

> We may perhaps think of fundamental "microstates" as (descriptions of) "possible worlds", or complete, maximally specific possible ways the world may be. Since all possible worlds are mutually exclusive (just exactly one possible world is the actual world), every proposition can be seen as a disjunction of such possible worlds: the worlds in which the proposition is true.

I think this is indeed how we should think of "microstates". (I don't want to use the word "macrostate" at all, at this point.)

I was thinking of something like: given a probability distribution p and a proposition A, define

"S(A) under p" $= \dfrac{\sum_x -p(x) \log p(x)}{\sum_x p(x)}$

where the sums are over all microstates $x$ in $A$. Note that the denominator is equal to $p(A)$.

I also wrote this as S(A) = expectation of (-log p(x)) conditional on A, or $S(A) = \mathbb{E}[-\log p(x) \mid A]$, but I think "log p" was not clearly "log p(x) for a microstate x" in my previous comment.

I also defined a notation p_A to represent the probability distribution that assigns probability 1/|A| to each x in A and 0 to each x not in A.

I used T to mean a tautology (in this context: the full set of microstates).

Then I pointed out a couple consequences:

- Typically, when people talk about the "entropy of a macrostate A", they mean something equal to $\log |A|$. Conceptually, this is based on the calculation $\sum_{x \in A} -\frac{1}{|A|} \log \frac{1}{|A|} = \log |A|$, which is the same as either "S(A) under p_A" (in my goofy notation) or "S(T) under p_A", but I was claiming that you should think of it as the latter.
- The (Shannon/Gibbs) entropy of p, for a distribution p, is equal to "S(T) under p" in this notation.
- Finally, for a microstate x in any distribution p, we get that "S({x}) under p" is equal to -log p(x).

All of this satisfied *my* goals of including the most prominent concepts in Alex's post:

- log |A| for a macrostate A
- Shannon/Gibbs entropy of a distribution p
- -log p(x) for a microstate x

And a couple other goals:

- Generalizing the Shannon/Gibbs entropy, which is $-\sum_x p(x) \log p(x)$, in a natural way to incorporate a proposition A (by making the expectation into a conditional expectation)
- Not doing too much violence to the usual meaning of "entropy of macrostate A" or "the entropy of p" in the process

But it did so at the cost of:

- making "the entropy of macrostate A" and "S(A) under p" two different things
- contradicting standard terminology and notation anyway
- reinforcing the dependence on microstates and the probabilities of microstates, contrary to what you wanted to do

So I would probably just ignore it and do your own thing.

**Adam Scherlis (adam-scherlis)** on Beyond Kolmogorov and Shannon · 2022-10-25T19:07:48.655Z · LW · GW

Sorry if this is a spoiler for your next post, but I take issue with the heading "Standard measures of information theory do not work" and the implication that this post contains the pre-Crutchfield state of the art.

The standard approach to this in information theory (which underlies the loss function of autoregressive LMs) isn't to try to match the Shannon entropy of the marginal distribution of bits (a 50-50 distribution in your post), it's to treat the generative model as a distribution for each bit *conditional on the previous bits* and use the *cross-entropy* of that distribution under the data distribution as the loss function or measure of goodness of the generative model.

So in this example, "look at the previous bits, identify the current position relative to the 01x01x pattern, and predict 0, 1, or [50-50 distribution] as appropriate" is the best you can do (given sufficient data for the 50-50 proportion to be reasonably accurate) and is indeed an accurate model of the process that generated the data.

We can see the pattern and take the current position into account because the distribution is conditional on previous bits.

Predicting 011011011... doesn't do as well because cross-entropy penalizes unwarranted overconfidence.

Predicting 50-50 for each bit doesn't do as well because cross-entropy still cares about successful predictions.

(Formally, cross-entropy is an expectation over the data distribution instead of an empirical average over a bunch of sampled data, but the term is used in both cases in practice. "Log[-likelihood] loss" and "the log scoring rule" are other common terms for the empirical version.)
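The three predictors above can be compared directly on a simulated "01x" stream (a sketch; the process and predictors are my reconstruction of the post's example, with x a fair coin flip and positions identified by index mod 3):

```python
import math
import random

random.seed(0)
data = []
for _ in range(10_000):
    data += [0, 1, random.randint(0, 1)]  # the 01x pattern, x random

def loss(predict):
    """Average log loss in bits; predict(i, prefix) -> P(bit i is 1)."""
    total = 0.0
    for i, b in enumerate(data):
        p1 = predict(i, data[:i])
        total += -math.log2(p1 if b == 1 else 1 - p1)
    return total / len(data)

def optimal(i, prefix):       # knows the pattern, honest about the coin flip
    return [0.0, 1.0, 0.5][i % 3]

def periodic_011(i, prefix):  # overconfident "011011..." prediction
    return [0.001, 0.999, 0.999][i % 3]

def uniform(i, prefix):       # ignores the pattern entirely
    return 0.5

print(loss(optimal), loss(periodic_011), loss(uniform))
```

The honest conditional model achieves 1/3 bit per symbol, the uniform model pays a full bit, and the overconfident 011 model does worst of all, exactly matching the ranking argued above.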

As I said above, this isn't just a standard information theory approach to this, it's actually how GPT-3 and other LLMs were trained.

I'm curious about Crutchfield's thing, but so far not convinced that standard information theory isn't adequate in this context.

(I think Kolmogorov complexity is also relevant to LLM interpretability, philosophically if not practically, but that's beyond the scope of this comment.)

**Adam Scherlis (adam-scherlis)** on Introduction to abstract entropy · 2022-10-25T18:02:35.472Z · LW · GW

I agree with all the claims in this comment and I rather like your naming suggestions! Especially the "P-entropy of Q = Q-complexity of P" trick which seems to handle many use cases nicely.

(So the word "entropy" wasn't really my crux? Maybe not!)

**Adam Scherlis (adam-scherlis)** on Introduction to abstract entropy · 2022-10-25T04:17:43.915Z · LW · GW

I wanted to let that comment be about the interesting question of how we unify these various things.

But on the ongoing topic of "why not call all this entropy, if it's all clearly part of the same pattern?":

When the definition of some F(x) refers to x twice, it's often useful to replace one of them with y and call that G(x, y). But it's usually not good for communication to choose a name for G(x, y) that (almost) everyone else uses exclusively for F(x), *especially if you aren't going to mention both x and y every time you use it*, and doubly especially if G is already popular enough to have lots of names of its own (you might hate those names, but get your own[1]).

e.g.: `x*y` is not "the square of x and y", much less "the square of x [and y is implied from context]", and the dot product `v.w` is not "the norm-squared of v and w", etc.

[1] might I suggest "xentropy"?

**Adam Scherlis (adam-scherlis)**on Introduction to abstract entropy · 2022-10-25T04:14:33.006Z · LW · GW

> From my perspective, the obvious rejoinder to "entropy is already two-place" is "insofar as entropy is two-place, cross-entropy is three-place!".

I think this is roughly where I'm at now.

After thinking a bit and peeking at Wikipedia, the situation seems to be:

The differential entropy of a probability density p is usually defined as

$$h(p) = -\int p(x) \log p(x)\, dx$$

This is unfortunate, because it isn't invariant under coordinate transformations on x. A more principled (e.g. invariant) thing to write down, courtesy of Jaynes, is

$$H(p; m) = -\int p(x) \log \frac{p(x)}{m(x)}\, dx$$

where $m(x)$ is a density function for some measure $\mu$. We can also write this as

$$H(P; \mu) = -\int \log \frac{dP}{d\mu}\, dP$$

(Jaynes' continuous entropy of P with respect to $\mu$)

in terms of a probability measure P with $\frac{dP}{d\mu} = \frac{p}{m}$, which is a bit more clearly invariant.
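As a toy illustration of the invariance problem (my own example: a uniform distribution and the change of variables y = 2x), the naive definition changes under a coordinate transformation, while Jaynes' version doesn't, provided the reference density m transforms along with p:

```python
import math

# x ~ Uniform[0, 1] has density p(x) = 1; y = 2x ~ Uniform[0, 2] has density 1/2.
# Naive differential entropy of Uniform[0, w] is -(1/w) log2(1/w) * w = log2(w).
h_x = math.log2(1.0)  # 0 bits
h_y = math.log2(2.0)  # 1 bit: same distribution, different "entropy"

# Jaynes' entropy -integral of p log2(p/m), with a reference density m that
# transforms the same way as p (m = 1 on [0, 1] in x-coordinates, so m = 1/2
# on [0, 2] in y-coordinates). The integrand is constant, so it's width * value:
H_x = -1.0 * 1.0 * math.log2(1.0 / 1.0)  # width 1, p = 1, m = 1
H_y = -2.0 * 0.5 * math.log2(0.5 / 0.5)  # width 2, p = 1/2, m = 1/2
```

Both Jaynes entropies come out to 0: the factor of 2 cancels inside the log.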

Now we can define a cross-entropy-like thing as

$$H(P, Q; \mu) = -\int \log \frac{dQ}{d\mu}\, dP$$

(continuous cross-entropy of Q under P with respect to $\mu$)

...and a small surprise is coming up. Jumping back to the discrete case, the KL divergence or "relative entropy" is

$$D_{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$

What happens when we try to write something analogous with our new continuous entropy and crossentropy? We get

$$H(P, Q; \mu) - H(P, P; \mu) = \int \log \frac{dP}{dQ}\, dP = -H(P, P; Q)$$

which looks like the right generalization of discrete KL divergence, but also happens to be (minus) the continuous entropy of P with respect to Q! (The dependence on $\mu$ drops out.) This is weird but I think it might be a red herring.

We can recover the usual definitions in the discrete (finite or infinite) case by taking $\mu$ to be a uniform measure that assigns 1 to each state (note that this is NOT a probability measure -- I never said $\mu$ was a probability measure, and I don't know how to handle the countably infinite case if it is.)

So maybe the nicest single thing to define, for the sake of making everything else a concisely-specified special case, is

$$H(P, Q; \mu) = -\int \log \frac{dQ}{d\mu}\, dP$$

("H" chosen because it is used for both entropy and cross-entropy, afaik unlike "S").

We could take KL divergence as a two-argument atomic thing, or work with $\mathbb{E}_P[f]$ for a scalar function f and a distribution P, but I think these both make it cumbersome to talk about the things everyone wants to talk about all the time. I'm open to those possibilities, though.

The weird observation about KL divergence above is a consequence of the identity

$$H(P, Q; R) = H(P, Q; \mu) - H(P, R; \mu)$$

but I think this is a slightly strange fact because R is a probability measure and $\mu$ is a general measure. Hmm.
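(For what it's worth, the identity is one line of algebra with Radon-Nikodym derivatives, assuming everything is absolutely continuous with respect to everything else:)

```latex
H(P, Q; R)
  = -\int \log\frac{dQ}{dR}\, dP
  = -\int \left( \log\frac{dQ}{d\mu} - \log\frac{dR}{d\mu} \right) dP
  = H(P, Q; \mu) - H(P, R; \mu)
```

Setting R = P recovers the observation about KL divergence above.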

We also have

$$H(P, P; \mu) \le 0$$

in the case that $\mu$ is a probability measure, but not otherwise.

And I can rattle off some claims about special cases in different contexts:

- A (the?) fundamental theorem of entropy-like stuff is $H(P, Q; \mu) \ge H(P, P; \mu)$, or "crossentropy >= entropy".
- In the discrete case, practitioners of information theory and statistical physics tend to assume $\mu$ is uniform with measure 1 on each state, and (almost?) never discuss it directly. I'll call this measure U.
- They reserve the phrase "entropy of P" (in the discrete case) for H(P, P; U), and this turns out to be an extremely useful quantity.
- They use "cross-entropy of Q relative to P" to mean H(P, Q; U)
- They use "KL divergence from Q to P" (or "relative entropy ...") to mean H(P, Q; P) = -H(P, P; Q) = H(P, Q; U) - H(P, P; U)
- For a state s, let's define delta_s to be the distribution that puts measure 1 on s and 0 everywhere else.
- For a code C, let's define f_C to be the "coin flip" distribution that puts probability 2^-len(C(s)) on state s (or the sum of this over all codewords for s, if there are several). (Problem: what if C doesn't assign all its available strings, so that this isn't a probability distribution? idk.)
- (see also https://en.wikipedia.org/wiki/Cross_entropy#Motivation)
- If a state s has only one codeword C(s), the length of that codeword is len(C(s)) = H(delta_s, f_C; U)
- Note that this is not H(delta_s, delta_s; U), the entropy of delta_s, which is always zero. This is a lot of why I don't like taking "entropy of s" to mean H(delta_s, f_C; U).
- Information theorists use "surprisal of s under P" (or "self-information ...") to mean H(delta_s, P; U)
- The expected length of a code C, for a source distribution P, is H(P, f_C; U)
- Consequence from fundamental thm: a code is optimal iff f_C = P; the minimal expected length of the code is the entropy of P
- If you use the code C when you should've used D, your expected length is H(f_D, f_C; U) and you have foolishly wasted H(f_D, f_C; f_D) bits in expectation.
- Physicists use "entropy of a macrostate" to mean things like H(p_A, p_A; U) where p_A is a uniform probability measure over a subset A of the state space, and possibly also for other things of the form H(P, P; U) where P is some probability measure that corresponds to your beliefs if you only know macroscopic things.
- Kolmogorov complexity ("algorithmic entropy") of a state s is, as you argue, maybe best seen as an approximation of H(delta_s, p_Solomonoff; U). I don't see a good way of justifying the name by analogy to any "entropy" in any other context, but maybe I'm missing something.
- This makes that one quote from Wiktionary into something of the form: "H(P, p_Solomonoff; U) ~= H(P, P; U) whenever P is computed by a short program". This makes sense for some sense of "~=" that I haven't thought through.
- When working in continuous spaces like R, people typically define "[differential] entropy of P" as H(P, P; L) where L is the Lebesgue measure (including the uniform measure on R). If they are working with coordinate transformations from R to R, they implicitly treat one variable as the "true" one which the measure should be uniform on, often without saying so.
- They define cross-entropy as H(P, Q; L)....
- ...and KL divergence as H(P, Q; L) - H(P, P; L) = H(P, Q; P) (same as discrete case!)
- But sometimes people use Jaynes' definitions, which replace L with an arbitrary (and explicitly-specified) measure, like we're doing here. In that case, "entropy" is H(P, P; mu) for an arbitrary mu.
- The properness of the logarithmic scoring rule also follows from the fundamental theorem. Your (subjective) expected score is H(your_beliefs, your_stated_beliefs; U) and this is minimized for your_stated_beliefs = your_beliefs.
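The fundamental theorem in the first bullet is easy to spot-check numerically (a sketch of my own, using random five-state distributions; H(P, Q) below is the discrete H(P, Q; U)):

```python
import math
import random

random.seed(1)

def normalize(ws):
    total = sum(ws)
    return [w / total for w in ws]

def H(P, Q):
    # Discrete cross-entropy H(P, Q; U), with the uniform counting measure U
    # left implicit; H(P, P) is the ordinary Shannon entropy of P.
    return -sum(p * math.log2(q) for p, q in zip(P, Q))

for _ in range(1000):
    P = normalize([random.random() for _ in range(5)])
    Q = normalize([random.random() for _ in range(5)])
    # Crossentropy >= entropy; the (nonnegative) gap is the KL divergence.
    assert H(P, Q) >= H(P, P)
```

This is also exactly the properness of the log scoring rule in the last bullet: reporting Q when you believe P can only raise your expected loss.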

**Adam Scherlis (adam-scherlis)**on Introduction to abstract entropy · 2022-10-24T18:11:21.017Z · LW · GW

> My argument above is ofc tuned to case (2), and it's plausible to me that it pushes you off the fence towards "no wiggle room".

Yup, I think I am happy to abandon the wiggle room at this point, for this reason.

> if the statespace is uncountably infinite then we need a measure in order to talk about entropy (and make everything work out nicely under change-of-variables). And so in the general case, entropy is already a two-place predicate involving a distribution and some sort of measure.

~~I think my preferred approach to this is that the density p(x) is not really the fundamental object, and should be thought of as dP/dmu(x), with the measure in the denominator. We multiply by dmu(x) in the integral for entropy in order to remove this dependence on mu that we accidentally introduced.~~ EDIT: this is flagrantly wrong because log(p) depends on the measure also. You're right that this is really a function of the distribution and the measure; I'm not sure offhand if it's crossentropy, either, but I'm going to think about this more. (This is an embarrassing mistake because I already knew differential entropy was cursed with dependence on a measure -- quantum mechanics famously provides the measure on phase-space that classical statistical mechanics took as axiomatic.)

For what it's worth, I've heard the take "entropy and differential entropy are different sorts of things" several times; I might be coming around to that, now that I see another slippery slope on the horizon.