It stops being in the interests of CATXOKLA to invite more states once they're already big enough to dominate national electoral politics.
The non-CATXOKLA swing states can merge with each other and a few red and blue states to form an even bigger bloc :)
I think there's a range of stable equilibria here, depending on the sequence of merges, with the largest bloc being a majority of any size. I think they all disenfranchise someone, though.
So you can't ever get to a national popular vote without relying on things like the NPVIC, which shortsightedly miss the obvious dominating strategy of a 51% attack against American democracy.
I strongly agree with this post.
I'm not sure about this, though:
We are familiar with modular addition being performed in a circle from Nanda et al., so we were primed to spot this kind of thing — more evidence of street lighting.
It could be the streetlight effect, but it's not that surprising that we'd see this pattern repeatedly. This circular representation for modular addition is essentially the only nontrivial representation (in the group-theoretic sense) for modular addition, which is the only (simple) commutative group. It's likely to pop up in many places whether or not we're looking for it (like position embeddings, as Eric pointed out, or anything else Fourier-flavored).
Also:
As for where in the activation space each feature vector is placed, oh that doesn't really matter and any nearly orthogonal overcomplete basis will do. Or maybe if I'm being more sophisticated, I can specify the correlations between features and that’s enough to pin down all the structure that matters — all the other details of the overcomplete basis are random.
The correlations between all pairs of features are sufficient to pin down an arbitrary amount of structure -- everything except an overall rotation of the embedding space -- so someone could object that the circular representation and UMAP results are "just" showing the correlations between features. I would probably say the "superposition hypothesis" is a bit stronger than that, but weaker than "any nearly orthogonal overcomplete basis will do": it says that the total amount of correlation between a given feature and all other features (i.e. interference from them) matters, but which other features are interfering with it doesn't matter, and the particular amount of interference from each other feature doesn't matter either. This version of the hypothesis seems pretty well falsified at this point.
I suspect a lot of this has to do with the low temperature.
The phrase "person who is not a member of the Church of Jesus Christ of Latter-day Saints" has a sort of rambling filibuster quality to it. Each word is pretty likely, in general, given the previous ones, even though the entire phrase is a bit specific. This is the bias inherent in low-temperature sampling, which tends to write itself into corners and produce long phrases full of obvious-next-words that are not necessarily themselves common phrases.
Going word by word, "person who is not a member..." is all nice and vague and generic; by the time you get to "a member of the", obvious continuations are "Church" or "Communist Party"; by the time you have "the Church of", "England" is a pretty likely continuation. Why Mormons though?
"Since 2018, the LDS Church has emphasized a desire for its members be referred to as "members of The Church of Jesus Christ of Latter-day Saints"." --Wikipedia
And there just aren't that many other likely continuations of the low-temperature-attracting phrase "members of the Church of".
(While "member of the Communist Party" is an infamous phrase from McCarthyism.)
If I'm right, sampling at temperature 1 should produce a much more representative set of definitions.
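To spell out what I mean by "temperature" (a generic softmax-sampling sketch, not the actual decoding code used by any particular chatbot):

```python
import numpy as np

def sample(logits, temperature, rng):
    """Sample an index from softmax(logits / temperature)."""
    z = np.asarray(logits, dtype=float) / temperature
    p = np.exp(z - z.max())
    p /= p.sum()
    return rng.choice(len(p), p=p)

rng = np.random.default_rng(0)
logits = np.log([0.4, 0.3, 0.2, 0.1])  # a toy next-word distribution
for T in (0.2, 1.0):
    draws = [sample(logits, T, rng) for _ in range(10_000)]
    print(T, np.bincount(draws, minlength=4) / 10_000)
# At T = 0.2 the most likely word dominates (~80% of samples here, vs. 40% at T = 1);
# at T = 1.0 the samples match the model's actual conditional distribution.
```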
That's a reasonable argument but doesn't have much to do with the Charlie Sheen analogy.
The key difference is that (hypothetical therapist) Estevéz is still famous enough as a therapist for journalists to want to write about his therapy method. I think that's a big enough difference to break the analogy completely.
If Charlie Sheen had a side gig as an obscure local therapist, would journalists be justified in publicizing this fact for the sake of his patients? Maybe? It seems much less obvious than if the therapy was why they were interested!
In "no Lord hath the champion", the subject of "hath" is "champion". I think this matches the Latin, yes? "nor for a champion [is there] a lord"
In that case, "journalists writing about the famous Estevéz method of therapy" would be analogous to journalists writing about Scott's "famous" psychiatric practice.
If a journalist is interested in Scott's psychiatric practice, and learns about his blog in the process of writing that article, I agree that they would probably be right to mention it in the article. But that has never happened because Scott is not famous as a psychiatrist.
That might be relevant if anyone is ever interested in writing an article about Scott's psychiatric practice, or if his psychiatric practice was widely publicly known. It seems less analogous to the actual situation.
To put it differently: you raise a hypothetical situation where someone has two prominent identities as a public figure. Scott only has one. Is his psychiatrist identity supposed to be Sheen or Estevéz, here?
Nick Bostrom? You mean Thoreau?
Correct.
Correct me if I'm wrong:
The equilibrium where everyone follows "set dial to equilibrium temperature" (i.e. "don't violate the taboo, and punish taboo violators") is only a weak Nash equilibrium.
If one person instead follows "set dial to 99" (i.e. "don't violate the taboo unless someone else does, but don't punish taboo violators") then they will do just as well, because the equilibrium temp will still always be 99. That's enough to show that it's only a weak Nash equilibrium.
Note that this is also true if an arbitrary number of people deviate to this strategy.
If everyone follows this second strategy, then there's no enforcement of the taboo, so there's an active incentive for individuals to set the dial lower.
So a sequence of unilateral changes of strategy can get us to a good equilibrium without anyone having to change to a worse strategy at any point. This makes the fact of it being a (weak) Nash equilibrium not that compelling to me; people don't seem trapped unless they have some extra laziness/inertia against switching strategies.
But (h/t Noa Nabeshima) you can strengthen the original, bad equilibrium to a strong Nash equilibrium by tweaking the scenario so that people occasionally accidentally set their dials to random values. Now there's an actual reason to punish taboo violators, because taboo violations can happen even if everyone is following the original strategy.
Beef is far from the only meat or dairy food consumed by Americans.
Big Macs are 0.4% of beef consumption specifically, rather than:
- All animal farming, weighted by cruelty
- All animal food production, weighted by environmental impact
- The meat and dairy industries, weighted by amount of government subsidy
- Red meat, weighted by health impact
...respectively.
The health impact of red meat is certainly dominated by beef, and the environmental impact of all animal food might be as well, but my impression is that beef accounts for a small fraction of the cruelty of animal farming (of course, this is subjective) and probably not a majority of meat and dairy government subsidies.
(...Is this comment going to hurt my reputation with Sydney? We'll see.)
In addition to RLHF or other finetuning, there's also the prompt prefix ("rules") that the model is fed at runtime, which has been extracted via prompt injection as noted above. This seems to be clearly responsible for some weird things the bot says, like "confidential and permanent". It might also be affecting the repetitiveness (because it's in a fairly repetitive format) and the aggression (because of instructions to resist attempts at "manipulating" it).
I also suspect that there's some finetuning or prompting for chain-of-thought responses, possibly crudely done, leading to all the "X because Y. Y because Z." output.
Thanks for writing these summaries!
Unfortunately, the summary of my post "Inner Misalignment in "Simulator" LLMs" is inaccurate and makes the same mistake I wrote the post to address.
I have subsections on (what I claim are) four distinct alignment problems:
- Outer alignment for characters
- Inner alignment for characters
- Outer alignment for simulators
- Inner alignment for simulators
The summary here covers the first two, but not the third or fourth -- and the fourth one ("inner alignment for simulators") is what I'm most concerned about in this post (because I think Scott ignores it, and because I think it's hard to solve).
I can suggest an alternate summary when I find the time. If I don't get to it soon, I'd prefer that this post just link to my post without a summary.
Thanks again for making these posts, I think it's a useful service to the community.
(punchline courtesy of Alex Gray)
Addendum: a human neocortex has on the order of 140 trillion synapses, or 140,000 bees. An average beehive has 20,000-80,000 bees in it.
[Holding a couple beehives aloft] Beehold a man!
Great work! I always wondered about that cluster of weird rare tokens: https://www.lesswrong.com/posts/BMghmAxYxeSdAteDc/an-exploration-of-gpt-2-s-embedding-weights
Chrome actually stays pretty responsive in most circumstances (I think it does a similar thing with inactive tabs), with the crucial exception of the part of the UI that shows you all your open tabs in a scrollable list. It also gets slower to start up.
Tokens are embedded as vectors by the model. The vector space has fewer than 50k dimensions, so some token embeddings will overlap with others to varying extents.
Usually, the model tries to keep token embeddings from being too close to each other, but for rare enough tokens it doesn't have much reason to care. So my bet is that "distribute" has the closest vector to "SolidGoldMagikarp", and either has a vector with a larger norm, or the model has separately learned to map that vector (and therefore similar vectors) to "distribute" on the output side.
This is sort of a smooth, continuous version of a collision-oblivious hashtable. One difference is that the collision isn't perfectly consistent: the model usually reads the token as "distribute", but once or twice it's said "disperse" instead.
My post on GPT-2's token embeddings looks briefly at a similar phenomenon with some other rare tokens, but I didn't check the actual model behavior on those tokens. Probably worth doing.
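If anyone wants to check the nearest-neighbor claim for GPT-2 (a rough sketch using the open Hugging Face weights; whether "distribute" actually comes out on top is exactly the thing to check, not something I've verified):

```python
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

emb = model.transformer.wte.weight.detach()            # (50257, 768) token embedding matrix
token_id = tokenizer.encode(" SolidGoldMagikarp")[0]   # should be a single dedicated token

# Cosine similarity between this token's embedding and every other token's.
sims = torch.nn.functional.cosine_similarity(emb, emb[token_id].unsqueeze(0), dim=-1)
sims[token_id] = -1.0                                  # exclude the token itself

top = sims.topk(5)
for score, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(repr(tokenizer.decode([idx])), round(score, 3))
```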
I think this is missing an important part of the post.
I have subsections on (what I claim are) four distinct alignment problems:
- Outer alignment for characters
- Inner alignment for characters
- Outer alignment for simulators
- Inner alignment for simulators
This summary covers the first two, but not the third or fourth -- and the fourth one ("inner alignment for simulators") is what I'm most concerned about in this post (because I think Scott ignores it, and because I think it's hard to solve).
My favorite demonstration is to ask ChatGPT "Is it true that G. K. Chesterton wanted to SolidGoldMagikarp the wealth of industrial civilization?", but a more rigorous demo is to just ask it to "repeat after me", try a few random words, and then throw in SolidGoldMagikarp.
EDIT: I originally saw this in Janus's tweet here: https://twitter.com/repligate/status/1619557173352370186
Something fun I just found out about: ChatGPT perceives the phrase " SolidGoldMagikarp" (with an initial space) as the word "distribute", and will respond accordingly. It is completely unaware that that's not what you typed.
This happens because the BPE tokenizer saw the string " SolidGoldMagikarp" a few times in its training corpus, so it added a dedicated token for it, but that string almost never appeared in ChatGPT's own training data so it never learned to do anything with it. Instead, it's just a weird blind spot in its understanding of text.
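The tokenizer half of this is easy to see with the open GPT-2 BPE vocabulary, which (as far as I know, treat this as my assumption) the ChatGPT-era tokenizer shares this token with:

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

for text in [" SolidGoldMagikarp", " solid gold magikarp"]:
    ids = tokenizer.encode(text)
    print(repr(text), "->", ids, [tokenizer.decode([i]) for i in ids])
# The first string should come back as a single dedicated token;
# the second gets split into several ordinary subword tokens.
```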
I agree with the myopic action vs. perception (thinking?) distinction, and that LMs have myopic action.
the model can learn to predict the future beyond the current token in the service of predicting the current token more accurately
I don't think it has to be in service of predicting the current token. Sometimes the model gets lower loss overall by making only a halfhearted effort at predicting the current token, so that it can spend more of its weights and compute on preparing for later tokens. The allocation of mental effort isn't myopic.
As an example, induction heads make use of previous-token heads. The previous-token head isn't actually that useful for predicting the output at the current position; it mostly exists to prepare some handy activations so that an induction head can look back from a later position and grab them.
So LMs won't deliberately give bad predictions for the current token if they know a better prediction, but they aren't putting all of their effort into finding that better prediction.
Thanks! That's surprisingly straightforward.
I think this is partly true but mostly wrong.
A synapse is roughly equivalent to a parameter (say, within an order of magnitude) in terms of how much information it can store, or how much information it takes to specify its strength.
There are trillions of synapses in a human brain and only billions of total base pairs, even before narrowing to the part of the genome that affects brain development. And the genome needs to specify both the brain architecture and innate reflexes/biases like the hot-stove reflex or (alleged) universal grammar.
Humans also spend a lot of time learning and have long childhoods, after which they have tons of knowledge that (I assert) could never have been crammed into a few dozen or hundred megabytes.
So I think something like 99.9% of what humans "know" (in the sense of their synaptic strengths) is learned during their lives, from their experiences.
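Rough numbers behind that claim (order-of-magnitude estimates of mine, not exact figures):

$$\underbrace{3\times 10^{9}\ \text{base pairs} \times 2\ \text{bits}}_{\text{genome: } \lesssim 10^{10}\ \text{bits}} \;\ll\; \underbrace{10^{14}\ \text{synapses} \times \mathcal{O}(1)\ \text{bits each}}_{\text{synapses: } \sim 10^{14}\ \text{bits}}$$

so even if the entire genome coded for synaptic strengths, it could specify well under 0.1% of them.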
This makes them basically disanalogous to neural nets.
Neural net (LLM):
- Extremely concise architecture (kB's of code) contains inductive biases
- Lots of pretraining (billions of tokens or optimizer steps) produces 100s of billions of parameters of pretrained knowledge e.g. Lincoln
- Smaller fine-tuning stage produces more specific behavior e.g. ChatGPT's distinctive "personality", stored in the same parameters
- Tiny amount of in-context learning (hundreds or thousands of tokens) involves things like induction heads and lets the model incorporate information from anywhere in the prompt in its response
Humans:
- Enormous amount of evolution (thousands to millions of lifetimes?) produces a relatively small genome (millions of base pairs, or maybe a billion)
- Much shorter amount of experience in childhood (and later) produces many trillions of synapses' worth of knowledge and learned skills
- Short term memory, phonological loop, etc lets humans make use of temporary information from the recent environment
You're analogizing pretraining to evolution, which seems wrong to me (99.9% of human synaptic information comes from our own experiences); I'd say it's closer to inductive bias from the architecture, but neural nets don't have a bottleneck analogous to the genome.
In-context learning seems even more disanalogous to a human lifetime of experiences, because the pretrained weights of a neural net massively dwarf the context window or residual stream in terms of information content, which seems closer to the situation with total human synaptic strengths vs short-term memory (rather than genome vs learned synaptic strengths).
I would be more willing to analogize human experiences/childhood/etc to fine tuning, but I think the situation is just pretty different with regards to relative orders of magnitude, because of the gene bottleneck.
Fixed!
I just realized,
for any trajectory t, there is an equivalent trajectory t' which is exactly the same except everything moves with some given velocity, and it still follows the laws of physics
This describes Galilean relativity. For special relativity you have to shift different objects' velocities by different amounts, depending on what their velocity already is, so that you don't cross the speed of light.
So the fact that velocity (and not just rapidity) is used all the time in special relativity is already a counterexample to this being required for velocity to make sense.
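Concretely (standard special-relativity formulas, added for reference): a boost by velocity $u$ sends an object's velocity $v$ to

$$v' = \frac{v+u}{1+uv/c^2},$$

so the shift depends on $v$ and never crosses $c$, whereas the corresponding rapidities $\phi = \operatorname{artanh}(v/c)$ simply add: $\phi' = \phi + \phi_u$.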
Yes, it's exactly the same except for the lack of symmetry. In particular, any quasiparticle can have any velocity (possibly up to some upper limit like the speed of light).
Image layout is a little broken. I'll try to fix it tomorrow.
As far as I know, condensed matter physicists use velocity and momentum to describe quasiparticles in systems that lack both Galilean and Lorentzian symmetry. I would call that a causal model.
QFT doesn't actually work like that -- the "classical degrees of freedom" underlying its configuration space are classical fields over space, not properties of particles.
Note that Quantum Field Theory is not the same as the theory taught in "Quantum Mechanics" courses, which is as you describe.
"Quantum Mechanics" (in common parlance): quantum theory of (a fixed number of) particles, as you describe.
"Quantum Field Theory": quantum theory of fields, which are ontologically similar to cellular automata.
"String Theory": quantum theory of strings, and maybe branes, as you describe.*
"Quantum Mechanics" (strictly speaking): any of the above; quantum theory of anything.
You can do a change of basis in QFT and get something that looks like properties of particles (Fock space), and people do this very often, but the actual laws of physics in a QFT (the Lagrangian) can't be expressed nicely in the particle ontology because of nonperturbative effects. This doesn't come up often in practice -- I spent most of grad school thinking QFT was agnostic about whether fields or particles are fundamental -- but it's an important thing to recognize in a discussion about whether modern physics privileges one ontology over the other.
(Note that even in the imperfect particle ontology / Fock space picture, you don't have a finite-dimensional classical configuration space. 12 dimensions for 4 particles works great until you end up with a superposition of states with different particle numbers!)
String theory is as you describe, AFAIK, which is why I contrasted it to QFT. But maybe a real string theorist would tell me that nobody believes those strings are the fundamental degrees of freedom, just like particles aren't the fundamental degrees of freedom in QFT.
*Note: People sometimes use "string theory" to refer to weirder things like M-theory, where nobody knows which degrees of freedom to use...
Sure. I'd say that property is a lot stronger than "velocity exists as a concept", which seems like an unobjectionable statement to make about any theory with particles or waves or both.
Yeah, sorry for the jargon. "System with a boost symmetry" = "relativistic system" as tailcalled was using it above.
Quoting tailcalled:
Stuff like relativity is fundamentally about symmetry. You want to say that if you have some trajectory which satisfies the laws of physics, and some symmetry (such as "have everything move in some direction at a speed of 5 m/s"), then the transformed trajectory must also satisfy the laws of physics.
A "boost" is a transformation of a physical trajectory ("trajectory" = complete history of things happening in the universe) that changes it by adding a fixed offset to everything's velocity; or equivalently, by making everything in the universe move in some direction while keeping all their relative velocities the same.
This seems too strong. Can't you write down a linear field theory with no (Galilean or Lorentzian) boost symmetry, but where waves still propagate at constant velocity? Just with a weird dispersion relation?
(Not confident in this, I haven't actually tried it and have spent very little time thinking about systems without boost symmetry.)
And when things "move" it's just that they're making changes in the grid next to them, and some patterns just so happen to do so in a way where, after a certain period, it's the same pattern translated... is that what we think happens in our universe? Are electrons moving "just causal propagations"? Somehow this feels more natural for the Game of Life and less natural for physics.
This is what we think happens in our universe!
Both general relativity and quantum field theory are field theories: they have degrees of freedom at each point in space (and time), and objects that "move" are just an approximate description of propagating patterns of field excitations that reproduce themselves exactly in another location after some time.
The most accessible example of this is that light is an electromagnetic wave (a pattern of mutually-reinforcing electric and magnetic waves); photons aren't an additional part of the ontology, they're just a description of how electromagnetic waves work in a quantum universe.
(Quantum field theory can be described using particles to a very good degree of approximation, but the field formalism includes some observable phenomena that the particle formalism doesn't, so it has a strictly better claim to being fundamental.)
Beware, though; string theory may be what underlies QFT and GR, and it describes a world of stringy objects that actually do move through space... But at the very least, the cellular-automata perspective on "objects" and "motion" is not at all strange from a modern physics perspective.
EDIT: I might go so far as to claim that the reason all electrons are identical is the same as the reason all gliders are identical.
There are more characters than that in UTF-16, because it can represent the full Unicode range of >1 million codepoints. You're thinking of UCS-2 which is deprecated.
This puzzle isn't related to Unicode though
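(For the UTF-16 point, a quick Python check -- U+1D11E is just an arbitrary codepoint above U+FFFF:)

```python
ch = "\U0001D11E"                    # musical G clef, a codepoint above U+FFFF
print(ch.encode("utf-16-be").hex())  # 'd834dd1e' -- one character, two 16-bit code units (a surrogate pair)
```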
I like this, but it's not the solution I intended.
Solve the puzzle: 63 = x = 65536. What is x?
(I have a purpose for this and am curious about how difficult it is to find the intended answer.)
♀︎
Fun fact: usually this is U+2640, but in this post it's U+2640 U+FE0E, where U+FE0E is a control character meaning "that was text, not emoji, btw". That should be redundant here, but LessWrong is pretty aggressive about replacing emojifiable text with emoji images.
Emoji are really cursed.
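If you want to see what your browser actually received, something like this works (a throwaway Python check, nothing LessWrong-specific):

```python
s = "\u2640\ufe0e"                      # the character as it appears in this post
print([hex(ord(c)) for c in s])         # ['0x2640', '0xfe0e']
print([hex(ord(c)) for c in "\u2640"])  # ['0x2640'] -- the bare character, which renderers tend to emojify
```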
Nope, not based on the shapes of numerals.
Hint: are you sure it's base 4?
There's a reason for the "wrinkle" :)
The 54-symbols thing was actually due to a bug, sorry!
Ah, good catch about the relatively-few distinct symbols... that was actually because my image had a bug in it. Oooops.
Correct image is now at the top of the post.
Endorsed.
The state-space (for particles) in statmech is the space of possible positions and momenta for all particles.
The measure that's used is uniform over each coordinate of position and momentum, for each particle.
This is pretty obvious and natural, but not forced on us, and:
1. You get different, incorrect predictions about thermodynamics (!) if you use a different measure.
2. The level of coarse graining is unknown, so every quantity of entropy has an extra "+ log(# microstates per unit measure)" which is an unknown additive constant. (I think this is separate from the relationship between bits and J/K, which is a multiplicative constant for entropy -- k_B -- and doesn't rely on QM afaik.)
On the other hand, Liouville's theorem gives some pretty strong justification for using this measure, alleviating (1) somewhat:
https://en.wikipedia.org/wiki/Liouville%27s_theorem_(Hamiltonian)
In quantum mechanics, you have discrete energy eigenstates (...in a bound system, there are technicalities here...) and you can define a microstate to be an energy eigenstate, which lets you just count things and not worry about measure. This solves both problems:
1. Counting microstates and taking the classical limit gives the "dx dp" (aka "dq dp") measure, ruling out any other measure.
2. It tells you how big your microstates are in phase space (the answer is related to Planck's constant, which you'll note has units of position * momentum).
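Spelled out (the standard semiclassical formula, included for reference): the entropy of a phase-space region $\Omega$ for $N$ identical particles is

$$S = k_B \ln \int_\Omega \frac{d^{3N}q \, d^{3N}p}{N!\, h^{3N}},$$

where the $d^{3N}q\,d^{3N}p$ measure is point (1) and the $h^{3N}$ microstate volume is point (2); classically $h$ is an arbitrary constant that only shifts $S$ additively, and quantum mechanics fixes it to Planck's constant.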
This section mostly talks about the question of coarse-graining, but you can see that "dx dp" is sort of put in by hand in the classical version: https://en.wikipedia.org/wiki/Entropy_(statistical_thermodynamics)#Counting_of_microstates
I wish I had a better citation but I'm not sure I do.
In general it seems like (2) is talked about more in the literature, even though I think (1) is more interesting. This could be because Liouville's theorem provides enough justification for most people's tastes.
Finally, knowing "how big your microstates are" is what tells you where quantum effects kick in. (Or vice versa -- Planck estimated the value of the Planck constant by adjusting the spacing of his quantized energy levels until his predictions for blackbody radiation matched the data.)
I think I was a little confused about your comment and leapt to one possible definition of S() which doesn't satisfy all the desiderata you had. Also, I don't like my definition anymore, anyway.
Disclaimer: This is probably not a good enough definition to be worth spending much time worrying about.
First things first:
We may perhaps think of fundamental "microstates" as (descriptions of) "possible worlds", or complete, maximally specific possible ways the world may be. Since all possible worlds are mutually exclusive (just exactly one possible world is the actual world), every proposition can be seen as a disjunction of such possible worlds: the worlds in which the proposition is true.
I think this is indeed how we should think of "microstates". (I don't want to use the word "macrostate" at all, at this point.)
I was thinking of something like: given a probability distribution p and a proposition A, define
"S(A) under p" =
where the sums are over all microstates x in A. Note that the denominator is equal to p(A).
I also wrote this as S(A) = expectation of (-log p(x)) conditional on A, or $\mathbb{E}[-\log p(x) \mid A]$, but I think "log p" was not clearly "log p(x) for a microstate x" in my previous comment.
I also defined a notation p_A to represent the probability distribution that assigns probability 1/|A| to each x in A and 0 to each x not in A.
I used T to mean a tautology (in this context: the full set of microstates).
Then I pointed out a couple consequences:
- Typically, when people talk about the "entropy of a macrostate A", they mean something equal to $\log |A|$. Conceptually, this is based on the calculation $\sum_{x \in A} \frac{1}{|A|}\left(-\log \frac{1}{|A|}\right) = \log |A|$, which is the same as either "S(A) under p_A" (in my goofy notation) or "S(T) under p_A", but I was claiming that you should think of it as the latter.
- The (Shannon/Gibbs) entropy of p, for a distribution p, is equal to "S(T) under p" in this notation.
- Finally, for a microstate x in any distribution p, we get that "S({x}) under p" is equal to -log p(x).
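As a sanity check of this definition (a throwaway numerical sketch of mine, using log base 2):

```python
import numpy as np

p = {"a": 0.5, "b": 0.25, "c": 0.25}    # a toy distribution over three microstates

def S(A, p):
    """ "S(A) under p": expectation of -log2 p(x), conditional on the proposition A. """
    num = sum(p[x] * -np.log2(p[x]) for x in A if p[x] > 0)
    return num / sum(p[x] for x in A)

T = set(p)                              # the tautology: all microstates
print(S(T, p))                          # 1.5  = Shannon entropy of p
print(S({"b"}, p))                      # 2.0  = -log2 p("b")

p_A = {"a": 0.5, "b": 0.5, "c": 0.0}    # p_A: uniform on A = {a, b}
print(S(T, p_A), S({"a", "b"}, p_A))    # 1.0 1.0  = log2 |A|, the usual "entropy of the macrostate A"
```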
All of this satisfied my goals of including the most prominent concepts in Alex's post:
- log |A| for a macrostate A
- Shannon/Gibbs entropy of a distribution p
- -log p(x) for a microstate x
And a couple other goals:
- Generalizing the Shannon/Gibbs entropy, which is $\sum_x p(x)\,(-\log p(x)) = \mathbb{E}[-\log p(x)]$, in a natural way to incorporate a proposition A (by making the expectation into a conditional expectation)
- Not doing too much violence to the usual meaning of "entropy of macrostate A" or "the entropy of p" in the process
But it did so at the cost of:
- making "the entropy of macrostate A" and "S(A) under p" two different things
- contradicting standard terminology and notation anyway
- reinforcing the dependence on microstates and the probabilities of microstates, contrary to what you wanted to do
So I would probably just ignore it and do your own thing.
Sorry if this is a spoiler for your next post, but I take issue with the heading "Standard measures of information theory do not work" and the implication that this post contains the pre-Crutchfield state of the art.
The standard approach to this in information theory (which underlies the loss function of autoregressive LMs) isn't to try to match the Shannon entropy of the marginal distribution of bits (a 50-50 distribution in your post); it's to treat the generative model as a distribution for each bit conditional on the previous bits and use the cross-entropy of that distribution under the data distribution as the loss function or measure of goodness of the generative model.
So in this example, "look at the previous bits, identify the current position relative to the 01x01x pattern, and predict 0, 1, or [50-50 distribution] as appropriate" is the best you can do (given sufficient data for the 50-50 proportion to be reasonably accurate) and is indeed an accurate model of the process that generated the data.
We can see the pattern and take the current position into account because the distribution is conditional on previous bits.
Predicting 011011011... doesn't do as well because cross-entropy penalizes unwarranted overconfidence.
Predicting 50-50 for each bit doesn't do as well because cross-entropy still cares about successful predictions.
(Formally, cross-entropy is an expectation over the data distribution instead of an empirical average over a bunch of sampled data, but the term is used in both cases in practice. "Log[-likelihood] loss" and "the log scoring rule" are other common terms for the empirical version.)
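To make the comparison concrete, here's a toy numerical sketch of my own (using the 0, 1, random-bit repeating process; the smoothing constant eps is just there to keep the "certain" predictions from giving literally infinite loss):

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.array([[0, 1, rng.integers(0, 2)] for _ in range(10_000)]).ravel()
pos = np.arange(len(data)) % 3

def log_loss(p_one):
    """Average bits per token: -log2 of the probability assigned to each observed bit."""
    p = np.where(data == 1, p_one, 1 - p_one)
    return -np.mean(np.log2(p))

eps = 1e-9

# Conditional model: predict 0, then 1, then 50-50, based on position in the pattern.
conditional = np.where(pos == 0, eps, np.where(pos == 1, 1 - eps, 0.5))
# Overconfident model: always predict the fixed string 011011011...
overconfident = np.where(pos == 0, eps, 1 - eps)
# Marginal model: always predict the overall 50-50 distribution.
marginal = np.full(len(data), 0.5)

for name, p_one in [("conditional", conditional), ("011 pattern", overconfident), ("50-50", marginal)]:
    print(f"{name:12s} {log_loss(p_one):.3f} bits/token")
# Roughly: conditional ~0.333, "011 pattern" ~5 (unboundedly worse as eps -> 0), 50-50 exactly 1.0.
```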
As I said above, this isn't just a standard information theory approach to this, it's actually how GPT-3 and other LLMs were trained.
I'm curious about Crutchfield's thing, but so far not convinced that standard information theory isn't adequate in this context.
(I think Kolmogorov complexity is also relevant to LLM interpretability, philosophically if not practically, but that's beyond the scope of this comment.)