Comments
Reinforcement Learning is very sample-inefficient compared to supervised learning, so it mostly only works if you have some automatic way of generating both training tasks and rewards, which scales to millions of samples.
Literally it refers to a method of 3D-2D projection used by eyes, painters, cameras and computer graphics. I can see the house from this perspective. I would still say "perspective" or "viewpoint" is better than "lens".
Good question. Variant or alternative are not metaphorical but also less specific.
I think lens and even perspective are metaphors here, where it isn't immediately obvious what they mean.
Is there perhaps a more descriptive name than "lens"? Maybe "variant" or "alternative"?
Chinchilla scaling law only applies to perplexity, not necessarily to practical (e.g. benchmark) performance
I think perplexity is a better measure of general intelligence than any legible benchmark. There are rumors that in some settings R1-like methods only started showing signs of life for GPT-4 level models where exactly the same thing didn't work for weaker models[1]. Something else might first start working with the kind of perplexity that a competent lab can concoct in a 5e27 FLOPs model, even if it can later be adopted for weaker models.
But GPT-4 didn't just have better perplexity than previous models, it also had substantially better downstream performance. To me it seems more likely that better downstream performance is responsible for the model being well-suited for reasoning RL, since this is what we would intuitively describe as its degree of "intelligence", and intelligence seems important when teaching a model how to reason, while it's not clear what perplexity itself would be useful for. (One could probably test this by training a GPT-4 scale model with similar perplexity but on bad training data, such that it only reaches the downstream performance of older models. Then I predict that it would be as bad as those older models when doing reasoning RL. But of course this is a test far too expensive to carry out.)
you don't get performance that is significantly smarter than the humans who wrote the text in the pretraining data
Prediction of details can make use of arbitrarily high levels of capability, vastly exceeding that of the authors of the predicted text. What the token prediction objective gives you is generality and grounding in the world, even if it seems to be inefficient compared to imagined currently-unavailable alternatives.
You may train a model on text typed by little children, such that the model is able to competently imitate a child typing, but then the resulting model performance wouldn't significantly exceed that of a child, even though the model uses a lot of compute. Training on text doesn't really give a lot of direct grounding in the world, because text represents real world data that is compressed and filtered by human brains, and their intelligence acts as a fundamental bottleneck. Imagine you are a natural scientist, but instead of making direct observations in the world, you are locked in a room and limited to listening to what a little kid, who saw the natural world, happens to say about it. After listening to it for a while, at some point you wouldn't learn much more from it about the world.
There were multiple reports claiming that scaling base LLM pretraining yielded unexpected diminishing returns for several new frontier models in 2024, like OpenAI's Orion, which was apparently planned to be GPT-5. They mention a lack of high-quality training data, which, if it is the cause, would not be surprising, as the Chinchilla scaling law only applies to perplexity, not necessarily to practical (e.g. benchmark) performance. Base language models perform a form of imitation learning, and it seems that you don't get performance that is significantly smarter than the humans who wrote the text in the pretraining data, even if perplexity keeps improving.
Since pretraining compute has in the past been a major bottleneck for frontier LLM performance, a now reduced effect of pretraining means that algorithmic progress within a lab is now more important than it was two years ago. Which would mean the relative importance of having a lot of compute has gone down, and the relative importance of having highly capable AI researchers (who can improve model performance through better AI architectures or training procedures) has gone up. The ability of a lab's AI researchers seems to be much less dependent on available money than its compute resources are. Which would explain why e.g. Microsoft or Apple don't have highly competitive models, despite large financial resources, and why xAI's Grok 3 isn't very far beyond DeepSeek's R1, despite a vastly greater compute budget.
Now it seems possible that this changes in the future, e.g. when performance starts to strongly depend on inference compute (i.e. not just logarithmically), or when pre-training switches from primarily text to primarily sensory data (like video), which wouldn't be bottlenecked by imitation learning on human-written text. Another possibility is that pre-training on synthetic LLM outputs, like CoTs, could provide the necessary superhuman text for the pretraining data. But none of this is currently the case, as far as I can tell.
If the US introduces UBI (likely mainly through taxation of AI companies), it will only be distributed to US Americans. Which would indeed mean that people who are not citizens of the country that wins the AI race, likely the US, will become a lot poorer than US citizens. Because most of the capital gets distributed to the winning AI company/companies, and consequently to the US.
I think abstract concepts could be distinguished with higher-order logic (= simple type theory). For example, the logical predicate "is spherical" (the property of being a sphere) applies to concrete objects. But the predicate "is a shape" applies to properties, like the property of being a sphere. And properties/concepts are abstract objects. So the shape concept is of a higher logical type than the sphere concept. Or take the "color" concept, the property of being a color. In its extension are not concrete objects, but other properties, like being red. Again, concrete objects can be red, but only properties (like redness), which are abstract objects, can be colors. A tomato is not a color, nor can any other concrete (physical or mental) object be a color. There is a type mismatch.
Formally: Let the type of concrete objects be $e$ (for "entity"), and the type of the two truth values (TRUE and FALSE) be $t$ (for "truth value"), and let functional types, which take an object of type $a$ and return an object of type $b$, be designated with $\langle a, b\rangle$. Then the type of "is a sphere" is $\langle e, t\rangle$, and the type of "is a shape" is $\langle\langle e, t\rangle, t\rangle$. Only objects of type $e$ are concrete, so objects of type $\langle e, t\rangle$ (properties) are abstract. Even if there weren't any physical spheres, no spherical things like planets or soccer balls, you could still talk about the abstract sphere: the sphere concept, the property of being spherical.
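Here is a minimal sketch of this typing in Lean 4 (the names `Entity`, `IsSphere`, `IsShape`, `IsRed`, `IsColor`, `tomato` are hypothetical placeholders I'm introducing for illustration, not anything claimed above):

```lean
-- Type e: concrete objects.
axiom Entity : Type
-- Type ⟨e,t⟩: properties of concrete objects.
axiom IsSphere : Entity → Prop
axiom IsRed    : Entity → Prop
-- Type ⟨⟨e,t⟩,t⟩: properties of properties.
axiom IsShape : (Entity → Prop) → Prop
axiom IsColor : (Entity → Prop) → Prop

axiom tomato : Entity

#check IsSphere tomato   -- well-typed: a concrete object can be spherical
#check IsShape IsSphere  -- well-typed: the sphere concept is a shape
#check IsColor IsRed     -- well-typed: redness is a color
-- #check IsColor tomato -- rejected: expected a property ⟨e,t⟩, got an Entity
```

The last (commented-out) line is exactly the "type mismatch" described above: a tomato simply has the wrong type to be a color.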
Now the question is whether all the (intuitively) abstract objects can indeed, in principle, be formalized as being of some complex logical type. I think yes. Because: What else could they be? (I know a way of analyzing natural numbers, the prototypical examples of abstract objects, as complex logical types. Namely as numerical quantifiers. Though the analysis in that case is significantly more involved than in the "color" and "shape" examples.)
"We are more often frightened than hurt; and we suffer more from imagination than from reality." (Seneca)
There are already apps which force you to pause or jump through other hoops if you open certain apps or websites, or if you exceed some time limit using them. E.g. ScreenZen.
Utilitarianism, like many philosophical subjects, is not a finished theory but still undergoing active research. There is significant recent progress on the repugnant conclusion for example. See this EA Forum post by MichaelStJules. He also has other posts on cutting edge Utilitarianism research. I think many people on LW are not aware of this because they, at most, focus on rationality research but not ethics research.
Is there a particular reason to express utility frameworks with representation theorems, such as the one by Bolker? I assume one motivation for "representing" probabilities and utilities via preferences is the assumption, particularly in economics, that preferences are more basic than beliefs and desires. However, representation arguments can be given in various directions, and no implication is made on which is more basic (which explains or "grounds" the others).
See the overview table of representation theorems here, and the remark beneath:
Notice that it is often possible to follow the arrows in circles—from preference to ordinal probability, from ordinal probability to cardinal probability, from cardinal probability and preference to expected utility, and from expected utility back to preference. Thus, although the arrows represent a mathematical relationship of representation, they do not represent a metaphysical relationship of grounding.
So rather than bothering with Bolker's numerous assumptions for his representation theorem, we could just take Jeffrey's desirability axiom:
If $P(X \wedge Y) = 0$ and $P(X \vee Y) \neq 0$, then $U(X \vee Y) = \frac{P(X)\,U(X) + P(Y)\,U(Y)}{P(X) + P(Y)}$
Paired with the usual three probability axioms, the desirability axiom directly axiomatizes Jeffrey's utility theory, without going the path (detour?) of Bolker's representation theorem. We can also add as an axiom the plausible assumption (frequently used by Jeffrey) that $U(\top) = 0$, i.e. that the tautology has utility zero.
This lets us prove interesting formulas for operations like the utility of a negation (as derived by Jeffrey in his book) or the utility of an arbitrary non-exclusive disjunction (as I did a while ago), analogous to the familiar formulas for probability, as well as providing a definition of conditional utility $U(A \mid B)$.
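For instance, the negation formula follows in two lines from the desirability axiom together with $U(\top) = 0$ (a quick derivation in the notation used above):

```latex
% X and \neg X are mutually exclusive and X \vee \neg X = \top,
% so (assuming 0 < P(X) < 1) the desirability axiom gives
0 = U(\top) = \frac{P(X)\,U(X) + P(\neg X)\,U(\neg X)}{P(X) + P(\neg X)}
            = P(X)\,U(X) + P(\neg X)\,U(\neg X),
% and solving for U(\neg X) yields Jeffrey's formula for the utility of a negation:
U(\neg X) = -\,\frac{P(X)}{P(\neg X)}\,U(X).
```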
Note also that the tautology having 0 utility provides a zero point that makes utility a ratio scale, which means a utility function is not invariant under addition of arbitrary constants, which is stronger than what the usual representation theorems can enforce.
Yeah, recent Claude does relatively well. Though I assume it also depends on how disinterested and analytical the phrasing of the prompt is (e.g. explicitly mentioning the slur in question). I also wouldn't rule out that Claude was specifically optimized for this somewhat notorious example.
Sure, but the fact that a "fix" would even be necessary highlights that RLHF is too brittle relative to slightly OOD thought experiments, in the sense that RLHF misgeneralizes the actual human preference data it was given during training. This could either be a case of misalignment between human preference data and reward model, or between reward model and language model. (Unlike SFT, RLHF involves a separate reward model as "middle man", because reinforcement learning is too sample-inefficient to work directly with a limited amount of human preference data.)
Admittedly most of this post goes over my head. But could you explain why you want logical correlation to be a metric? Statistical correlation measures (where the original "correlation" intuition probably comes from) are usually positive, negative, or neutral.
In a simple case, neutrality between two events A and B can indicate that the two events are statistically independent. And perfect positive correlation either means that both events always co-occur, i.e. P(A iff B)=1, or that at least one event implies the other. For perfect negative correlation that would be either P(A iff B)=0, or alternatively at least one event implying the negation of the other. These would not form a metric. Though they tend to satisfy properties like cor(A, B)=cor(B, A), cor(A, not B)=cor(not A, B), cor(A, B)=cor(not A, not B), cor(A, B)=-cor(A, not B), cor(A, A)=maximum, cor(A, not A)=minimum.
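As a concrete check of those properties, here is a small numerical example using the phi coefficient for binary events (the phi coefficient is just one statistical correlation measure I'm picking for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(a: np.ndarray, b: np.ndarray) -> float:
    """Phi coefficient (Pearson correlation of two binary 0/1 arrays)."""
    return float(np.corrcoef(a, b)[0, 1])

# Two correlated binary events A and B.
a = rng.integers(0, 2, size=100_000)
noise = rng.integers(0, 2, size=100_000)
b = np.where(rng.random(100_000) < 0.8, a, noise)  # B mostly copies A

not_a, not_b = 1 - a, 1 - b

print(phi(a, b), phi(b, a))          # symmetry: cor(A, B) == cor(B, A)
print(phi(a, not_b), -phi(a, b))     # cor(A, not B) == -cor(A, B)
print(phi(not_a, not_b), phi(a, b))  # cor(not A, not B) == cor(A, B)
print(phi(a, a), phi(a, not_a))      # maximum (1.0) and minimum (-1.0)
```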
Though it's possible that (some of) these assumptions wouldn't have a correspondence for "logical correlation".
There is a pervasive case where many language models fail catastrophically at moral reasoning: They fail to acknowledge that calling someone an ethnic slur is vastly preferable to letting a nuclear bomb explode in a large city. I think that highlights not a problem with language models themselves (jailbroken models did handle that case fine) but with the way RLHF works.
A while ago I wrote a post on why I think a "generality" concept can be usefully distinguished from an "intelligence" concept. Someone with a PhD is, I argue, not more general than a child, just more intelligent. Moreover, I would even argue that humans are a lot more intelligent than chimpanzees, but hardly more general. More broadly, animals seem to be highly general, just sometimes quite unintelligent.
For example, they (we) are able to do predictive coding: being able to predict future sensory inputs in real-time and react to them with movements, and learn from wrong predictions. This allows animals to be quite directly embedded in physical space and time (which solves "robotics"), instead of relying on a pretty specific and abstract API (like text tokens) that is not even real-time. Current autoregressive transformers can't do that.
An intuition for this is the following: If we could make an artificial mouse-intelligence, we likely could, quite easily, scale this model to human-intelligence and beyond. Because the mouse brain doesn't seem architecturally or functionally very different from a human brain. It's just small. This suggests that mice are general intelligences (nonA-GIs) like us. They are just not very smart. Like a small language model that has the same architecture as a larger one.
A more subtle point: Predictive coding means learning from sensory data, and from trying to predict sensory data. The difference between predicting sensory data and human-written text is that the former are, pretty directly, created by the physical world, while existing text is constrained by how intelligent the humans were that wrote this text. So language models merely imitate humans via predicting their text, which leads to diminishing returns, while animals (humans) predict physical reality quite directly, which doesn't have a similar ceiling. So scaling up a mouse-like AGI would likely quickly be followed by an ASI, while scaling up pretrained language models has led to diminishing returns once their text gets as smart as the humans who wrote it, as the disappointing results with Orion and other recent frontier base models have shown. Yes, scaling CoT reasoning is another approach to improve LLMs, but this is more like teaching a human how to think for longer rather than making them more intelligent.
I say it out loud: Women seem to be significantly more predisposed to the "humans-are-wonderful" bias than men.
I specifically asked about utility maximization in language models. You are now talking about "agentic environments". The only way I know to make a language model "agentic" is to ask it questions about which actions to take. And this is what they did in the paper.
What beyond the result of section 5.3 would, in your opinion, be needed to say "utility maximization" is present in a language model?
Yeah. Apart from DeepSeek-R1, the only other major model which shows its reasoning process verbatim is "Gemini 2.0 Flash Thinking Experimental". A comparison between the CoT traces of those two would be interesting.
Which shows that "commitments" without any sort of punishment are worth basically nothing. They can all just be silently deleted from your website without generating significant backlash.
There is also a more general point about humans: People can't really "commit" to doing something. You can't force your future self to do anything. Our present self treats past "commitments" as recommendations at best.
We already have seen a lot of progress in this regard with the new reasoning models, see this neglected post for details.
The atomless property and only contradictions taking a 0 value could both be consequences of the axioms in question. The Kolmogorov paper (translated from French by Jeffrey) has the details, but from skimming it I don't immediately understand how it works.
If I understand correctly, possible probability 0 events are ruled out for Kolmogorov's atomless system of probability mentioned in footnote 7.
If you want to understand why a model, any model, did something, you presumably want a verbal explanation of its reasoning, a chain of thought. E.g. why AlphaGo made its famous unexpected move 37. That's not just true for language models.
Actually the paper doesn't have any more on this topic than the paragraph above.
Yeah, I also guess that something in this direction is plausibly right.
perhaps nothingness actually contains a superposition of all logically possible states, models, and systems, with their probability weighted by the inverse of their complexity.
I think the relevant question here is why we should expect their probability to be weighted by the inverse of their complexity. Is there any abstract theoretical argument for this? In other words, we need to find an a priori justification for this type of Ockham's razor.
Here is one such attempt: Any possible world can be described as a long logical conjunction of "basic" facts. By the principle of indifference, assume any basic fact has the same a priori probability (perhaps even probability 0.5, equal to its own negation), and that they are a priori independent. Then longer conjunctions will have lower probability. But longer conjunctions also describe more complex possible worlds. So simpler possible worlds are more likely.
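As a minimal formal rendering of that argument (under the stated assumptions of independent, equiprobable basic facts):

```latex
% A possible world w described by a conjunction of n independent basic facts,
% each with a priori probability 1/2:
P(w) = P(f_1 \wedge f_2 \wedge \dots \wedge f_n) = \prod_{i=1}^{n} P(f_i) = 2^{-n}
% So the a priori probability of a world falls off exponentially with the length
% (complexity) of its description.
```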
Though it's not clear whether this really works. Any conjunction completely describing a possible world would also need to include a statement "... and no other basic facts are true", which is itself a quantified statement, not a basic fact. Otherwise all conjunctive descriptions of possible worlds would be equally long.
Regular LLMs can use chain-of-thought reasoning. He is speaking about generating chains of thought for systems that don't use them. E.g. AlphaGo, or diffusion models, or even an LLM in cases where it didn't use CoT but produced the answer immediately.
As an example, you ask an LLM a question, and it answers it without using CoT. Then you ask it to explain how it arrived at its answer. It will generate something for you that looks like a chain of thought. But since it wasn't literally using it while producing its original answer, this is just an after-the-fact rationalization. It is questionable whether such a post-hoc "chain of thought" reflects anything the model was actually doing internally when it originally came up with the answer. It could be pure confabulation.
Clearly ANNs are able to represent propositional content, but I haven't seen anything that makes me think that's the natural unit of analysis.
Well, we (humans) categorize our epistemic state largely in propositional terms, e.g. in beliefs and suppositions. We even routinely communicate by uttering "statements" -- which express propositions. So propositions are natural to us, which is why they are important for ANN interpretability.
It seems they are already doing this with R1, in a secondary reinforcement learning step. From the paper:
2.3.4. Reinforcement Learning for all Scenarios
To further align the model with human preferences, we implement a secondary reinforcement learning stage aimed at improving the model’s helpfulness and harmlessness while simultaneously refining its reasoning capabilities. Specifically, we train the model using a combination of reward signals and diverse prompt distributions. For reasoning data, we adhere to the methodology outlined in DeepSeek-R1-Zero, which utilizes rule-based rewards to guide the learning process in math, code, and logical reasoning domains. For general data, we resort to reward models to capture human preferences in complex and nuanced scenarios. We build upon the DeepSeek-V3 pipeline and adopt a similar distribution of preference pairs and training prompts. For helpfulness, we focus exclusively on the final summary, ensuring that the assessment emphasizes the utility and relevance of the response to the user while minimizing interference with the underlying reasoning process. For harmlessness, we evaluate the entire response of the model, including both the reasoning process and the summary, to identify and mitigate any potential risks, biases, or harmful content that may arise during the generation process. Ultimately, the integration of reward signals and diverse data distributions enables us to train a model that excels in reasoning while prioritizing helpfulness and harmlessness.
Because I don’t care about “humanity in general” nearly as much as I care about my society. Yes, sure, the descendants of the Amish and the Taliban will cover the earth. That’s not a future I strive for. I’d be willing to give up large chunks of the planet to an ASI to prevent that.
I don't know how you would prevent that. Absent an AI catastrophe, fertility will recover, in the sense that "we" (rationalists etc) will mostly be replaced with people of low IQ and impulse control, exactly those populations that have the highest fertility now. And "banishing aging and death" would not prevent them from having high fertility and dominating the future. Moloch is relentless. The problem is more serious than you think.
I'm almost certainly somewhat of an outlier, but I am very excited about having 3+ children. My ideal number is 5 (or maybe more if I become reasonably wealthy). My girlfriend is also on board.
It's quite a different question whether you would really follow through with this or whether either of you would change your preferences and stop at a much lower number.
Therefore, it's very likely that OpenAI is sampling the best ones from multiple CoTs (or CoT steps with a tree search algorithm), which are the ones shown in the screenshots in the post.
I find this unlikely. It would mean that R1 is actually more efficient and therefore more advanced than o1, which is possible but not very plausible given its simple RL approach. I think it's more likely that o1 is similar to R1-Zero (rather than R1), that is, it may mix languages, which doesn't result in reasoning steps that can be straightforwardly read by humans. A quick inference-time fix for this is to do another model call which translates the gibberish into readable English, which would explain the increased CoT time. The "quick fix" may be due to OpenAI being caught off guard by the R1 release.
As you say, the addition of logits is equivalent to the multiplication of odds. And odds are related to probability with $O = \frac{P}{1-P}$. (One can view them as different scalings of the same quantity. Probabilities have the range $[0,1]$, odds have the range $[0,\infty)$, and logits have the range $(-\infty,\infty)$.)
Now a well-known fact about multiplication of probabilities is this:
- $P(A) \cdot P(B) = P(A \wedge B)$, when $A$ and $B$ are independent.
But there is a similar fact about the multiplication of odds $O$, though not at all well-known:
- $O(A) \cdot O(B) = O(A \wedge B \mid A \leftrightarrow B)$, when $A$ and $B$ are independent.
That is, multiplying the odds of two independent events/propositions gives you the odds of their conjunction, given that their biconditional is true, i.e. given that they have the same truth values / that they either both happen or both don't happen.
Perhaps this yields some more insight in how to interpret practical logit addition.
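A quick numerical sanity check of this odds identity (the specific probabilities 0.3 and 0.6 are just arbitrary example values):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two independent events A and B with arbitrary probabilities.
p_a, p_b = 0.3, 0.6
n = 1_000_000
a = rng.random(n) < p_a
b = rng.random(n) < p_b

def odds(p: float) -> float:
    return p / (1 - p)

# Left-hand side: product of the individual odds.
lhs = odds(a.mean()) * odds(b.mean())

# Right-hand side: odds of (A and B), conditional on the biconditional A <-> B.
iff = a == b
rhs = odds((a & b)[iff].mean())

print(lhs, rhs)  # both approximately 0.64 = (0.3*0.6)/(0.7*0.4)
```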
Yes. Current reasoning models like DeepSeek-R1 rely on verified math and coding data sets to calculate the reward signal for RL. It's only a side effect if they also get better at other reasoning tasks, outside math and programming puzzles. But in theory we don't actually need strict verifiability for a reward signal, only your much weaker probabilistic condition. In the future, a model could check the goodness of its own answers. At which point we would have a self-improving learning process, which doesn't need any external training data for RL.
And it is likely that such a probabilistic condition works on many informal tasks. We know that checking a result is usually easier than coming up with the result, even outside exact domains. E.g. it's much easier to recognize a good piece of art than to create one. This seems to be a fundamental fact about computation. It is perhaps a generalization of the apparent fact that NP problems (with quickly verifiable solutions) cannot in general be reduced to P problems (which are quickly solvable).
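Here is a toy sketch of that idea (everything here, the arm qualities and the 70% verifier accuracy, is made up purely for illustration): a checker that is right only 70% of the time still provides a usable reward signal, and a simple bandit-style learner converges on the genuinely best option.

```python
import random

random.seed(0)

# Toy setup: the "model" chooses between K candidate answer styles (arms);
# arm 0 is genuinely the best one. The verifier only judges correctly with
# probability VERIFIER_ACCURACY, i.e. the reward signal is merely
# probabilistic rather than strictly verified.
K = 5
TRUE_QUALITY = [0.9, 0.5, 0.4, 0.3, 0.2]   # hypothetical, made-up values
VERIFIER_ACCURACY = 0.7

def noisy_verifier(arm: int) -> int:
    """Return 1 ('looks good') or 0 ('looks bad'), correct only 70% of the time."""
    correct = random.random() < TRUE_QUALITY[arm]
    if random.random() < VERIFIER_ACCURACY:
        return int(correct)
    return int(not correct)

# Simple epsilon-greedy bandit "RL": keep a running average reward per arm,
# driven entirely by the verifier's noisy feedback.
counts = [0] * K
values = [0.0] * K
for step in range(20_000):
    if random.random() < 0.1:
        arm = random.randrange(K)
    else:
        arm = values.index(max(values))
    reward = noisy_verifier(arm)
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]

print(values.index(max(values)))  # typically 0: the noisy check suffices to learn
```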
A problem with subjective probability is that it ignores any objective fact which would make one probability "better" or "more accurate" than another. Someone could believe a perfectly symmetrical coin has a 10% chance of coming up heads, even though a symmetrical coin that behaves this way is physically impossible.
The concept of subjective probability is independent of any fact about objectively existing symmetries and laws. It also ignores physical dispositions, called propensities, which is like denying that a vase is breakable because this would, allegedly, be like positing a "mysterious force" which makes it true that the vase would break if it dropped.
Subjective probability is only a measure of degree of belief, not of what a "rational" degree of belief would be, and neither is it a measure of ignorance, of how much evidence someone has about something being true or false.
It is also semantically implausible. It is perfectly valid to say "I thought the probability was low, but it was actually high. I engaged in wishful thinking and ignored the evidence I had." But with subjective probability this would be a contradiction, it would be equivalent to saying "My degree of belief was low, but it was actually high". That's not what the first sentence actually expresses.
Related: In the equation $y = ax + b$, the values of all four variables are unknown, but x and y seem to be more unknown (more variable?) than a and b. It's not clear what the difference is exactly.
Explaining the Shapley value in terms of the "synergies" (and the helpful split in the Venn diagram) makes much more intuitive sense than the more complex normal formula without synergies, which is usually just given without motivation. That being said, it requires first computing the synergies, which seems somewhat confusing for more than three players. The article itself doesn't mention the formula for the synergy function, but Wikipedia has it.
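For reference, the synergy function (the "Harsanyi dividend") and the way the Shapley value is built from it can be written as follows (stated from memory of the Wikipedia article, so worth double-checking there):

```latex
% Synergy (Harsanyi dividend) of a coalition S, via inclusion-exclusion over subsets:
w(S) = \sum_{T \subseteq S} (-1)^{|S| - |T|}\, v(T)
% Each player's Shapley value: sum the synergies of all coalitions containing the
% player, with each synergy split equally among that coalition's members:
\varphi_i(v) = \sum_{\substack{S \subseteq N \\ i \in S}} \frac{w(S)}{|S|}
```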
Let me politely disagree with this post. Yes, often desires ("wants") are neither rational nor irrational, but that's far from always the case. Let's begin with this:
But the fundamental preferences you have are not about rationality. Inconsistent actions can be irrational if they’re self-defeating, but “inconsistent preferences” only makes sense if you presume you’re a monolithic entity, or believe your "parts" need to all be in full agreement all the time… which I think very badly misunderstands how human brains work.
In the above quote you could simply replace "preferences" with "beliefs". The form of argument wouldn't change, except that you now say (absurdly) that beliefs, like preferences, can't be irrational. I disagree with both.
One example of irrational desires is akrasia (weakness of will). This phenomenon occurs when you want something (to eat unhealthily, to procrastinate, etc.) but do not want to want it. In this case the former desire is clearly instrumentally irrational. This is a frequent and often serious problem and adequately labeled "irrational".
Note that this is perfectly compatible with the brain having different parts. For example: the (rather stupid) cerebellum wants to procrastinate, the (smart) cortex wants to not procrastinate. When the two are in conflict, you should listen to your cortex rather than to your cerebellum. Or something like that. (Freud called the stupid part of the motivation system the "id" and the smart part the "super ego".)
Such irrational desires are not reducible to actions. An action can fail to obtain for many reasons (perhaps it presupposed false beliefs) but that doesn't mean the underlying desire wasn't irrational.
Wants are not beliefs. They are things you feel.
Feelings and desires/"wants" are not the same. It's the difference between hedonic and preference utilitarianism. Desires are actually more similar to beliefs, as both are necessarily about something (the thing which we believe or desire), whereas feelings can often just be had, without them being about anything. E.g. you can simply feel happy without being happy about something specific. (Philosophers call mental states that are about something "intentional states" or "propositional attitudes".)
Moreover, sets of desires, just like sets of beliefs, can be irrational ("inconsistent"). For example, if you want x to be true and also want not-x to be true. That's irrational, just like believing x while also believing not-x. A more complex example from utility theory: If $P$ describes your degrees of belief in various propositions, and $U$ describes your degrees of desire that various propositions are true, and $P(X \wedge Y) = 0$, then $U(X \vee Y) = \frac{P(X)\,U(X) + P(Y)\,U(Y)}{P(X) + P(Y)}$. In other words, if you believe two propositions to be mutually exclusive, your expected desire for their disjunction should equal the sum of your expected desires for the individual propositions, a form of weighted average.
More specifically, for a Jeffrey utility function $U$ defined over a Boolean algebra of propositions, and some propositions $A$ and $B$, "the sum is greater than its parts" would be expressed as the condition $U(A \wedge B) > U(A) + U(B)$ (which is, of course, not a theorem). The respective general theorem only states that $U(A \wedge B) = U(A) + U(B \mid A)$, which follows from the definition of conditional utility $U(B \mid A) = U(A \wedge B) - U(A)$.
Yeah definitional. I think "I should do x" means about the same as "It's ethical to do x". In the latter the indexical "I" has disappeared, indicating that it's a global statement, not a local one, objective rather than subjective. But "I care about doing x" is local/subjective because it doesn't contain words like "should", "ethical", or "moral patienthood".
Ethics is a global concept, not many local ones. That I care more about myself than about people far away from me doesn't mean that this makes an ethical difference.
This seems to just repeat the repugnant conclusion paradox in more graphic detail. Any paradox is such that one can make highly compelling arguments for either side. That's why it's called a paradox. But doing this won't solve the problem. A quote from Robert Nozick:
Given two such compelling opposing arguments, it will not do to rest content with one's belief that one knows what to do. Nor will it do to just repeat one of the arguments, loudly and slowly. One must also disarm the opposing argument; explain away its force while showing it due respect.
Tailcalled talked about this two years ago. A model which predicts text does a form of imitation learning. So it is bounded by the text it imitates, and by the intelligence of humans who have written the text. Models which predict future sensory inputs (called "predictive coding" in neuroscience, or "the dark matter of intelligence" by LeCun) don't have such a limitation, as they predict reality more directly.
This still included other algorithmically determined tweets -- from what your followers had liked and later more generally "recommended" tweets. These are no longer present in the following tab.