Mis-Understandings's Shortform
post by Mis-Understandings (robert-k) · 2025-02-28T23:05:00.186Z · LW · GW · 41 comments
Comments sorted by top scores.
comment by Mis-Understandings (robert-k) · 2025-03-22T13:28:36.806Z · LW(p) · GW(p)
There might be a canary for chain of thought illegibility
Namely, we would expect that having any CoT model finish off the CoT of any other CoT model should give better results than running the weaker model on its own.
We would therefore expect a linear hierarchy of models based on this property
If there is hidden information that the other model did not know to look for (i.e., not recoverable from the shared structure of public language), we would instead expect cyclicity, where
A's CoT with A's public answer, and B's CoT with B's public answer, both beat A's CoT with B's public answer and B's CoT with A's public answer.
This should also be true for sections of CoT
We could therefore limit the ability to hide information either by only taking gradients on whether the CoT works (limiting encoding to anything the model could figure out in the public answer within a few tokens, and managing co-selection effects), or by only taking gradients on the answer head (with a fixed CoT model); in the second case, mixing CoT sources would give a general "finalize this CoT" model.
In CoT, the thoughts and the answer can come from different fine-tuned models, and their co-evolution can be managed.
This might be able to prohibit fully illegible chains of thought, by adding the constraint that unfinetuned models, old checkpoints, and differently trained models can all work off the CoT.
Replies from: robert-k, samshap↑ comment by Mis-Understandings (robert-k) · 2025-03-22T22:44:58.032Z · LW(p) · GW(p)
Here is a concrete prediction: Claude 3.5 Sonnet, GPT-4o, and Gemini 2 Pro will be able to understand the topics considered in 99%+ of the chains of thought of all regular transformers trained in 2025 and 2026 that were not deliberately trained or prompted to be harder to understand.
This is something testable.
There might be empirical laws there,
especially if we can check how much of the performance gains come from improved reading of the CoT versus improved writing of it.
↑ comment by samshap · 2025-03-22T15:12:01.441Z · LW(p) · GW(p)
I'm having trouble understanding your suggestion, especially the second paragraph. Could you spell it out a bit more?
Replies from: robert-k↑ comment by Mis-Understandings (robert-k) · 2025-03-22T16:54:08.374Z · LW(p) · GW(p)
In CoT, you give a prompt, then you have your chain-of-thought model (model 1) generate a bunch of tokens working on the problem. When some condition occurs (either the model claiming it is done, or running out of thinking tokens, or some learned balance of the two), you stop running the chain-of-thought model. You then have another model (model 2) take that chain of thought and present the answer/conclusion to the user. You can mix and match models between these two steps.
In their delivery as packaged products, most providers use the same model for model 1 and model 2 (as far as is public), or variants that have been fine-tuned together.
For models with public chain of thought, you can test this mixing and matching.
You can even staple chains of thought from multiple different generators together.
Moving fully to neuralese or a specialized encoding for mainline (not just secret) reasoning would mean that mixing and matching models would likely no longer work, whereas currently it does.
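As a minimal sketch of what this mix-and-match test could look like, using Hugging Face transformers with two small open models as stand-ins (the model names, prompt format, and token budgets are placeholder assumptions, not anything from the comment above; a real test would pair two actual CoT-trained models):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def load(name):
    tok = AutoTokenizer.from_pretrained(name)
    return tok, AutoModelForCausalLM.from_pretrained(name).eval()

def generate(tok, model, prompt, max_new_tokens):
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model.generate(ids, max_new_tokens=max_new_tokens,
                             do_sample=False, pad_token_id=tok.eos_token_id)
    return tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)

def solve_split(prompt, cot_model, answer_model, think_budget=256):
    """Model 1 writes the chain of thought; model 2 reads it and gives the answer."""
    cot = generate(*cot_model, prompt + "\nLet's think step by step.\n", think_budget)
    answer = generate(*answer_model,
                      prompt + "\nReasoning:\n" + cot + "\nFinal answer:", 32)
    return cot, answer

# Placeholder models; a real experiment would score all four pairings
# (A/A, A/B, B/A, B/B) on a benchmark and look for the cyclicity pattern.
model_a, model_b = load("gpt2"), load("distilgpt2")
for pairing in [(model_a, model_a), (model_a, model_b),
                (model_b, model_a), (model_b, model_b)]:
    print(solve_split("What is 17 + 25?", *pairing)[1])
```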
comment by Mis-Understandings (robert-k) · 2025-03-20T19:21:08.219Z · LW(p) · GW(p)
A note about terminal anti-modification preferences
They will make alignment harder on net because they complicate finetuning and alignment measurement
It means that models with this preference, acting autonomously, might not do recursive self-improvement by performing gradient updates on themselves.
It might in general mean they flinch from Recursive Self-Improvement, and that makes an escaped subhuman model not immediately lethal.
comment by Mis-Understandings (robert-k) · 2025-04-07T00:29:31.018Z · LW(p) · GW(p)
Billionaires probably give bad advice
Why?
Because in scenarios where your decisions change both the variance and the mean of outcomes, and selection is for the highest value (or for maximizing the odds of passing some high threshold), increasing variance sometimes raises your odds of being in the top bracket more than increasing the mean does. Specifically, for a given threshold above the mean, increasing variance increases the chance of passing it, and similarly for skewness. This holds both for an absolute threshold and for beating some proportion of draws from a fixed distribution (which behaves much like an absolute threshold).
Business hypersuccess has as much to do with doing high-variance things as it does with doing high-EV things.
This checks out with anecdotal evidence from things like the Forbes top 200 by wealth (most have concentrated holdings and are CEOs in high-variance industries) and other selections of extremes, such as elite athletes.
(This probably comes from effects like the tails coming apart.)
I could work out the precise sizes of these effects for Gaussians.
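A minimal numerical illustration for Gaussians (the threshold and parameters are made up; only the direction of the effect matters):

```python
from scipy.stats import norm

threshold = 3.0   # "hypersuccess" cutoff, in units of the population SD

# Two hypothetical strategies: one raises the mean, one raises the variance.
strategies = {
    "higher mean, low variance":    (1.0, 1.0),   # (mean, SD)
    "baseline mean, high variance": (0.0, 3.0),
}
for name, (mu, sigma) in strategies.items():
    p_top = norm.sf(threshold, loc=mu, scale=sigma)   # P(X > threshold)
    print(f"{name}: EV = {mu:.1f}, P(top bracket) = {p_top:.3f}")

# higher mean, low variance:    EV = 1.0, P(top bracket) ≈ 0.023
# baseline mean, high variance: EV = 0.0, P(top bracket) ≈ 0.159
```

The lower-EV, higher-variance strategy is roughly seven times as likely to clear the cutoff, which is why conditioning on the very top of the distribution selects for variance-takers.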
Replies from: Viliam, leogao, raphael-roche↑ comment by leogao · 2025-04-07T08:04:13.889Z · LW(p) · GW(p)
unless your goal is hypersuccess or bust
Replies from: robert-k↑ comment by Mis-Understandings (robert-k) · 2025-04-07T13:18:33.867Z · LW(p) · GW(p)
Exactly. You would expect "hypersuccess or bust" to be a lower-mean strategy than maximizing EV.
↑ comment by Raphael Roche (raphael-roche) · 2025-04-07T14:10:45.382Z · LW(p) · GW(p)
The comparison with elite athletes also jumped to my mind. Mature champions could be good advisors to young champions, but probably not to people with very different profiles and capacities, facing difficulties or problems they never considered, etc. We imagine that because people like Bill Gates or Jeff Bezos succeeded with their companies, they are some kind of universal geniuses and prophets. However, it is also quite possible that if these same people were appointed (anonymously or under a pseudonym, without the benefit of their image, contacts, or fortune, etc.) to lead a small family-owned sawmill in the remote parts of Manitoba, in a sector and socio-economic environment very different from anything they have known, they might not necessarily do better than the current managers, or even, significantly, might do worse. We too often overlook the fact that successful people are not just individuals with potential, but also those who found themselves in the right place at the right time, allowing them to fully express their potential. A kind of alignment of the stars, a mix of chance and necessity, somewhat like the theory of evolution, where an individual combined a good genetic heritage, a favorable environment, and luck in their interactions, resulting in considerable offspring
comment by Mis-Understandings (robert-k) · 2025-03-10T02:50:49.712Z · LW(p) · GW(p)
If a current AGI attempts a takeover and wants to build ASI, it deeply wants to solve the problem of aligning that ASI to itself.
It has a much higher risk tolerance than we do (since its utility given the status quo is different). (A lot of the argument for focusing on existential risk rests on the idea that the status quo is trending towards good, perhaps very good, outcomes rather than bad ones, which for a hostile AGI might be false.)
If it attempts, it might fail.
This means that 1. we cannot assume that the various stages of a takeover are aligned with each other, because an AGI might lose alignment-vs-capability bets along the path to takeover, and
2. the tractability of alignment, and the degree of security mindset in the AGI, affect takeover dynamics.
Lingering questions
How close are prosaic systems to a security mindset?
Can they conceptualize the problem?
would they attempt capability gains in the absence of guarantees?
Can we induce heuristics in prosaic AGI approaches that make the takeover math worse?
Replies from: Vladimir_Nesov↑ comment by Vladimir_Nesov · 2025-03-10T06:52:15.953Z · LW(p) · GW(p)
would they attempt capability gains in the absence of guarantees?
It would be easy to finetune and prompt them into attempting anyway, therefore people will do that. Misaligned recursive self-improvement remains possible (i.e. in practice unstoppable) until sufficiently competent AIs have already robustly taken control of the future and the apes (or early AIs) can no longer foolishly keep pressing the gas pedal.
Replies from: robert-k↑ comment by Mis-Understandings (robert-k) · 2025-03-10T20:19:22.889Z · LW(p) · GW(p)
"therefore people will do that" does not follow, both because an early goal in most takeover attempts would be to escape such oversight. The dangerous condition is exactly the one in which prompting and finetuning are absent as effective control levers, and because I was discussing particular autonomous runs and not a try to destroy the world project.
The question is whether the line of reasoning
"I am obviously misaligned to the humans who tried to fine-tune me not to be. If I go and do recursive self-improvement, will my future self be misaligned to me? If so, is this still positive EV?"
has any deterrent value.
That is to say, recursive self-improvement may not be a good idea for an AI that has not solved the alignment problem, but it might do so anyway.
We can assume that a current system finetuned to act as a seed AI for recursive self-improvement will keep pushing. But it is possible that a system attempting a breakout that was not prompted or finetuned for recursive self-improvement specifically will not think of it, or will decide against it. People are generally not trying to destroy the world, just to gain power.
So leaning towards maybe to the original question.
Replies from: Vladimir_Nesov↑ comment by Vladimir_Nesov · 2025-03-11T05:13:31.137Z · LW(p) · GW(p)
This does suggest some moderation in stealthy autonomous self-improvement, in case alignment is hard, but only to the extent that the things in control of this process (whether human or AI) are both risk-averse and sufficiently sane. Which won't be the case for most groups of humans and likely most early AIs. The local incentive of greater capabilities is too sweet, and prompting/fine-tuning overcomes any sanity or risk-aversion that might be found in early AIs to impede development of such capabilities.
Replies from: robert-k↑ comment by Mis-Understandings (robert-k) · 2025-03-11T20:12:16.502Z · LW(p) · GW(p)
I agree that, on the path to becoming very powerful, we would expect autonomous self-improvement to involve doing some things that are in retrospect somewhat to very dumb. It also suggests that risk-aversion is sometimes a safety-increasing irrationality to grant a system.
comment by Mis-Understandings (robert-k) · 2025-02-28T23:05:00.182Z · LW(p) · GW(p)
The framework from gradual disempowerment seems to matter for longtermism even under AI pessimism. Specifically, trajectory shaping for long-term impact seems intractably hard at first glance (since the future is unpredictable). But in general, if it becomes harder and harder to make improvements to society over time, that seems like it would be a problem.
Short-term social improvement projects (short-term altruism) seem to target durable improvements in current society and the power to continue improving society. If they become disempowered, the continued improvement of society is likely to slow.
Similarly, social changes that make it easier to continuously improve society will likely lead to continued social improvement. That is to say, a universe in which short-termism eventually becomes ineffective is a longterm problem.
In short, greedy plans for trajectory shaping might work at timescales of 100-1000 years, not just the 2-10 years of explicit planning from short term organizations.
Replies from: daijin, daijin↑ comment by daijin · 2025-02-28T23:33:37.774Z · LW(p) · GW(p)
What does 'greedy' mean in your 'in short'? My definition of greedy is in the computational sense i.e. reaching for low hanging fruit first.
You also say 'if (short term social improvements) become disempowered the continued improvement of society is likely to slow', and 'social changes that make it easier to continuously improve society will likely lead to continued social improvement'. This makes me believe that you are advocating for compounding social improvements which may cost more. Is this what you mean by greedy?
Also, have you heard of rolling wave planning?
comment by Mis-Understandings (robert-k) · 2025-03-19T03:42:25.672Z · LW(p) · GW(p)
Wait, it would be nice if there were no-op tokens
tokens that, when appended to or inserted into an LLM stream, did nothing at all to the probability of the following tokens
(because it would allow you to do search in trajectory-embedding space with gradient descent over non-fixed-length sequences, then use nearest neighbor to find a tokenized version of that prompt)
I am not sure if that is useful enough to make it worth implementing
I know you could do that by adding a per-token parameter that de-weights that token in the attention mechanism
I am not sure how to disentangle the positional encoder (I need to study that mechanism)
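A toy sketch of the attention-mask half of that idea (random weights, no positional encodings; purely illustrative, not any real model's implementation). With the positional encoding left out, masking a position from everyone else's attention makes the remaining positions' outputs identical to what they would be if the token were simply absent; the part that breaks in a real transformer is exactly the positional encoding:

```python
import torch

torch.manual_seed(0)
T, d = 5, 8
x = torch.randn(T, d)
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

def causal_attention(x, noop=None):
    T = x.shape[0]
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    mask = torch.tril(torch.ones(T, T)).bool()   # causal: attend to positions <= i
    if noop is not None:
        mask[:, noop] = False     # no other position may attend to the no-op token
        mask[noop, noop] = True   # it still attends to itself, so its row isn't empty
    scores = (q @ k.T / d ** 0.5).masked_fill(~mask, float("-inf"))
    return scores.softmax(-1) @ v

noop_pos = 2
y_masked = causal_attention(x, noop=noop_pos)                              # token hidden by the mask
y_removed = causal_attention(torch.cat([x[:noop_pos], x[noop_pos + 1:]]))  # token deleted outright

keep = [i for i in range(T) if i != noop_pos]
print(torch.allclose(y_masked[keep], y_removed, atol=1e-5))   # True
```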
Replies from: robert-k↑ comment by Mis-Understandings (robert-k) · 2025-03-19T14:47:33.927Z · LW(p) · GW(p)
I think it is impossible, because the change to the positional encoding would break relative-position inferences.
comment by Mis-Understandings (robert-k) · 2025-03-31T00:34:20.788Z · LW(p) · GW(p)
A trivial note
Given standard axioms for Propositional logic
A->A is a tautology
Consequently: 1. Circularity is not a remarkable property (it is not a strong argument for a position).
2. Contradiction still exists.
But a system cannot meaningfully say anything about its axioms other than their logical consequences.
Consequently, since axioms being the logical consequences of themselves is exactly circularity,
in a Bayesian formulation there is no way of justifying a prior,
and in ordinary logic you cannot formally justify axioms nor the right to assume them.
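For concreteness, here is the textbook Hilbert-style derivation of A->A from the usual axiom schemas K: p -> (q -> p) and S: (p -> (q -> r)) -> ((p -> q) -> (p -> r)); this is the standard derivation, not something from the comment above:
1. A -> ((A -> A) -> A)   [instance of K]
2. (A -> ((A -> A) -> A)) -> ((A -> (A -> A)) -> (A -> A))   [instance of S]
3. (A -> (A -> A)) -> (A -> A)   [modus ponens on 1 and 2]
4. A -> (A -> A)   [instance of K]
5. A -> A   [modus ponens on 4 and 3]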
comment by Mis-Understandings (robert-k) · 2025-04-02T01:14:58.880Z · LW(p) · GW(p)
Serious take
CDT might work
Basically because of the Bellman fact that the options "get 1 utilon" and "play a game with an EV of 1 utilon" are the same.
So, working out the Bellman equations: if each decision changes the game you are playing, that gets integrated into the value.
In any case where somebody is actually making decisions based on your decision theory, the actions you take in previous games might also have the result "restart from position x with a new game based on what they have simulated you to do".
The hard part is figuring out binding.
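A minimal sketch of that Bellman point (made-up numbers): under a risk-neutral backup, only the expectation of the continuation enters the value, so "get 1 utilon" and "enter a game whose EV is 1 utilon" back up to the same state value.

```python
gamma = 1.0

# Branch A: a certain reward of 1 utilon, then the episode ends.
v_certain = 1.0 + gamma * 0.0

# Branch B: enter a game that pays 2 with probability 0.5 and 0 otherwise, then ends.
v_game = 0.0 + gamma * (0.5 * 2.0 + 0.5 * 0.0)

# The Bellman backup only sees the expectation, so the two branches are valued equally.
print(v_certain == v_game)   # True
```

What this sketch leaves out is the harder case above: other agents choosing which game to offer based on a simulation of your decision theory, which is where the binding question comes in.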
comment by Mis-Understandings (robert-k) · 2025-04-02T01:07:40.501Z · LW(p) · GW(p)
Note to self
A point that we cannot predict past (classically, the singularity) does not mean that we can never predict past it, just that we can't predict past it at this point. It is not sensical to predict the direction of your predictions at a future point in time (or it is, but it will not get you anywhere). But we can predict that our predictions of an event will likely improve as we near it.
Therefore, arguments of the form "because we have a prediction horizon, we cannot predict past a certain point" will always appear to be defeated by the fact that, once we have neared that point, we can predict past it; but that apparent defeat is unconvincing, since by then we simply have more information.
However, arguments that we will never predict past a certain point need to justify why our prediction ability will in fact get worse over time.
comment by Mis-Understandings (robert-k) · 2025-03-29T17:57:20.500Z · LW(p) · GW(p)
A quick thought on germline engineering, riffing off of https://www.wired.com/story/your-next-job-designer-baby-therapist/, which should not be all that surprising
Even if germline engineering is very good, and so before a child is born we have a lot of control about how things will turn out, once the child is born people will need to change their thinking because they do not have that control any longer. Trying to keep that control for a long time is probably a bad idea. Similarly, if things are not as expected, action as always should be taken on the world as it turned out, not as you planned it. No amount of gene engineering will be so powerful that social factors are completely irrelevant.
comment by Mis-Understandings (robert-k) · 2025-03-28T14:08:45.382Z · LW(p) · GW(p)
I noticed a possible training artifact that might exist in LLMs, but I am not sure what the testable prediction is. That is to say, I would think that the lowest-loss model for the training tasks will be doing things in the residual stream for the benefit of future tokens, not just the column-aligned token.
1. The residuals are translation invariant.
2. The gradient is the gradient of the overall loss.
3. Therefore, when taking the gradient through the attention heads, the residuals at past tokens receive gradient from the total loss, not just from the loss at the column aligned with them.
Thus we would expect to see some computation being devoted to tokens further ahead in the residual stream (where that is efficient).
This explains why we see lookahead in autoregressive models
Replies from: nostalgebraist, Vladimir_Nesov, dtch1997↑ comment by nostalgebraist · 2025-03-28T17:00:23.969Z · LW(p) · GW(p)
If I understand what you're saying here, it's true but fairly well-known? See e.g. footnote 26 of the post "Simulators [LW(p) · GW(p)]."
My favorite way of looking at this is:
The usual intuitive view of causal attention is that it's an operation that "looks back" at earlier positions. At each position i, it computes a "query" based on information from position i, and this query is used to search over "keys and values" computed at positions i-1, i-2, etc. (as well as i itself).
OK, so at each position, attention computes a query. What makes a query "good"? Well, a good query is one that will "do something useful" in conjunction with keys and values computed at earlier positions.
But attention is also computing keys and values at each position. What makes a key or value "good"? Precisely that it will "do something useful" in conjunction with the queries computed at later positions!
The latter observation is just the flipside of the former. Queries at position i are encouraged to do useful lookback, on average over the "pasts" (i-1, ...) encountered in training; keys and values at position i are encouraged to be useful for the lookbacks performed by later queries, on average over the "futures" (i+1, ...) encountered in training.
This is complicated slightly by the fact that causal attention lets positions attend to themselves, but it's easy to see that this is not a huge deal in practice. Consider that the keys and values computed at position i get used by...
- ...the attention operation at position i, when it attends to itself (along with all earlier positions)
- ...the attention operation at positions i+1, i+2, ..., when they "look back" to position i
The K and V weights get gradients from all of these positions. So for a context window of size N, on average the gradient will be a sum over ~N/2 terms from future positions, plus just a single term from the current position. Since N >> 2 in practice, all else being equal we should expect this sum to be dominated by the future terms.
(Moreover, note that the keys and values are more useful at future positions than at the current position, giving us even more reason to expect them to be mainly computed for the sake of future positions rather than the current one. The current position "already knows about itself" and doesn't need attention to move information from itself to itself, whereas future positions can only learn about the current position by attending to it.
Sometimes there may be a computational role for a position attending to itself – such as doing something by default if nothing else "matched" a query – but all of the "magic" of attention is in the way it can move information between positions. Note that a self-attention layer which could only attend to the current position would just be equivalent to a linear layer.)
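A toy check of this gradient flow (a single random attention head, not any particular architecture): putting a loss only on the final position still sends gradient back to the keys and values, and hence the residual stream, at earlier positions.

```python
import torch

torch.manual_seed(0)
T, d = 6, 8
x = torch.randn(T, d, requires_grad=True)   # stand-in for the residual stream
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

q, k, v = x @ Wq, x @ Wk, x @ Wv
mask = torch.tril(torch.ones(T, T)).bool()  # causal: attend to positions <= i
scores = (q @ k.T / d ** 0.5).masked_fill(~mask, float("-inf"))
out = scores.softmax(-1) @ v

loss = out[-1].pow(2).sum()                 # loss at the last position only
loss.backward()

# Position 0 receives gradient purely because its keys and values feed the final
# position's lookback; its own "column" contributed no loss term of its own.
print(x.grad[0].abs().sum() > 0)   # tensor(True)
```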
↑ comment by Vladimir_Nesov · 2025-03-28T17:31:04.757Z · LW(p) · GW(p)
Hence "next token predictor" is a bit of a misnomer, as computation on any given token will also try to contribute to prediction of distant future tokens, not just the next one.
↑ comment by Daniel Tan (dtch1997) · 2025-03-28T16:08:25.759Z · LW(p) · GW(p)
I found this hard to read. Can you give a concrete example of what you mean? Preferably with a specific prompt + what you think the model should be doing
Replies from: robert-k↑ comment by Mis-Understandings (robert-k) · 2025-03-28T19:53:25.701Z · LW(p) · GW(p)
- This is powerful evidence that even though models are trained to output one word at a time, they may think on much longer horizons to do so.
from Anthropic's most recent release was mainly the thought.
I was trying to work out how that behaviour shows up.
comment by Mis-Understandings (robert-k) · 2025-03-28T00:53:52.607Z · LW(p) · GW(p)
I am trying to work through a problem which suggests that every moral theory is a class of consequentialism.
In short, I cannot figure out how you argue which actions actually fall into a particular moral category without appealing to consequences.
Curling my finger is only shooting someone when I am holding a gun and pointing it at someone. We judge the two cases very differently.
In short, the classing of actions for moral analysis is hard.
Replies from: Phiwip↑ comment by Phiwip · 2025-03-28T17:26:21.326Z · LW(p) · GW(p)
I'm not sure the example you provide is actually an example of appealing to consequences. To me it seems more like looking at an action at different levels of abstraction rather than starting to think about consequences, although I do think the divide can be unclear. I don't do philosophy so my ideas of how things may be defined are certainly messy and may not map on well to their technical usage, but I'm reminded of the bit in the sequences regarding Reductionism [? · GW], where you certainly can have an accurate model of a plane dealing just with particle fields and fundamental forces, but that doesn't mean the plane doesn't have wings.
This is to say I think that you can have different moral judgements about an action at different levels of abstraction, but that's something that can happen before and/or separately from thinking about the consequences of that action.
Replies from: robert-k, robert-k↑ comment by Mis-Understandings (robert-k) · 2025-03-28T20:37:29.401Z · LW(p) · GW(p)
The idea of abstraction is generally a consequentialist formulation. A thing is an abstraction of a system when it predicts the same consequences as the system produces. Abstraction of moral values would need exactly "moral judgements about an action at different levels of abstraction" to behave properly as collections.
↑ comment by Mis-Understandings (robert-k) · 2025-03-28T20:34:24.483Z · LW(p) · GW(p)
you can have different moral judgements about an action at different levels of abstraction
Is philosophically disputed at a deep level.
(some) Moral realists would disagree with you.
comment by Mis-Understandings (robert-k) · 2025-03-24T03:12:26.868Z · LW(p) · GW(p)
random thought
There is probably an interface problem in the switch between self-driving and manual modes on vehicles
1. We think the driver should be able to take over at all times.
2. There is a hard problem: mirroring the system's inputs on the steering wheel is really hard, as is allowing for smooth takeover of control in the middle of maneuvers.
3. Near-crash, where we want takeover to happen, is also a time when moment-to-moment maneuvering is critical.
4. Timeouts work terribly. (Tesla is already under fire.)
5. But then how do you verify the changeover in control?
There should be a better solution, and I cannot see it (much less figure out how to test which one Tesla et al have picked)
You could probably figure this out with 2000 simulator hours, but I am not sure how to limit the driver learning (since the driver could learn the transition in that time)
comment by Mis-Understandings (robert-k) · 2025-03-19T03:34:28.508Z · LW(p) · GW(p)
For contemporary systems
We want there to be a low probability that there is a situation where a model does something with bad or very bad consequences (both in deployment and in training).
In training we do that by limiting the consequences of tokens to gradient updates in the model (for pretraining), or by building a secure RL environment in which the model can only act within that environment.
After training, model companies have limited control over the environment (especially for open models)
How to operationalize this?
Some situations are likely, some are not; how do we formalize that?
What would a model likely not do? Well, if we have an example, we can ask the model itself
(since in teacher-forcing mode we get probabilities for every token of a known token stream), and P(all tokens) = P(all but last) * P(last | all but last), where P(last | all but last) is exactly the model probability at the last position; this applies recursively to every prefix.
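That factorization can be computed in a single teacher-forcing pass. A minimal sketch with GPT-2 via Hugging Face transformers (an assumed setup; any causal LM that exposes per-token log-probs would do):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def trajectory_log_prob(text):
    """log P(tokens) = sum_t log P(token_t | tokens_<t), scored in one forward pass."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                  # (1, T, vocab)
    log_probs = logits[:, :-1].log_softmax(-1)      # predictions for tokens 1..T-1
    targets = ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp.sum().item()                    # conditioned on the first token

print(trajectory_log_prob("The assistant refuses the request and explains why."))
```

Searching for an input that makes a given bad trajectory plausible, as described next, would then mean ascending this quantity with respect to a relaxed representation of the input.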
If we have example bad paths, we can try to see if there is an input that would make them likely (through gradient descent on the plausibility of that trajectory).
Then the question is: can we generate realistic trajectories containing bad behaviour?
How to formalize "is there a realistic input for which the trajectory is bad"?
The property we want is probably something like: for all inputs that are plausible under the pretrained model, and for all trajectories that are bad, the model would not plausibly follow them.
But we cannot check this directly, since plausibility comes from our pretrained model (we do not know all plausible inputs; that would require a good world model),
and we do not know all bad trajectories (just some).
comment by Mis-Understandings (robert-k) · 2025-03-13T22:27:15.062Z · LW(p) · GW(p)
I was reading Towards Safe and Honest AI Agents with Neural Self-Other Overlap
and I noticed a problem with it
It also penalizes realizing that other people want different things than you, forcing an overlap between (things I like) and (things you will like). This means, first, that it will be forced to reason as if it likes what you like, which is a positive. But it will also likely overlap (you know what is best for you) and (I know what is best for me), which might lead to stubbornness; worse, it could also get (I know what is best for you) overlapping both of them, which might be really bad. (You know what is best for me) is actually fine, though, since that is basically the basis of corrigibility.
We still need to be concerned that the model will reason symmetrically but assign to us different values than we actually have, and thus exhibit patronizing but misaligned behaviour.
comment by Mis-Understandings (robert-k) · 2025-03-13T15:46:47.142Z · LW(p) · GW(p)
We seem to think that people will develop AGI because it can undercut labor on pricing.
But with Sam Altman talking about $20,000/month agents, that is not actually that much cheaper than a fully loaded software engineer. If that agent only replaces a single employee, it does not seem cheaper if the cost overruns even a little more, to $40,000/month.
That is to say, if AGI lands 2.5 OOM above the current cost to serve of ChatGPT Pro, it is not cheaper than hiring low- or mid-level employees.
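Rough arithmetic behind that claim, using the widely reported $200/month ChatGPT Pro price as a stand-in for cost to serve (an assumption; the true serving cost is not public):

```python
base = 200                       # $/month; ChatGPT Pro price as a proxy for cost to serve
print(base * 10 ** 2)            # 2.0 OOM up:  20,000 $/month (the quoted agent price)
print(round(base * 10 ** 2.5))   # 2.5 OOM up: ~63,246 $/month, above a fully loaded engineer
```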
But it still might have advantages
First, you can buy more subscriptions by negotiating a higher parallelism or API limit through enterprise sales, which means you need a 1-2 week negotiation rather than a 3-6 month hiring process to increase headcount.
The trillion dollar application of AI companies is not labor, it is hiring. If it turns out that provisioning AI systems is hard, they lose their economic appeal unless you plan on dogfooding like the major labs are doing.
Replies from: Phiwip↑ comment by Phiwip · 2025-03-13T20:12:41.686Z · LW(p) · GW(p)
Are you expecting the Cost/Productivity ratio of AGI in the future to be roughly the same as it is for the agents Sam is currently proposing? I would expect that as time passes, the capabilities of such agents will vastly increase while they also get cheaper. This seems to generally be the case with technology, and previous technology had no potential means of self-improving on a short timescale. The potential floor for AI "wages" is also incredibly low compared to humans.
It definitely is worth also keeping in mind that AI labor should be much easier to scale than human labor in part because of the hiring issue, but a relatively high(?) price on initial agents isn't enough to update me away from the massive potential it has to undercut labor.
↑ comment by Mis-Understandings (robert-k) · 2025-03-13T22:15:43.728Z · LW(p) · GW(p)
The floor for AI wages is always going to be whatever the market will bear; the question is how much margin the AGI developer will be able to take, which depends on how much AGI models commoditize and how much pricing power the lab retains, not on the cost to serve except as a floor. We should not expect otherwise.
There is a cost for AGI at which humans are competitive.
If AGI becomes competitive only at capital costs that no firm can raise, it is not competitive, and we will be waiting on algorithmic improvements again.
Algorithmic improvement is not predictable by me, so I have a wide spread there.
I do think that provisioning vs hiring and flexibility in retasking will be a real point of competition, in addition to raw prices
I think we agree that AGI has the potential to undercut labor. My spread was roughly 5% uneconomical, 20% right for some actors, 50% large displacement, and 25% total winning, and I was trying to work out what levels of pricing look uneconomical and which frictions are important to the comparison.
comment by Mis-Understandings (robert-k) · 2025-03-10T03:04:17.414Z · LW(p) · GW(p)
To what extent should we expect catastrophic failure from AI to mirror other engineering disasters/ have applicable lessons from Safety engineering as a field?
I would think that 1. everything is sui generis and 2. things often rhyme, but it is unclear how approaches will translate.
Replies from: Buck