Comments
Are we missing a notion of "simulacrum level 0"? That is, in order to accurately describe the truth, we need some method of synchronizing on a common language. At the beginning of a human society, this can be basic stuff like pointing at objects and making sounds in order to establish new words. But I would also be inclined to say that more abstract stuff, like discussing the purpose of using the words or planning truth-determination procedures, goes in simulacrum level 0. I'd say the entire discussion of simulacrum levels goes within simulacrum level 0.
Or if simulacrum levels aren't exactly the right term, here's what I have in mind as levels of communication:
- Synchronizing (level 0): establishing and maintaining the meaning of terms for describing the world
- Objective (level 1): truthfully describing the world to the best of one's ability
- Manipulative (level 2): saying known false or unfounded things to exploit others' use of language in order to control them
- Framing (level 3): the norms for maintaining truth no longer succeed, but they are still in operation and punish or reward people, so people try to act in ways that maintain their reputation despite them not tracking truth
- Activating (level 4): the norms for maintaining truth are no longer in place, but some systems still rely on the old symbolic language as keywords to perform certain behaviors, so language is still used to interface with these systems
Yeah, this seems like a reasonable restatement of my question.
I guess my main issue with this approach is that extrapolating the distribution of activations from a dataset isn't what I'd consider the hard part of alignment. Rather, it would be:
- Detecting catastrophic outputs and justifying their catastrophicness to others. (In particular, I suspect no individual output will be catastrophic on the margin regardless of whether catastrophe will occur. Either the network will consistently avoid giving catastrophic outputs, or it will sufficiently consistently be harmful that localizing the harm to 1 output will not be meaningful.)
- Learning things about the distribution of inputs that cannot be extrapolated from any dataset. (In particular, the most relevant short-term harm I've noticed would be stuff like young nerds starting to see the AI as a sort of mentor and then having their questionable ideas excessively validated by this mentor rather than receiving appropriate pushback. This would be hard to extrapolate from a dataset, even though it is relatively obvious if you interact with certain people. Though whether that counts as "catastrophic" is a complicated question.)
This is kind of vague. Doesn't this start shading into territory like "it's technically not bad to kill a person if you also create another person"? Or am I misunderstanding what you are getting at?
Population ethics is the most important area within utilitarianism, but utilitarian answers to population ethics are all wrong, so utilitarianism is an incorrect moral theory.
You can't weasel your way out by calling it an edge case or saying that utilitarianism "usually" works, when really it's the most important moral question. All the other big-impact utilitarian conclusions derive from population ethics, since they tend to depend on large populations of people.
Utilitarianism can at best be seen as something like a Taylor expansion that's valid only for questions whose impact on the total population is negligible.
Maybe to expand: In order to get truly good training loss on an autoregressive training objective, you probably need to have some sort of intelligence-like or agency-like dynamic. But much more importantly, you need a truly vast amount of knowledge. So most of the explanation for the good performance comes from the knowledge, not the intelligence-like dynamic.
(Ah, but intelligence is more general, so maybe we'd expect it to show up in lots of datapoints, thereby making up a relatively big chunk of the training objective? I don't think so, for two reasons: 1) a lot of datapoints don't really require much intelligence to predict, 2) there are other not-very-intelligence-requiring things like grammar or certain aspects of vocabulary which do show up in a really big chunk.)
Would "the neural network has learned a lookup table with a compressed version of the dataset and interpolates on that in order to output its answers" count as an explanation of the low dataset loss?
(Note, this phrasing kind of makes it sound too simple. Since the explanations you are seeking presumably don't come with the dataset baked-in as a thing they can reference primitively, presumably the actual formal statement would need to include this entire compressed lookup table. Also, I'm imagining a case where there isn't really a "compression algorithm" because the compression is intimately tied up with the neural network itself, and so it's full of ad-hoc cases.)
Like I guess from an alignment perspective this could still be useful because it would be nice to know to what extent "bag of heuristics" holds, and this is basically a formalization of that. But at the same time, I already roughly speaking (with lots of asterisks, but not ones that seem likely to be addressed by this) expect this to hold, and it doesn't really rule out other dangers (like those heuristics could interact in a problematic way), so it seems kind of like it would just lead to a dead-end from my perspective.
If this is meant to be a weakening of NP vs co-NP, what do you make of the stronger statement that NP = co-NP? As I understand it, most complexity theorists think this is false. Do you have any reason to think that your conjecture is much, much more likely to hold than NP = co-NP, or do you also think NP = co-NP could hold?
Maybe I'm missing something, but if we are estimating $P(X_i)$, how can we also have $X_i$ on the RHS?
These probabilities are used for scoring predictions over the observed variables once the market resolves, so at that point we "don't need" $P(X_i)$ because we already know what $X_i$ is. The only reason we compute it is so we can reward people who got the prediction right long ago before $X_i$ was known.
And what is the adjustment $+(1-X_i)(1-q_{i,j})$? Why is that there?
$X_i q_{i,j} + (1-X_i)(1-q_{i,j})$ is equivalent to "$q_{i,j}$ if $X_i = 1$; $1-q_{i,j}$ if $X_i = 0$". It's basically a way to mathematize the "contingency table" aspect.
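To make it concrete, here's a minimal sketch of that scoring term (variable names mirroring the $X_i$ and $q_{i,j}$ from the thread, my own toy numbers):

```python
# Minimal sketch: the score is just the probability you assigned to whatever
# actually happened, X*q + (1 - X)*(1 - q).

def resolution_score(X: int, q: float) -> float:
    """X is the resolved outcome (0 or 1), q is the predicted P(X = 1)."""
    return X * q + (1 - X) * (1 - q)

print(resolution_score(1, 0.9))  # 0.9: said 90% and it happened
print(resolution_score(0, 0.9))  # 0.1: said 90% and it didn't happen
```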
And they wouldn't be getting any profit. (In the updated comment, I noted it's only the profit that measures your trouble.)
Exports and imports are tricky but very important to take into account here because they have two important properties:
* They are "subtracted off" the GDP numbers in my explanation above (e.g. if you import a natural resource, then that would be considered part of the GDP of the other country, not your country)
* They determine the currency exchange rates (since the exchange rate must equal the ratio of imports to exports, assuming savings and bonds are negligible or otherwise appropriately accounted for) and thereby the GDP comparisons across different countries at any given time
Prices decompose into cost and profit. The profit is determined by how much trouble the purchaser would be in if the seller didn't exist (since e.g. if there's other sellers, the purchaser could buy from those). The cost is determined by how much demand there is for the underlying resources in other areas, so it basically is how much trouble the purchaser imposes on others by getting the item. Most products are either cost-constrained (where price is mostly cost) or high-margin (where price is mostly profit).
GDP is price times transaction volume, so it's the sum of total costs and total profits in a society. The profit portion of GDP reflects the extent to which the economy has monopolized activities into central nodes that contribute to fragility, while the cost portion of GDP reflects the extent to which the economy is resource-constrained.
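As a toy illustration of the decomposition (all sectors and numbers made up):

```python
# Toy illustration: GDP as price * volume summed over transactions,
# decomposed into a cost portion and a profit portion.

sectors = {
    # name: (unit_cost, unit_profit, transaction_volume)
    "groceries": (4.0, 0.2, 1000),  # cost-constrained: price is mostly cost
    "software":  (1.0, 9.0, 100),   # high-margin: price is mostly profit
}

total_cost = sum(cost * volume for cost, profit, volume in sectors.values())
total_profit = sum(profit * volume for cost, profit, volume in sectors.values())
gdp = total_cost + total_profit  # same as summing price * volume directly

print(f"GDP = {gdp}, cost portion = {total_cost}, profit portion = {total_profit}")
```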
The biggest costs in a modern economy are typically labor and land, and land is typically just a labor cost by proxy (land in the middle of nowhere is way cheaper, but it's harder to hire people there). The majority of the economy is cost-constrained, so for that majority, GDP reflects underpopulation. The tech sector and financial investment sector have high profit margins, which reflects their tendency to monopolize the management of resources.
Low GDP reflects slack. Because of diminishing marginal returns and queuing considerations, ideally one should have some slack, since then there's abundance of resources and easy competition, driving prices down and thus leading to low GDP at high quality of life. However, slack also leads to conflict because of reduced opportunity cost. This conflict can be reduced with policing, but that increases authoritarianism. This leads to a tradeoff between high GDP and high tension (as seen in the west) vs low GDP and high authoritarianism (as seen in the east) vs low GDP and high conflict (as seen in the south).
Hmm... Issue is it also depends on centralization. For a bunch of independent transactions, fragility goes up with the square root of the count rather than the raw count. In practice the transactions in a large economy are very much not independent, but the "troubles" might be.
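A quick simulation of the independent case (my own sketch, made-up numbers):

```python
import numpy as np

# Sketch of the independent case: each of n transactions contributes an
# independent shock with standard deviation 1. The spread of the total grows
# like sqrt(n), while the transaction count (and hence GDP) grows like n.

rng = np.random.default_rng(0)
for n in [100, 1_000, 10_000]:
    totals = rng.normal(0, 1, size=(1000, n)).sum(axis=1)  # 1000 simulated economies
    print(f"n={n}: std of total ~ {totals.std():.1f}, sqrt(n) = {np.sqrt(n):.1f}")
```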
It's elementary that the derivative approaches zero when one of the inputs to a softmax is significantly bigger than the others. Then when applying the chain rule, this entire pathway for the gradient gets knocked out.
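A quick numerical illustration (my own numpy sketch, with made-up logits):

```python
import numpy as np

# The softmax Jacobian is J_ij = s_i * (delta_ij - s_j). Once one logit
# dominates, s is nearly one-hot and every entry of J is ~0, so any gradient
# routed through this softmax gets knocked out by the chain rule.

def softmax_jacobian(logits):
    s = np.exp(logits - logits.max())
    s /= s.sum()
    return np.diag(s) - np.outer(s, s)

print(np.abs(softmax_jacobian(np.array([1.0, 0.0, 0.0]))).max())   # ~0.24
print(np.abs(softmax_jacobian(np.array([10.0, 0.0, 0.0]))).max())  # ~1e-4
```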
I don't know to what extent it comes up with modern day LLMs. Certainly I bet one could generate a lot of interpretability work within the linear approximation regime. I guess at some point it reduces to the question of why to do mechanistic interpretability in the first place.
Framing: Prices reflect how much trouble purchasers would be in if the seller didn't exist. GDP multiplies prices by transaction volume, so it measures the fragility of the economy.
I would be satisfied with integrated gradients too. There are certain cases where pure gradient-based attributions predictably don't work (most notably when a softmax is saturated) and those are the ones I'm worried about (since it seems backwards to ignore all the things that a network has learned to reliably do when trying to attribute things, as they are presumably some of the most important structure in the network).
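For concreteness, a minimal integrated-gradients sketch (toy torch function of my own choosing, not anyone's actual model), mainly to show why it still gives credit through a saturated softmax where the plain gradient doesn't:

```python
import torch

# Integrated gradients: the attribution for input i is (x_i - x0_i) times the
# average gradient along the straight path from the baseline x0 to x.

def f(x):
    return torch.softmax(x, dim=-1)[0]  # toy scalar output

def integrated_gradients(x, x0, steps=64):
    total = torch.zeros_like(x)
    for alpha in torch.linspace(0.0, 1.0, steps):
        xi = (x0 + alpha * (x - x0)).requires_grad_(True)
        (grad,) = torch.autograd.grad(f(xi), xi)
        total += grad
    return (x - x0) * total / steps

x = torch.tensor([10.0, 0.0, 0.0])
x0 = torch.zeros(3)
print(integrated_gradients(x, x0))  # sizable attribution to x[0]

xg = x.clone().requires_grad_(True)
print(torch.autograd.grad(f(xg), xg)[0])  # ~0 everywhere: the softmax is saturated at x
```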
I would be curious what you think of [this](https://www.lesswrong.com/posts/TCmj9Wdp5vwsaHAas/knocking-down-my-ai-optimist-strawman).
Ah, I see. I've gone and edited my rebuttal to be more forceful and less hedgy.
Strawman and steelman arguments are the same thing. It's just better to label them "strawman" rather than "steelman" so you don't overestimate their value.
I'm not sure what you mean by "K-means clustering baseline (with K=1)". I would think the K in K-means stands for the number of means you use, so with K=1, you're just taking the mean direction of the weights. I would expect this to explain maybe 50% of the variance (or less), not 90% of the variance.
But anyway, under my current model (roughly Why I'm bearish on mechanistic interpretability: the shards are not in the network + Binary encoding as a simple explicit construction for superposition) it seems about as natural to use K-means as it does to use SAEs, and not necessarily an issue if K-means outperforms SAEs. If we imagine that the meaning is given not by the dimensions of the space but rather by regions/points/volumes of the space, then K-means seems like a perfectly cromulent quantization for identifying these volumes. The major issue is where we go from here.
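For reference, here's the kind of computation I'd assume is behind a "fraction of variance explained by K-means" number, run on synthetic stand-in data (so the exact percentages here carry no evidence about the real activations):

```python
import numpy as np
from sklearn.cluster import KMeans

# Sketch of "fraction of variance explained by a K-means quantization" on
# synthetic stand-in data (NOT the real activations or SAE features). With
# K=1 the single centroid is just the overall mean vector, so the baseline is
# "reconstruct every activation by the mean".

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 64)) + 1.0  # stand-in activations with a shared mean component

for k in [1, 16, 256]:
    km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(X)
    reconstruction = km.cluster_centers_[km.labels_]
    explained = 1 - ((X - reconstruction) ** 2).sum() / (X ** 2).sum()
    print(k, round(float(explained), 3))
```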
Is there really some particular human whose volition you'd like to coherently extrapolate over eternity but where you refrain because you're worried it will generate infighting? Or is it more like, you can't think of anybody you'd pick, so you want a decision procedure to pick for you?
If there is some particular human, who is it?
If you wanted to have an unaligned LLM that doesn't abuse humans, couldn't you just never sample from it after training it to be unaligned?
How would such a validator react if you tried to hack the LLM by threatening to kill all humans unless it complies?
If so, it seems that all you need to do to detect any unwanted behaviour from a superintelligent system is to feed all output from constituent LLMs to a simpler LLM to detect output that looks like it's leading towards unaligned behaviour. Only once the output has been verified, pass it on to the next system (including looping it back to itself to output more tokens). If it fails verification immediately stop the whole system.
What is unaligned behavior and what does output that leads to it look like?
Correspondingly the importance I assign to increasing the intelligence of humans has drastically increased.
I feel like human intelligence enhancement would increase capabilities development faster than alignment development, maybe unless you've got a lot of discrimination in favor of only increasing the intelligence of those involved with alignment.
Those are sort of counterstatements against doom, explaining that you don't see certain problems that doomers raise. But the OP is more of an attempt to make an independently standing argument about what is present.
It's still not obvious to me why adversaries are a big issue. If I'm acting against an adversary, it seems like I won't make counter-plans that lead to lots of side-effects either, for the same reasons they won't.
I mean, we can start by noticing that historically, optimization in the presence of adversaries has led to huge things. The world wars wrecked Europe. States and large bureaucratic organizations probably exist mainly as a consequence of farm raids. The immune system tends to stress out the body a lot when it is dealing with an infection. The nuclear arms race never actually triggered a nuclear war, but it created existential risk for humanity, and even though the destruction never came, it still made people quite afraid of e.g. nuclear power. Etc.
Now, why does trying to destroy a hostile optimizer tend to cause so much destruction? I feel like the question almost answers itself.
Or if we want to go mechanistic about it, one of the ways to fight back against the Nazis is with bombs, which deliver a sudden shockwave of energy that destroys Nazi structures and everything else. It's almost constitutive of the alignment problem: we have a lot of ways of influencing the world a great deal, but those methods do not discriminate between good and evil/bad.
From an abstract point of view, many coherence theorems rely on e.g. Dutch books, and thus become much more applicable in the case of adversaries. The coherence theorem "if an agent achieves its goals robustly regardless of environment, then it stops people who want to shut it down" can be trivially restated as "either an agent does not achieve its goals robustly regardless of environment, or it stops people who want to shut it down", and here non-adversarial agents should obviously choose the former branch (to be corrigible, you need to not achieve your goals in an environment where someone is trying to shut you down).
From a more strategic point of view, when dealing with an adversary, you tend to become a lot more constrained on resources because if the adversary can find a way to drain your resources, then it will try to do so. Ways to succeed include:
- Making it harder for people to trick you into losing resources, by e.g. making it harder for people to predict you, being less trusting of what people tell you, and winning as quickly as possible
- Gaining more resources by grabbing them from elsewhere
Also, in an adversarial context, a natural prior is that inconveniences are there for a reason, namely to interfere with you. This tends to make enemies.
I think mesa-optimizers could be a major problem, but there are good odds we live in a world where they aren't. Why do I think they're plausible? Because optimization is a pretty natural capability, and a mind being/becoming an optimizer at the top level doesn't seem like a very complex claim, so I assign decent odds to it. There's some weak evidence in favour of this too, e.g. humans not optimizing for what the local, myopic evolutionary optimizer acting on them is optimizing for, coherence theorems, etc. But that's not super strong, and there are other simple hypotheses for how things go, so I don't assign more than like 10% credence to the hypothesis.
Mesa-optimizers definitely exist to varying degrees, but they generally try not to get too involved with other things. Mechanistically, we can attribute this to imitation learning, since they're trying to mimic humans' tendency to stitch together strategies in a reasonable way. Abstractly, the friendliness of instrumental goals shows us why unbounded unfriendly utility maximizers are not the only or even the main attractor here.
(... Some people might say that we have a mathematical model of unbounded unfriendly utility maximizers but not of friendlier bounded instrumental optimizers. But those people are wrong because the model of utility maximizers assumes we have an epistemic oracle to handle the updating, prediction and optimization for us, and really that's the computationally heavy part. One of the advantages of more bounded optimization like in the OP is that it ought to be more computationally tractable because different parts of the plans interfere less with each other. It's not really fair to say that we know how utility maximizers work when they outsource the important part to the assumptions.)
Gases typically aren't assembled by trillions of repetitions of isolating an atom and inserting it into a container. Gas canisters are (I assume) assembled by e.g. compressing some reservoir (even simply a fraction of the atmosphere) or via a chemical reaction that produces the gas, and in these cases such procedures constitute the long-tailed variable that I am talking about in this series. (They are large relative to the individual particle velocities, and the particle velocities are a diminished form of the creation procedure, as e.g. some ways of creating the gas leave it hotter.) Gases in nature also have long-tailed causes, e.g. the atmosphere is collected due to the Earth's gravitational pull. (I think particles in outer space would technically not constitute a gas, but their velocities are AFAIK long-tailed due to coming from quasars and such.)
Generally you wouldn't since it's busy using that matter/energy for whatever you asked it to do. If you wanted to use it, presumably you could turn down its intensity, or maybe it exposes some simplified summary that it uses to coordinate economies of scale.
Once you start getting involved with governance, you're going to need law enforcement and defense, which is an adversarial context and thus means the whole instrumental goal niceness argument collapses.
If you're assuming that verification is easier than generation, you're pretty much a non-player when it comes to alignment.
I'm not interested in your key property, I'm interested in a more proper end-to-end description. Like superficially this just sounds like it immediately runs into the failure mode John Wentworth described last time, but your description is kind of too vague to say for sure.
I was considering doing something like this, but I kept getting stuck at the issue that it doesn't seem like gradients are an accurate attribution method. Have you tried comparing the attribution made by the gradients to a more straightforward attribution based on the counterfactual of enabling vs disabling a network component, to check how accurate they are? I guess I would especially be curious about its accuracy on real-world data, even if that data is relatively simple.
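Roughly the kind of comparison I have in mind, as a sketch (toy torch model and hypothetical setup, not your actual code):

```python
import torch
import torch.nn as nn

# Attribute the output to one input feature via gradient * input, and via the
# counterfactual of zeroing that feature, then compare. The two agree only to
# the extent the network is locally linear in that feature.

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.GELU(), nn.Linear(16, 1))
x = torch.randn(1, 8, requires_grad=True)

out = model(x)[0, 0]
(grad,) = torch.autograd.grad(out, x)

j = 3
grad_attr = (grad[0, j] * x[0, j]).item()              # gradient * input attribution

x_ablated = x.detach().clone()
x_ablated[0, j] = 0.0                                  # disable feature j
ablation_attr = (out - model(x_ablated)[0, 0]).item()  # counterfactual attribution

print(grad_attr, ablation_attr)  # generally not equal
```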
I don't understand your whole end-to-end point, like how does this connect to making AIs produce texts on alignment, and how does that lead to a pivotal act?
For the former I'd need to hear your favorite argument in favor of the neurosis that inner alignment is a major problem.
For the latter, in the presence of adversaries, every subgoal has to be robust against those adversaries, which is very unfriendly.
I don't really understand how you expect this line of thought to play out. Are you arguing e.g. Sam Altman would start using OpenAI to enforce his own personal moral opinions, even when they are extremely unpopular?
I don't think the people who develop AGI have clear or coherent wishes for how the AGI should treat most other people.
Most places where AI or alignment are applied are more convoluted cases where lots of people are involved. It's generally not economically feasible to develop AGI for a single person, so it doesn't really happen.
If you want AIs to produce a lot of text on AI alignment and moral philosophy, you can already do that now without worrying that the AIs in question will take over the world.
If you want to figure out how to achieve good results when making the AI handle various human conflicts, you can't really know how to adapt and improve it without actually involving it in those conflicts.
However, if something like the plan from John Wentworth's post worked, this would be a really useful way to make automated AI alignment schemes that are safe, so I do think it indirectly gives us safety.
How?
Also, entirely removing inner alignment problems/mesa optimization should cut down doom probabilities, especially for Eliezer Yudkowsky and John Wentworth, so I'd encourage you to write up your results on that line of argument anyway.
I didn't really get any further than John Wentworth's post here. But also I've been a lot less spooked by LLMs than Eliezer Yudkowsky.
Pursuit of money is an extremely special instrumental goal whose properties you shouldn't generalize to other goals in your theory of instrumental convergence. (And I could imagine it should be narrowed down further, e.g. into those who want to support the state vs those who want money by whichever means including scamming the state.)
Is your mother currently spending a lot of her time writing novels?
What I eventually realized is that this line of argument is a perfect rebuttal of the whole mesa-optimization neurosis that has popped up, but it doesn't actually give us AI safety because it completely breaks down once you apply it to e.g. law enforcement or warfare.
(Certifications and regulations promise to solve this, but they face the same problem: they don't know what requirements to put up, an alignment problem.)
Thesis: Everything is alignment-constrained, nothing is capabilities-constrained.
Examples:
- "Whenever you hear a headline that a medication kills cancer cells in a petri dish, remember that so does a gun." Healthcare is probably one of the biggest constraints on humanity, but the hard part is in coming up with an intervention that precisely targets the thing you want to treat, I think often because knowing what exactly that thing is is hard.
- Housing is also obviously a huge constraint, mainly due to NIMBYism. But the idea that NIMBYism is due to people using their housing for investments seems kind of like a cope, because then you'd expect that when cheap housing gets built, the backlash is mainly about dropping investment value. But the vibe I get is people are mainly upset about crime, smells, unruly children in schools, etc., due to bad people moving in. Basically high housing prices function as a substitute for police, immigration rules and teacher authority, and those in turn are compromised less because we don't know how to e.g. arm people or discipline children, and more because we aren't confident enough about the targeting (alignment problem), and because we have a hope that bad people can be reformed if we could just solve what's wrong with them (again an alignment problem, because that requires defining what's wrong with them).
- Education is expensive and doesn't work very well; a major constraint on society. Yet those who get educated do get given exams which assess whether they've picked up stuff from the education, and they perform reasonably well. Seems a substantial part of the issue is that they get educated in the wrong things, an alignment problem.
- American GDP is the highest it's ever been, yet its elections are devolving into choosing between scammers. It's not even a question of ignorance, since it's pretty well-known that it's scammy (consider also that patriotism is at an all-time low).
Exercise: Think about some tough problem, then think about what capabilities you need to solve that problem, and whether you even know what the problem is well enough that you can pick some relevant capabilities.
From your talk on tensors, I am sure it will not surprise you at all to know that the sandwich thing itself (mapping from operators to operators) is often called a superoperator.
Oh it does surprise me, superoperators are a physics term but I just know linear algebra and dabble in physics, so I didn't know that one. Like I'd think of it as the functor over vector spaces that maps .
I think the reason it is the way it is is that there isn't a clear line between operators that modify the state and those that represent measurements. For example, the Hamiltonian operator evolves the state with time. But taking the trace of the Hamiltonian operator applied to the state gives the expectation value of the energy.
Hm, I guess it's true that we'd usually think of the matrix exponential as mapping $\psi$ to $e^{-iHt}\psi$, rather than as mapping $\rho$ to $e^{-iHt}\rho\,e^{iHt}$. I guess it's easy enough to set up a differential equation for the latter, but it's much less elegant than the usual form.
Yes, applying a (0, 2) tensor to a (2, 0) tensor is like taking the trace of their composition if they were both regarded as linear maps.
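Spelled out in components (my own notation), writing $\Omega$ and $T$ for the matrices of components, $\Omega_{ij} = \omega_{ij}$ and $T_{ij} = T^{ij}$:

```latex
\omega(T) \;=\; \omega_{ij}\,T^{ij}
\;=\; \operatorname{tr}\!\left(\Omega^{\top} T\right)
\;=\; \operatorname{tr}\!\left(\Omega\, T\right) \quad \text{when } \Omega = \Omega^{\top}.
```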
Anyway for operators that are supposed to modify a state, like annihilation/creation or time-evolution, I would be inclined to model it as linear maps/(1, 1)-tensors like in the OP. It was specifically for observables that I meant it seemed most natural to use (0, 2) tensors.
It's a density matrix to density matrix map.
I thought they were typically wavefunction to wavefunction maps, and they need some sort of sandwiching to apply to density matrices?
Dominance is (a certain kind of) nonlinearity on a single locus, epistasis is nonlinearity across different loci.
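A minimal toy model of the distinction (my notation; $x_1, x_2 \in \{0, 1, 2\}$ are allele counts at two loci):

```latex
y \;=\; \mu
\;+\; \underbrace{a_1 x_1 + a_2 x_2}_{\text{additive}}
\;+\; \underbrace{d_1\,\mathbf{1}[x_1 = 1]}_{\text{dominance: nonlinearity within locus 1}}
\;+\; \underbrace{e_{12}\,x_1 x_2}_{\text{epistasis: nonlinearity across loci}}
```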
I feel like for observables it's more intuitive for them to be (0, 2) tensors (bilinear forms) whereas for density matrices it's more intuitive for them to be (2, 0) tensors. But maybe I'm missing something about the math that makes this problematic, since I haven't done many quantum calculations.
The way I (computer scientist who dabbles in physics, so YMMV I might be wrong) understand the physics here:
- Feynman diagrams are basically a Taylor expansion of a physical system in terms of the strength of some interaction,
- To avoid using these Taylor expansions for everything, one tries to modify the parameters of the model to take a summary of the effects into account; for instance one distinguishes between the "bare mass", which doesn't take various interactions into account, versus the "effective mass", which does,
- Sometimes e.g. the Taylor series don't converge (or some integrals people derived from the Taylor expansions don't converge), but you know what the summary parameters turn out to be in the real world, and so you can just pretend the calculations do converge into whatever gives the right summary parameters (which makes sense if we understand the model is just an approximation given what's known and at some point the model breaks down).
Meanwhile, for ML:
- Causal scrubbing is pretty related to Taylor expansions, which makes it pretty related to Feynman diagrams,
- However, it lacks any model for the non-interaction/non-Taylor-expanded effects, and so there's no parameters that these Taylor expansions can be "absorbed into",
- While Taylor expansions can obviously provide infinite detail, nobody has yet produced any calculations for causal scrubbing that fail to converge rather than simply being unreasonably complicated. This is partly because without the model above, there's not many calculations that are worth running.
I've been thinking about various ideas for Taylor expansions and approximations for neural networks, but I kept running in circles, and the main issue I've ended up with is this:
In order to eliminate noise, we need to decide what really matters and what doesn't really matter. However, purely from within the network, we have no principled way of doing so. The closest we get is what affects the token predictions for the network, but even that contains too many unimportant parameters, because if e.g. the network goes off on a tangent but then returns to the main topic, maybe that tangent didn't matter and we're fine with the approximation discarding it.
As a simplified version of this objection, consider that the token probabilities are not the final output of the network, but instead the tokens are sampled and fed back into the network, which means that really the final layer of the network is connected back to the first layer through a non-differentiable function. (The non-differentiability interferes with any interpretability method based on derivatives....)
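A minimal illustration of that break (torch sketch under the standard sampling setup):

```python
import torch

# The sampled token is an integer index, so the computation graph stops there,
# and nothing that happens after the token is fed back in can send gradients
# to the layers that produced the logits.

logits = torch.randn(5, requires_grad=True)
probs = torch.softmax(logits, dim=-1)
token = torch.multinomial(probs, num_samples=1)  # integer tensor, no grad_fn

print(probs.requires_grad)  # True  -- still connected to the logits
print(token.requires_grad)  # False -- gradient chain ends here
print(token.grad_fn)        # None
```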
What we really want to know is the impact of the network in real-world scenarios, but it's hard to notice the main consequences of the network, and even if we could, it's hard to set up measurable toy models of them. Once we had such toy models, it's unclear whether we'd even need elaborate techniques for interpreting them. If, for instance, Claude is breaking a generation of young nerds by responding "Very insightful!" to any nonsensical thing they say, that doesn't really need any advanced interpretability techniques to be understood.
Oh I meant a (2, 0) tensor.