Comments
I'm not convinced Scott Alexander's mistakes page accurately tracks his mistakes. E.g. the mistake on it I know the most about is this one:
> 56: (5/27/23) In Raise Your Threshold For Accusing People Of Faking Bisexuality, I cited a study finding that most men’s genital arousal tracked their stated sexual orientation (ie straight men were aroused by women, gay men were aroused by men, bi men were aroused by either), but women’s genital arousal seemed to follow a bisexual pattern regardless of what orientation they thought they were - and concluded that although men’s orientation seemed hard-coded, women’s orientation must be more psychological. But Ozy cites a followup study showing that women (though not men) also show genital arousal in response to chimps having sex, suggesting women’s genital arousal doesn’t track actual attraction and is just some sort of mechanical process triggered by sexual stimuli. I should not have interpreted the results of genital arousal studies as necessarily implying attraction.
But that's basically wrong. The study found women's arousal to chimps having sex to be very close to their arousal to nonsexual stimuli, and far below their arousal to sexual stimuli.
I mean I don't really believe the premises of the question. But I took "Even if you're not a fan of automating alignment, if we do make it to that point we might as well give it a shot!" to imply that even in such a circumstance, you still want me to come up with some sort of answer.
Life on earth started 3.5 billion years ago. Log_2(3.5 billion years/1 hour) = 45 doublings. With one doubling every 7 months, that makes 26 years, or in 2051.
(Obviously this model underestimates the difficulty of getting superalignment to work. But also, extrapolating the METR trend out for 45 doublings is dubious in an unknown direction. So whatever.)
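A quick check of the arithmetic (assuming the current horizon is about 1 hour, a doubling every 7 months, and a 2025 starting point):

```python
import math

doublings = math.log2(3.5e9 * 365.25 * 24)    # hours in 3.5 billion years
years = doublings * 7 / 12                    # one doubling per 7 months
print(round(doublings), round(2025 + years))  # ~45 doublings, ~2051
```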
I talk to geneticists (mostly on Twitter, or rather now BlueSky) and they don't really know about this stuff.
(Presumably there exists some standard text about this that one can just link to lol.)
I don't think so.
I'm still curious whether this actually happens.... I guess you can have the "propensity" be near its ceiling.... (I thought that didn't make sense, but I guess you sometimes have the probability of disease for a near-ceiling propensity be some number like 20% rather than 100%?) I guess intuitively it seems a bit weird for a disease to have disjunctive causes like this, but then be able to max out the risk at 20% with just one of the disjunctive causes? IDK. Likewise personality...
For something like divorce, you could imagine the following causes:
- Most common cause is you married someone who just sucks
- ... but maybe you married a closeted gay person
- ... or maybe your partner was good but then got cancer and you decided to abandon them rather than support them through the treatment
The genetic propensities for these three things are probably pretty different. If you've married someone who just sucks, then a counterfactually higher genetic propensity to marry people who suck might counterfactually lead to having married someone who sucks more. But a counterfactually higher genetic propensity to marry a closeted gay person probably wouldn't lead to counterfactually having married someone who sucks more, nor have much counterfactual effect on them being gay (because it's probably a nonlinear thing), so only the genetic propensity to marry someone who sucks matters.
In fact, probably the genetic propensity to marry someone who sucks is inversely related to the genetic propensity to divorce someone who encounters hardship, so the final cause of divorce is probably even more distinct from the first one.
Ok, more specifically, the decrease in the narrowsense heritability gets "double-counted" (after you've computed the reduced coefficients, those coefficients also get applied to those who are low in the first chunk and not just those who are high, when you start making predictions), whereas the decrease in the broadsense heritability is only single-counted. Since the single-counting represents a genuine reduction while the double-counting represents a bias, it only really makes sense to think of the double-counting as pathological.
It would decrease the narrowsense (or additive) heritability, which you can basically think of as the squared length of your coefficient vector, but it wouldn't decrease the broadsense heritability, which is basically the phenotypic variance in expected trait levels you'd get by shuffling around the genotypes. The missing heritability problem is that when we measure these two heritabilities, the former heritability is lower than the latter.
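To make the gap concrete, here's a minimal simulation sketch (a toy setup of my own, assuming a purely multiplicative two-chunk architecture and made-up allele frequencies): the simulated trait is strongly genetic, yet the best additive predictor captures almost none of the variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200_000, 20
G = rng.binomial(1, 0.5, size=(n, m)).astype(float)   # toy biallelic genotypes

# Hypothetical nonlinear architecture: the trait depends on the *product* of
# two centered "chunks" of genes, so the genetic signal is real but non-additive.
chunk1 = G[:, :10].sum(axis=1) - 5
chunk2 = G[:, 10:].sum(axis=1) - 5
genetic_value = chunk1 * chunk2                  # expected trait given the genotype
y = genetic_value + rng.normal(0, 1.5, size=n)   # plus environmental noise

# Broad-sense H^2: variance of expected trait levels across genotypes
H2 = genetic_value.var() / y.var()

# Narrow-sense h^2: variance captured by the best additive (linear) predictor
X = np.c_[np.ones(n), G]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
h2 = (X @ beta).var() / y.var()

print(f"broad-sense H^2 ~ {H2:.2f}, narrow-sense h^2 ~ {h2:.2f}")
# Roughly H^2 ~ 0.74 vs h^2 ~ 0.00, i.e. a large "missing heritability" gap.
```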
If some amount of heritability is from the second chunk, then to that extent, there's a bunch of pairs of people whose trait differences are explained by second chunk differences. If you made a PGS, you'd see these pairs of people and then you'd find out how specifically the second chunk affects the trait.
This only applies if the people are low in the first chunk and differ in the second chunk. Among the people who are high in the first chunk but differ in the second chunk, the logarithm of their trait level will be basically the same regardless of the second chunk (because the logarithm suppresses things by the total), so these people will reduce the PGS coefficients rather than increasing the PGS coefficients. When you create the PGS, you include both groups, so the PGS coefficients will be downwards biased relative to the second chunk's true effect.
Some of the heritability would be from the second chunk of genes.
The original discussion was about how personality traits and social outcomes could behave fundamentally differently from biological traits when it comes to genetics. So this isn't necessarily meant to apply to disease risks.
Let's start with the basics: If the outcome $y$ is a linear function of the genes $g$, that is $y = f(g) = \beta^\top g$, then the effect of each gene is given by the gradient of $f$, i.e. $\nabla f = \beta$. (This is technically a bit sketchy since a genetic variant is discrete while gradients require continuity, but it works well enough as a conceptual approximation for our purposes.) Under this circumstance, we can think of genomic studies as finding $\beta$. (This is also technically a bit sketchy because of linkage disequilibrium and such, but it works well enough as a conceptual approximation for our purposes.)
If $f$ isn't a linear function, then there is no constant $\beta$ to find. However, the argument for genomic studies still mostly goes through that they can find $\mathbb{E}[\nabla f(g)]$, it's just that this expression now denotes a weird mishmash effect size that's not very interpretable.
As you observed, if $f$ is almost-linear, for example if $f(g) = e^{\beta^\top g}$, then genomic studies still have good options. The best is probably to measure the genetic influence on $\log f(g)$, as then we get a pretty meaningful coefficient ($\beta$) out of it. (If we measured the genetic influence of $f$ without the logarithm, I think under commonly viable assumptions we would get something like $\mathbb{E}[\nabla f] = \beta\,\mathbb{E}[f]$, but don't cite me on that.)
The trouble arises when you have deeply nonlinear forms such as $f(g) = A + B$, where $A = a_1 a_2 \cdots a_k$ and $B = b_1 b_2 \cdots b_l$ are products of separate genetic factors. If we take the gradient of the logarithm of this, then the chain rule gives us $\nabla \log f = \frac{\nabla A + \nabla B}{A + B}$. That is, the two different mechanisms "suppress" each other, so if $A$ is usually high, then the $B$ term would usually be (implicitly!) excluded from the analysis.
It kind-of applies to the Bernoulli-sigmoid-linear case that would usually be applied to binary diagnoses (but only because of sample size issues and because they usually perform the regression one variable at a time to reduce computational difficulty), but it doesn't apply as strongly as it does to the polynomial case, and it doesn't apply to the purely linear (or exponential-linear) case at all.
If you have a purely linear case, then the expected slope of a genetic variant onto an outcome of interest is proportional to the effect of the genetic variant.
The issue is that in the polynomial case, the effect size of one genetic variant depends on the status of other genetic variants within the same term in the sum. Statistics gives you a sort of average effect size, but that average effect size is only going to be accurate for the people with the most common kind of depression.
It doesn't matter if depression-common is genetic or environmental. Depression-common causes the genetic difference between your cases and controls to be small along the latent trait axis that causes depression-rare. So the effect gets estimated to be not-that-high. The exact details of how it fails depend on the mathematical method used to estimate the effect.
Not right now, I'm on my phone. Though also it's not standard genetics math.
Isn't the derivative of the full variable in one of the multiplicands still noticeable? Maybe it would help if you make some quantitative statement?
Taking the logarithm (to linearize the association) scales the derivative down by the reciprocal of the magnitude. So if one of the terms in the sum is really big, all the derivatives get scaled down by a lot. If each of the terms are a product, then the derivative for the big term gets scaled up to cancel out the downscaling, but the small terms do not.
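To spell out the arithmetic with the sum-of-products example above (writing $f = A + B$ with $A = a_1 \cdots a_k$ and $B = b_1 \cdots b_l$, and treating the factors as continuous and positive):

$$\frac{\partial \log f}{\partial a_1} = \frac{1}{A+B}\cdot\frac{\partial f}{\partial a_1} = \frac{A}{(A+B)\,a_1} \approx \frac{1}{a_1} \quad\text{when } A \gg B, \qquad \frac{\partial \log f}{\partial b_1} = \frac{B}{(A+B)\,b_1} \approx \frac{B}{A\,b_1} \approx 0.$$

So the $1/f$ downscaling gets cancelled for the factors of the dominant term, but not for the factors of the small term.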
I mean, I think depression is heritable, and I think there are polygenic scores that do predict some chunk of this. (From a random google: https://jamanetwork.com/journals/jamapsychiatry/fullarticle/2783096 )
Under the condition I mentioned, polygenic scores will tend to focus on the traits that cause the most common kind of depression, while neglecting other kinds. The missing heritability will be due to missing those other kinds.
It becomes more complex once you take the sum of the product of several things. At that point the log-additive effect of one of the terms in the sum disappears if the other term in the sum is high. If you've got a lot of terms in the sum and the distribution of the variables is correct, this can basically kill the bulk of common additive variance. Conceptually speaking, this can be thought of as "your system is a mixture of a bunch of qualitatively distinct things". Like if you imagine divorce or depression can be caused by a bunch of qualitatively unrelated things.
Couldn't it also end if all the AI companies collapse under their own accumulated technical debt and goodwill lost to propaganda, and people stop wanting to use AI for stuff?
And as a separate note, I'm not sure what the appropriate human reference class for game-playing AIs is, but I challenge the assumption that it should be people who are familiar with games, rather than, say, people picked at random from anywhere on earth.
Should maybe restrict it to someone who has read all the documentation and discussion for the game that exists on the internet.
The defining difference was whether they have contextually activating behaviors to satisfy a set of drives, on the basis that this makes it trivial to out-think their interests. But this ability to out-think them also seems intrinsically linked to them being adversarially non-robust, because you can enumerate their weaknesses. You're right that one could imagine an intermediate case where they are sufficiently far-sighted that you might accidentally trigger conflict with them but not sufficiently far-sighted for them to win the conflicts, but that doesn't mean one could make something adversarially robust under the constraint of it being contextually activated and predictable.
That would be ones that are bounded so as to exclude taking your manipulation methods into account, not ones that are truly unbounded.
That's not something unique to homeostatic agents, though. If a model-based maximizer has some gap between its model and the real world, that gap can be exploited by another agent for its own gain, and that's game over for the maximizer.
I don't think of my argument as model-based vs heuristic-reactive; I mean it as unbounded vs bounded. Like you could imagine making a giant stack of heuristics that makes it de-facto act like an unbounded consequentialist, and you'd have a similar problem. Model-based agents only become relevant because they seem like an easier way of making unbounded optimizers.
If so, I don't think they make particularly great tools even in a non-adversarial context. I think they make pretty decent allies and trade partners though, and certainly better allies and trade partners than consequentialist maximizer agents of the same level of sophistication do (and I also think consequentialist maximizer agents make pretty terrible tools - pithily, it's not called the "Principal-Agent Solution"). And I expect "others are willing to ally/trade with me" to be a substantial advantage.
You can think of LLMs as a homeostatic agent where prompts generate unsatisfied drives. Behind the scenes, there's also a lot of homeostatic stuff going on to manage compute load, power, etc..
Homeostatic AIs are not going to be trading partners, because it is preferable to run them in a mode similar to LLMs rather than as independent agents.
Can you expand on "turn evil"? And also what I was trying to accomplish by making my comms-screening bot into a self-directed goal-oriented agent in this scenario?
Let's say a think tank is trying to use AI to infiltrate your social circle in order to extract votes. They might be sending out bots to befriend your friends to gossip with them and send them propaganda. You might want an agent to automatically do research on your behalf to evaluate factual claims about the world so you can recognize propaganda, to map out the org chart of the think tank to better track their infiltration, and to warn your friends against it.
However, precisely specifying what the AI should do is difficult for standard alignment reasons. If you go too far, you'll probably just turn into a cult member, paranoid about outsiders. Or, if you are aggressive enough about it (say if we're talking a government military agency instead of your personal bot for your personal social circle), you could imagine getting rid of all the adversaries, but at the cost of creating a totalitarian society.
(Realistically, the law of earlier failure is plausibly going to kick in here: partly because aligning the AI to do this is so difficult, you're not going to do it. But this means you are going to turn into a zombie following the whims of whatever organizations are concentrating on manipulating you. And these organizations are going to have the same problem.)
Homeostatic agents are easily exploitable by manipulating the things they are maintaining or the signals they are using to maintain them in ways that weren't accounted for in the original setup. This only works well when they are basically a tool you have full control over, but not when they are used in an adversarial context, e.g. to maintain law and order or to win a war.
As capabilities to engage in conflict increase, methods to resist losing to those capabilities have to get optimized harder. Instead of thinking "why would my coding assistant/tutor bot turn evil?", try asking "why would my bot that I'm using to screen my social circles against automated propaganda/spies sent out by scammers/terrorists/rogue states/etc turn evil?".
Though obviously we're not yet at the point where we have this kind of bot, and we might run into law of earlier failure beforehand.
What if humanity mistakenly thinks that ceding control voluntarily is temporary, when actually it is permanent because it makes the systems of power less and less adapted to human means of interaction?
When asking this question, do you include scenarios where humanity really doesn't want control and is impressed by the irreproachability of GPTs, doing our best to hand over control to them as fast as possible, even as the GPTs struggle and only try in the sense that they accept whatever tasks are handed to them? Or do the GPTs have to in some way actively attempt to wrestle control from or trick humans?
Consider this model.
Suppose the state threatens people into doing the following six things for its citizens:
* Teach the young
* Cure the sick
* Maintain law and order
* Feed, clothe and house people with work injuries
* Feed, clothe and house the elderly
* Feed, clothe and house people with FUBAR agency
(Requesting roughly equally many resources to be put into each of them.)
People vary in how they react to the threats, having basically three actions:
1. Assist with what is asked
2. Develop personal agency for essentially-selfish reasons, beyond what is useful on the margin to handle the six tasks above
3. Using the tokens the government provides to certify the completion of the threatened tasks, put citizens in charge of executing similar tasks for foreigners
The largest scale of assisting with what is asked could be to find areas with powerful economies of scale, for instance optimizing the efficiency with which food and clothing is distributed to citizens. However, economies of scale require homogeneous tasks, which means that the highest extremes of action 1 trade negatively against extremes of action 2, as one develops narrower specialization while neglecting general end-to-end agency.
One cannot do much of action 3 without also doing a lot of action 1, so wealth inequality correlates to a focus on economies of scale.
I'm not sure which of "oppression" and "production" this scenario corresponds to under your model.
Similar to the "production" scenario, the production under this model seems to be "real": for instance, people are getting clothed, and the people who are handsomely rewarded for this are contributing a lot of marginal value. However, unlike the "production" scenario, the wealth doesn't straightforwardly reflect knowing better than others. One might know better with respect to one's specialty, but the flipside is that one has neglected the development of skills outside of that specialty (potentially due to starting out with less innate ability to develop them, e.g. a physical disability or lack of connectedness to tutors).
Meanwhile, the scenario I described here doesn't resemble "oppression" at all, except for the original part where the state threatens people to perform the various government services instead of improving their own agency. I get the impression that your oppression hypothesis is more concerned that people provide a simulacrum of these products to the state than that people are forced to provide a genuine version of these products in the most efficient possible way. I do see a strong case for the simulacrum model, but my comment here seems like a relevant alternative to consider, unless I am missing something.
I feel like the case of bivariate PCA is pretty uncommon. The classic example of PCA is over large numbers of variables that have been transformed to be short-tailed and have similar variance (or which just had similar/small variance to begin with before any transformations). Under that condition, PCA gives you the dimensions which correlate with as many variables as possible.
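As a toy illustration of that (synthetic data of my own making, not from the post): with many standardized variables sharing a common factor, the first principal component correlates with all of them.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5_000, 30

# Many variables driven by one shared factor plus independent noise
shared = rng.normal(size=n)
X = np.column_stack([shared + rng.normal(size=n) for _ in range(p)])
X = (X - X.mean(axis=0)) / X.std(axis=0)        # similar variance, short tails

# PCA via SVD; the first component ends up correlating with every variable
U, S, Vt = np.linalg.svd(X, full_matrices=False)
pc1 = U[:, 0] * S[0]
corrs = np.array([np.corrcoef(pc1, X[:, j])[0, 1] for j in range(p)])
print(np.round(np.abs(corrs).min(), 2))          # ~0.7 for every single variable
```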
> 4) The human brain has many millions of idiosyncratic failure modes. We all display hundreds of them. The psychological disorders that we know of are all extremely rare and extremely precise, so if you ever met two people with the same disorder it would be obvious. Named psychological disorders are the result of people with degrees noticing two people who actually have the same disorder and other people reading their descriptions and pattern-matching noise against it. There are, for instance, 1300 bipolar people (based on the actual precise pattern which inspired the invention of the term) in the world but hundreds of thousands of people have disorders which if you squint hard look slightly like bipolar.
This seems mostly believable, except often (not always, I suspect) people name disorders less precisely than this.
I think the clearest problems in current LLMs are what I discussed in the "People used to be worried about existential risk from misalignment, yet we have a good idea about what influence current AIs are having on the world, and it is basically going fine." section. And this is probably a good example of what you are saying about how "Niceness can be hostile or deceptive in some conditions.".
For example, the issue of outsourcing tasks to an LLM to the point where one becomes dependent on it is arguably an issue of excessive niceness - though not exactly to the point where it becomes hostile or deceptive. But where it then does become deceptive in practice is that when you outsource a lot of your skills to the LLM, you start feeling like the LLM is a very intelligent guru that you can rely on, and then when you come up with a kind of half-baked idea, the RLHF makes the LLM praise you for your insight.
A tricky thing with a claim like "This LLM appears to be nice, which is evidence that it is nice." is what it means for it to "be nice". I think the default conception of niceness is as a general factor underlying nice behaviors, where a nice behavior is considered something like an action that alleviates difficulties or gives something desired, possibly with the restriction that being nice is the end itself (or at least, not a means to an end which the person you're treating nicely would disapprove of).
The major hurdle in generalizing this conception to LLMs is in this last restriction - both in terms of which restriction to use, and in how that restriction generalizes to LLMs. If we don't have any restriction at all, then it seems safe to say that LLMs are typically inhumanly nice. But obviously OpenAI makes ChatGPT so nice in order to get subscribers to earn money, so that could be said to violate the ulterior motive restriction. But it seems to me that this is only really profitable due to the massive economies of scale, so on a level of an individual conversation, the amount of niceness seems to exceed the amount of money transferred, and seems quite unconditional on the money situation, so it seems more natural to think of the LLM as being simply nice for the purpose of being nice.
I think the more fundamental issue is that "nice" is a kind of confused concept (which is perhaps not so surprising considering the etymology of "nice"). Contrast for instance the following cultures:
- Everyone has strong, well-founded opinions on what makes for a good person, and they want there to be more good people. Because of these norms, they collaborate to teach each other skills, discuss philosophy, resolve neuroses, etc., to help each other be good, and that makes them all very good people. This goodness makes them all like and enjoy each other, and thus in lots of cases they conclude that the best thing they could do is to alleviate each other's difficulties and give each other things they desire (even from a selfish perspective, as empowering the others means that the others are better and do more nice stuff).
- Nobody is quite sure how to be good, but everyone is quite sure that goodness is something that makes people increase social welfare/sum of utility. Everyone has learned that utility can be elicited by choices/revealed preferences. They look at what actions others take and at how they seem to feel in order to deduce information about goodness. This often leads to them executing nice behaviors, because nice behaviors consistently make people feel better in the period shortly after they are executed.
They're both "nice", but the niceness of the two cultures has fundamentally different mechanisms with fundamentally different root causes and fundamentally different consequences. Even if they might both be high on the general factor of niceness, most nice behaviors have relatively small consequences, and so the majority of the consequences of their niceness are determined not by the overall level of the general factor of niceness, but instead by the nuances and long tails of their niceness, which differ a lot between the two cultures.
Now, LLMs don't do either of these, because they're not human and they don't have enough context to act according to either of these mechanisms. I don't think one can really compare LLMs to anything other than themselves.
I think the billion-dollar question is, what is the relationship between these two perspectives? For example, a simplistic approach would be to see cognitive visualization as some sort of Monte Carlo version of spreadsheet epistemology. I think that's wrong, but the correct alternative is less clear. Maybe something involving LDSL, but LDSL seems far from the whole story.
Are we missing a notion of "simulacrum level 0"? That is, in order to accurately describe the truth, we need some method of synchronizing on a common language. In the beginning of a human society, this can be basic stuff like pointing at objects and making sounds in order to establish new words. But also, I would be inclined to say that more abstract stuff like discussing the purpose for using the words or planning truth-determination-procedures also go in simulacrum level 0. I'd say the entire discussion of simulacrum levels goes within simulacrum level 0.
Or if simulacrum levels aren't exactly the right term, here's what I have in mind as levels of communication:
- Synchronizing (level 0): establishing and maintaining the meaning of terms for describing the world
- Objective (level 1): truthfully describing the world to the best of one's ability
- Manipulative (level 2): saying known false or unfounded things to exploit others' use of language to control them
- Framing (level 3): the norms for maintaining truth no longer succeed, but they are still in operation and punish or reward people, so people try to act in ways that maintain their reputation despite them not tracking truth
- Activating (level 4): the norms for maintaining truth are no longer in place, but some systems still rely on the old symbolic language as keywords to perform certain behaviors, so language is still used to interface with these systems
Yeah, this seems like a reasonable restatement of my question.
I guess my main issue with this approach is that extrapolating the distribution of activations from a dataset isn't what I'd consider the hard part of alignment. Rather, it would be:
- Detecting catastrophic outputs and justifying their catastrophicness to others. (In particular, I suspect no individual output will be catastrophic on the margin regardless of whether catastrophe will occur. Either the network will consistently avoid giving catastrophic outputs, or it will sufficiently consistently be harmful that localizing the harm to 1 output will not be meaningful.)
- Learning things about the distribution of inputs that cannot be extrapolated from any dataset. (In particular, the most relevant short-term harm I've noticed would be stuff like young nerds starting to see the AI as a sort of mentor and then having their questionable ideas excessively validated by this mentor rather than receiving appropriate pushback. This would be hard to extrapolate from a dataset, even though it is relatively obvious if you interact with certain people. Though whether that counts as "catastrophic" is a complicated question.)
This is kind of vague. Doesn't this start shading into territory like "it's technically not bad to kill a person if you also create another person"? Or am I misunderstanding what you are getting at?
Population ethics is the most important area within utilitarianism, but utilitarian answers to population ethics are all wrong; therefore utilitarianism is an incorrect moral theory.
You can't weasel your way out by calling it an edge-case or saying that utilitarianism "usually" works when really it's the most important moral question. Like all the other big-impact utilitarian conclusions derive from population ethics since they tend to be dependent on large populations of people.
Utilitarianism can at best be seen as like a Taylor expansion that's valid only for questions whose impact on the total population is negligible.
Maybe to expand: In order to get truly good training loss on an autoregressive training objective, you probably need to have some sort of intelligence-like or agency-like dynamic. But much more importantly, you need a truly vast amount of knowledge. So most of the explanation for the good performance comes from the knowledge, not the intelligence-like dynamic.
(Ah, but intelligence is more general, so maybe we'd expect it to show up in lots of datapoints, thereby making up a relatively big chunk of the training objective? I don't think so, for two reasons: 1) a lot of datapoints don't really require much intelligence to predict, 2) there are other not-very-intelligence-requiring things like grammar or certain aspects of vocabulary which do show up in a really big chunk.)
Would "the neural network has learned a lookup table with a compressed version of the dataset and interpolates on that in order to output its answers" count as an explanation of the low dataset loss?
(Note, this phrasing kind of makes it sound too simple. Since the explanations you are seeking presumably don't come with the dataset baked-in as a thing they can reference primitively, presumably the actual formal statement would need to include this entire compressed lookup table. Also, I'm imagining a case where there isn't really a "compression algorithm" because the compression is intimately tied up with the neural network itself, and so it's full of ad-hoc cases.)
Like I guess from an alignment perspective this could still be useful because it would be nice to know to what extent "bag of heuristics" holds, and this is basically a formalization of that. But at the same time, I already roughly speaking (with lots of asterisks, but not ones that seem likely to be addressed by this) expect this to hold, and it doesn't really rule out other dangers (like those heuristics could interact in a problematic way), so it seems kind of like it would just lead to a dead-end from my perspective.
If this is meant to be a weakening of NP vs co-NP, what do you make of the stronger statement that NP = co-NP? As I understand it, most complexity theorists think this is false. Do you have any reason to think that your conjecture is much more likely to hold than NP = co-NP, or do you also think NP = co-NP could hold?
> Maybe I'm missing something, but if we are estimating the $P(X_i)$, how can we also have $X_i$ on the RHS?
These probabilities are used for scoring predictions over the observed variables once the market resolves, so at that point we "don't need" $P(X_i)$ because we already know what $X_i$ is. The only reason we compute it is so we can reward people who got the prediction right long ago before $X_i$ was known.
> And what is the adjustment $+(1-X_i)(1-q_{i,j})$? Why is that there?
$X_i q_{i,j} + (1-X_i)(1-q_{i,j})$ is equivalent to "$q_{i,j}$ if $X_i = 1$; otherwise $1-q_{i,j}$ if $X_i = 0$". It's basically a way to mathematize the "contingency table" aspect.
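As a tiny sketch of that branchless form (hypothetical function name, just for illustration):

```python
def credited_probability(x_i: int, q_ij: float) -> float:
    """Probability the forecast assigned to the outcome that actually happened.

    x_i is the resolved value of X_i (0 or 1); q_ij is the forecast P(X_i = 1).
    """
    return x_i * q_ij + (1 - x_i) * (1 - q_ij)

assert credited_probability(1, 0.75) == 0.75   # event happened; forecast gave it 75%
assert credited_probability(0, 0.75) == 0.25   # event didn't happen; forecast gave that 25%
```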
And they wouldn't be getting any profit. (In the updated comment, I noted it's only the profit that measures your trouble.)
Exports and imports are tricky but very important to take into account here because they have two important properties:
* They are "subtracted off" the GDP numbers in my explanation above (e.g. if you import a natural resource, then that would be considered part of the GDP of the other country, not your country)
* They determine the currency exchange rates (since the exchange rate must equal the ratio of imports to exports, assuming savings and bonds are negligible or otherwise appropriately accounted for) and thereby the GDP comparisons across different countries at any given time
Prices decompose into cost and profit. The profit is determined by how much trouble the purchaser would be in if the seller didn't exist (since e.g. if there's other sellers, the purchaser could buy from those). The cost is determined by how much demand there is for the underlying resources in other areas, so it basically is how much trouble the purchaser imposes on others by getting the item. Most products are either cost-constrained (where price is mostly cost) or high-margin (where price is mostly profit).
GDP is price times transaction volume, so it's the sum of total costs and total profits in a society. The profit portion of GDP reflects the extent to which the economy has monopolized activities into central nodes that contribute to fragility, while the cost portion of GDP reflects the extent to which the economy is resource-constrained.
The biggest costs in a modern economy are typically labor and land, and land is typically just a labor cost by proxy (land in the middle of nowhere is way cheaper, but it's harder to hire people there). The majority of the economy is cost-constrained, so for that majority, GDP reflects underpopulation. The tech sector and financial investment sector have high profit margins, which reflects their tendency to monopolize management of resources.
Low GDP reflects slack. Because of diminishing marginal returns and queuing considerations, ideally one should have some slack, since then there's abundance of resources and easy competition, driving prices down and thus leading to low GDP at high quality of life. However, slack also leads to conflict because of reduced opportunity cost. This conflict can be reduced with policing, but that increases authoritarianism. This leads to a tradeoff between high GDP and high tension (as seen in the west) vs low GDP and high authoritarianism (as seen in the east) vs low GDP and high conflict (as seen in the south).
Hmm... Issue is it also depends on centralization. For a bunch of independent transactions, fragility goes up with the square root of the count rather than the raw count. In practice large economies are very much not independent, but the "troubles" might be.
It's elementary that the derivative approaches zero when one of the inputs to a softmax is significantly bigger than the others. Then when applying the chain rule, this entire pathway for the gradient gets knocked out.
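A quick numpy illustration of that saturation (toy logits of my own choosing):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def softmax_jacobian(z):
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)   # d softmax_i / d z_j

balanced = np.array([1.0, 1.2, 0.8])
saturated = np.array([10.0, 1.2, 0.8])   # one logit dominates

print(np.abs(softmax_jacobian(balanced)).max())    # ~0.24: gradients flow
print(np.abs(softmax_jacobian(saturated)).max())   # ~2.5e-4: the pathway is knocked out
```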
I don't know to what extent it comes up with modern day LLMs. Certainly I bet one could generate a lot of interpretability work within the linear approximation regime. I guess at some point it reduces to the question of why to do mechanistic interpretability in the first place.
Framing: Prices reflect how much trouble purchasers would be in if the seller didn't exist. GDP multiplies prices by transaction volume, so it measures the fragility of the economy.
I would be satisfied with integrated gradients too. There are certain cases where pure gradient-based attributions predictably don't work (most notably when a softmax is saturated) and those are the ones I'm worried about (since it seems backwards to ignore all the things that a network has learned to reliably do when trying to attribute things, as they are presumably some of the most important structure in the network).
I would be curious what you think of [this](https://www.lesswrong.com/posts/TCmj9Wdp5vwsaHAas/knocking-down-my-ai-optimist-strawman).
Ah, I see. I've gone and edited my rebuttal to be more forceful and less hedgy.
Strawman and steelman arguments are the same thing. It's just better to label them "strawman" rather than "steelman" so you don't overestimate their value.
I'm not sure what you mean by "K-means clustering baseline (with K=1)". I would think the K in K-means stands for the number of means you use, so with K=1, you're just taking the mean direction of the weights. I would expect this to explain maybe 50% of the variance (or less), not 90% of the variance.
But anyway, under my current model (roughly Why I'm bearish on mechanistic interpretability: the shards are not in the network + Binary encoding as a simple explicit construction for superposition) it seems about as natural to use K-means as it does to use SAEs, and not necessarily an issue if K-means outperforms SAEs. If we imagine that the meaning is given not by the dimensions of the space but rather by regions/points/volumes of the space, then K-means seems like a perfectly cromulent quantization for identifying these volumes. The major issue is where we go from here.
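To illustrate what I mean by K-means being a reasonable quantization, here's a minimal sketch (purely synthetic data of my own, not anyone's real activations): if the "meaning" lives in a handful of regions of activation space plus noise, K-means recovers those regions directly.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

# Toy "activations": each point is one of a few discrete concepts plus noise,
# i.e. meaning lives in regions of the space rather than in its axes.
n_concepts, dim, n = 8, 64, 4_000
concepts = rng.normal(size=(n_concepts, dim))
labels = rng.integers(n_concepts, size=n)
acts = concepts[labels] + 0.3 * rng.normal(size=(n, dim))

km = KMeans(n_clusters=n_concepts, n_init=10, random_state=0).fit(acts)
print(adjusted_rand_score(labels, km.labels_))   # ~1.0: the clusters recover the concepts
```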