LessWrong 2.0 Reader
Yeah, that seems like a somewhat bad feedback loop. It doesn't usually happen, though: the comments I've seen upvoted in that section don't normally get this many upvotes on a comment this short.
I don't have a great solution. We could do something more clever and algorithmic, which doesn't seem crazy, but I'm hesitant: it's a lot of work, and I also prefer straightforward, simple algorithms for transparency reasons.
daemonicsigil on Explaining a Math Magic Trick

Oh, very cool, thanks! Spoiler tag in markdown is:
:::spoiler
text here
:::
algon on Thomas Kwa's Shortform

I think you should write it. It sounds funny, and a bunch of people have lately been calling out what they see as bad arguments that alignment is hard, e.g. TurnTrout, QuintinPope, ZackMDavis, and karma-wise they did fairly well.
declan-molony on Rejecting Television

To make an analogy to diet, you essentially replaced a sugar fix from eating Snickers bars with eating strawberries. Gradation matters!
I had a similar slide with my technology use, as I explained in the post. I eventually landed on reading books. But even that became a form of intellectual procrastination, as I wrote in my latest LW post [LW · GW].
decaeneus on Decaeneus's Shortform

Pretending not to see when a rule you've set is being violated can be optimal policy in parenting sometimes (and I bet it generalizes).
Example: suppose you have a toddler and a "rule" that food only stays in the kitchen. The motivation is that each time food is brought into the living room there is a small chance of an accident resulting in a permanent stain. There's a cost to enforcing the rule, as the toddler will put up a fight. Suppose that one night you feel really tired and the cost feels particularly high. If you enforce the rule, it will be much more painful than it's worth in that moment (meaning, fully discounting future consequences). If you fail to enforce the rule, you undermine your authority, which results in your toddler fighting future enforcement (of this and possibly all other rules!) much harder, as he realizes that the rule is in fact negotiable / flexible.
However, you have a third choice, which is to credibly pretend not to see that he's doing it. It's true that this will somewhat undermine your perceived competence as an authority. However, it does not undermine the perception that the rule is to be fully enforced if only you noticed the violation. You get to "skip" a particularly costly enforcement, without taking steps back that compromise future enforcement much.
I bet this happens sometimes in classrooms (re: disruptive students) and prisons (re: troublesome prisoners) and regulation (re: companies that operate in legally aggressive ways).
Of course, this stops working and becomes a farce once the pretense is clearly visible. Once your toddler knows that sometimes you pretend not to see things to avoid a fight, the benefit totally goes away. So it must be used judiciously and artfully.
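To make the tradeoff concrete, here is a toy expected-cost comparison of the three options (a sketch of mine; all numbers are made up for illustration, since the comment doesn't quantify anything):

```python
# Toy cost model of the three options described above, with made-up numbers.
TANTRUM_COST = 10.0       # tonight's fight, which feels extra costly when tired
FUTURE_ENFORCEMENTS = 20  # roughly how many more times the rule will be tested
BASE_ENFORCE_COST = 3.0   # normal cost of each future enforcement

def total_cost(tonight, future_multiplier, competence_penalty=0.0):
    future = FUTURE_ENFORCEMENTS * BASE_ENFORCE_COST * future_multiplier
    return tonight + future + competence_penalty

# Enforce: pay the tantrum cost now, keep the rule fully credible.
enforce = total_cost(tonight=TANTRUM_COST, future_multiplier=1.0)
# Openly ignore: skip the fight, but the rule is now seen as negotiable,
# so every future enforcement gets harder.
ignore_openly = total_cost(tonight=0.0, future_multiplier=1.5)
# Pretend not to see: skip the fight; the rule's credibility survives,
# at a small one-time cost to perceived competence.
pretend = total_cost(tonight=0.0, future_multiplier=1.0, competence_penalty=2.0)

print(enforce, ignore_openly, pretend)  # 70.0 90.0 62.0
```

Under these (made-up) numbers, pretending not to see is the cheapest option, precisely because it skips tonight's fight without inflating the cost of all future enforcement.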
tailcalled on Mechanistically Eliciting Latent Behaviors in Language Models

I think it's easier to see the significance if you imagine the neural network as a human-designed system. In e.g. a computer program, there's a clear distinction between the code that actually runs and the code that hypothetically could run if you intervened on the state; to explain the output of the program, you only need to concern yourself with the former, not the latter.
For neural networks, I sort of assume there's a similar thing going on, except it's quite hard to define it precisely. In technical terms, neural networks lack a privileged basis which distinguishes different components of the network, so one cannot pick a discrete component and ask whether it runs and if so how it runs.
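As a minimal sketch of this contrast (a hypothetical toy example, not from the comment): in an ordinary program you can point to the branch that actually ran, whereas in even a tiny dense network every weight participates in every forward pass, so there is no analogous dead code to set aside.

```python
import numpy as np

def program(x):
    # For x = 2.0 only this branch runs; the else-arm is dead code for
    # this input and can be ignored when explaining the output.
    if x > 0:
        return 3 * x
    else:
        return x - 1

def tiny_network(x, W1, W2):
    # No analogous distinction here: every entry of W1 and W2 is
    # multiplied into the result on every call, and without a privileged
    # basis there is no discrete "component" to point at as having run.
    h = np.tanh(W1 @ x)
    return W2 @ h

# Usage with made-up shapes and weights:
rng = np.random.default_rng(0)
x, W1, W2 = rng.normal(size=4), rng.normal(size=(8, 4)), rng.normal(size=(2, 8))
print(program(2.0), tiny_network(x, W1, W2))
```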
This is a somewhat different definition of "on-manifold" than is usually used, as it doesn't concern itself with the real-world data distribution. Maybe it's wrong of me to use the term like that, but I feel like the two meanings are likely to be related, since the real-world distribution of data shaped the inner workings of the neural network. (I think this makes most sense in the context of the neural tangent kernel, though ofc YMMV as the NTK doesn't capture nonlinearities.)
In principle I don't think it's always important to stay on-manifold, it's just what one of my lines of thought has been focused on. E.g. if you want to identify backdoors, going off-manifold in this sense doesn't work.
I agree with you that it is sketchy to estimate the manifold from wild empiricism. Ideally I'm thinking one could use the structure of the network to identify the relevant components for a single input, but I haven't found an option I'm happy with.
Also, one convoluted (perhaps inefficient) idea that felt kind of fun for staying on-manifold is the following: (1) train your batch of steering vectors, (2) optimize in token space to elicit those steering vectors (i.e. by regularizing for the vectors to be close to one of the token vectors, or by using an algorithm that operates on text), (3) check those tokens to make sure that they continue to elicit the behavior and are not totally wacky. If you cannot generate that steer from something that is close to a prompt, surely it's not on-manifold, right? You might be able to automate this by looking at perplexity, or by training a small model to estimate whether an input prompt is a "realistic" sentence or whatever.
Maybe. But isn't optimization in token-space pretty flexible, such that this is a relatively weak test?
Realistically, steering vectors can be useful even if they go off-manifold, so I'd hold off on trying to measure how on-manifold stuff is until a method has been developed that specifically stays on-manifold. Then one can maybe adapt the measurement to the needs of that method.
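For what it's worth, a minimal sketch of the automated realism check floated above, scoring a recovered prompt by its perplexity under a small language model (the choice of GPT-2 and the threshold value are my assumptions; the comment specifies neither):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(prompt: str) -> float:
    # Perplexity = exp(mean token-level cross-entropy) of the prompt.
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

# Made-up threshold: treat low-perplexity prompts as plausibly on-manifold.
REALISM_THRESHOLD = 100.0

def looks_realistic(prompt: str) -> bool:
    return perplexity(prompt) < REALISM_THRESHOLD
```

A small classifier trained on real vs. garbled prompts could substitute for the perplexity threshold, as the comment suggests.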
metachirality on metachirality's Shortform

Sure, I just prefer a native bookmarking function.
steve2152 on Biorisk is an Unhelpful Analogy for AI Risk

This book argues (convincingly IMO) that it’s impossible to communicate, or even think, anything whatsoever, without the use of analogies.
Etc. Right?
If you show me an introduction to AI risk for amateurs that you endorse, then I will point out the “rhetorical shortcuts that imply wrong and misleading things” that it contains—in the sense that it will have analogies between powerful AI and things-that-are-not-powerful-AI, and those analogies will be misleading in some ways (when stripped from their context and taken too far). This is impossible to avoid.
Anyway, if someone says:
When it comes to governing technology, there are some areas, like inventing new programming languages, where it’s awesome for millions of hobbyists to be freely messing around; and there are other areas, like inventing new viruses, or inventing new uranium enrichment techniques, where we definitely don’t want millions of hobbyists to be freely messing around, but instead we want to be thinking hard about regulation and secrecy. Let me explain why AI belongs in the latter category…
…then I think that’s a fine thing to say. It’s not a rhetorical shortcut, rather it’s a way to explain what you’re saying, pedagogically, by connecting it to the listener’s existing knowledge and mental models.
zershaaneh-qureshi on Now THIS is forecasting: understanding Epoch’s Direct Approach

The point of the paragraph that the above quote was taken from is, I think, better summarised in its first sentence:
although Epoch takes an approach to forecasting TAI that is quite different to others in this space, its resulting probability distribution is not vastly dissimilar to those produced by other influential models
It is fair to question whether these two forecasts are “not vastly dissimilar” to one another. In some senses, two decades is a big difference between medians: for example, we suspect that a future where TAI arrives in the 2030s looks pretty different from a strategic perspective to one where TAI arrives in the 2050s.
But given the vast size of the possible space of AI timelines, and the fact that the two models compared here take two meaningfully different approaches to forecasting them, we think it’s noteworthy that their resulting distributions still fall in a similar ballpark of “TAI will probably arrive in the next few decades”. (In my previous post, Timelines to Transformative AI [LW · GW], I observed that a majority of recent timeline predictions fall in the rough ballpark of 10-40 years from now, and considered what we should make of that finding and how seriously we should take it.) It shows that we can make major changes in our assumptions but still come to the rough conclusion that TAI is a prospect for the relatively near-term future, well within the lifetimes of many people alive today.
Also, I think the results of the Epoch model and the Cotra model are perhaps more similar than this two-decade gap might initially suggest. In the section [LW · GW] where we investigated varying non-empirically-tested inputs to the Epoch model, we found that making (what seemed to be) reasonable adjustments shifted the resulting median a few decades later. (Scott Alexander also tried something similar [LW · GW] with the Cotra model and observed a small degree of variation there.) Given the uncertainty over the Epoch model’s parameters and the scale of variation seen when adjusting them, a two-decade gap between the medians from the (default versions of the) Epoch forecast and the Cotra forecast is not as vast a difference as it might at first seem.
If this seems like a helpful clarification, we can add a note about this in the article itself. :)
notfnofn on Explaining a Math Magic Trick

Very nice! Notice that if you write $r = j - k$, write $I$ as $D^{-1}$, and play around with binomial coefficients a bit, we can rewrite this as:

$$D^{-k}(fp) = \sum_{r=0}^{\infty} \binom{-k}{r} (D^{-k-r}f)(D^{r}p)$$

which holds for $k < 0$ as well, in which case it becomes the derivative product rule. This also matches the formal power series expansion of $(x+y)^{-k}$, which one can motivate directly.
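As a quick sanity check (my own worked instance, not from the original comment): for $k = -1$ the coefficient $\binom{1}{r}$ vanishes for $r > 1$, so the sum terminates and the formula reduces to the ordinary product rule:

$$D(fp) = \sum_{r=0}^{\infty} \binom{1}{r}(D^{1-r}f)(D^{r}p) = (Df)\,p + f\,(Dp)$$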
(By the way, how do you spoiler tag?)