LessWrong 2.0 Reader
Unclear why this is supposed to be a scary result.
"If prompting a model to do something bad generalizes to it being bad in other domains, this is also evidence for the idea that prompting a model to do something good will generalize to it doing good in other domains" - Matthew Barnett
buck on Express interest in an "FHI of the West"

I'll also note that if you want to show up anywhere in the world and get good takes from people on the "how aliens might build AGI" question, Constellation might currently be the best bet (especially if you're interested in decision-relevant questions about this).
eggsyntax on Transformers Represent Belief State Geometry in their Residual Stream

I struggled with the notation on the figures; this comment tries to clarify a few points for anyone else who may be confused by it.
@Adam Shai [LW · GW] please correct me if I got any of that wrong!
Here are the details if you want them after you've understood the rest. Each node label represents some path that could be taken to that node (& not to other nodes), but there can be multiple such paths. For example, n_11 could also be labeled as n_010, because those are both sequences that could have left us in that state. So as we take some path through the Mixed-State Presentation, we build up a label. If we start at n_s and follow the 1 path, we get to n_1. If we follow the 0 path, we reach n_10. If we then follow the 0 path, the next node could be called n_100, reflecting the path we've taken. But in fact any path that goes through 00 will reach that node, so it's just labeled n_00. So initially it seems as though we can just append the symbol emitted by whichever path we take, but often there's some step where that breaks down and you get what initially seems like a totally random different label.
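To make the labeling rule concrete: node labels are really names for equivalence classes of paths, and a natural convention is to label each distinct belief state by the first (shortest) path that reaches it. Here's a minimal sketch of that idea. The `TOY` machine below is a hypothetical three-state example of my own, not the actual process from the post:

```python
from collections import deque

def build_msp(update, start, alphabet="01"):
    """Breadth-first walk over belief states.

    `update(state, symbol)` returns the next belief state (hashable).
    Each distinct reachable state gets labeled by the FIRST (shortest)
    path that reaches it; later paths landing on the same state reuse
    that label -- which is why a path like 010 can end at the node
    already labeled n_11.
    """
    labels = {start: "s"}              # belief state -> node label
    queue = deque([(start, "")])
    while queue:
        state, path = queue.popleft()
        for sym in alphabet:
            nxt = update(state, sym)
            if nxt not in labels:
                labels[nxt] = path + sym
                queue.append((nxt, path + sym))
    return labels

# Hypothetical 3-state toy machine (NOT the process from the post):
TOY = {"A": {"0": "B", "1": "A"},
       "B": {"0": "C", "1": "A"},
       "C": {"0": "C", "1": "C"}}
labels = build_msp(lambda s, c: TOY[s][c], "A")
# State C is first reached via "00", so it is labeled n_00 -- but the
# longer path "100" also lands there, mirroring the relabeling above.
```

Any path that passes through two consecutive 0s ends up at the node labeled by the shortest such path, which is exactly the "append the symbol until it breaks down" behavior described in the comment.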
(I work out of Constellation and am closely connected to the org in a bunch of ways)
I think you're right that most people at Constellation aren't going to seriously and carefully engage with the aliens-building-AGI question, but I think describing it as a difference in culture is missing the biggest factor leading to the difference: most of the people who work at Constellation are employed to do something other than the classic FHI activity of "self-directed research on any topic", so obviously aren't as inclined to engage deeply with it.
I think there also is a cultural difference, but my guess is that it's smaller than the effect from difference in typical jobs.
wassname on Evolution did a surprising good job at aligning humans...to social status

We establish institutions to channel and utilize status-seeking behavior by putting us in status-conscious groups where we have ceremonies and titles that draw our attention to status. This works! Is it more effective to educate a child individually or in a group of peers? Is it easier to lead a solitary soldier or a whole squad? Do people seek a promotion or a pay rise?
From this perspective, our culture and inclination for seeking status have developed in tandem, making it challenging to determine which influences the other more. However, it appears that culture progresses more rapidly than genes, suggesting that culture conforms to our genes, rather than the reverse.
We also waste a lot of effort on status, which seems like a nonfunctional drive. People will compete for high-status professions like musician, streamer, or celebrity, and most will fail, which makes it seem like an unwise investment of time. This seems misaligned, as it's not adaptive.
adam-shai on Why Would Belief-States Have A Fractal Structure, And Why Would That Matter For Interpretability? An Explainer

Thanks John and David for this post! It has really helped people understand the full story. I'm especially interested in thinking more about plans for how this type of work can be helpful for AI safety. I do think the one you presented here is a great one, but I hope there are other potential pathways. I have some ideas, which I'll present in a post soon, but my views on this are still evolving.
adam-shai on Transformers Represent Belief State Geometry in their Residual Stream

Thanks! I'll have more thorough results to share about layer-wise representations of the MSP soon. I've already run some of the analysis concatenating over all layers' residual streams with the RRXOR process, and it is quite interesting. It seems there's a lot more to explore in the relationship between the number of states in the generative model, the number of layers in the transformer, the residual stream dimension, and the token vocab size. All of these (I think) play some role in how the MSP is represented in the transformer. For RRXOR, things look crisper when concatenating.
Even for cases where redundant info is discarded, we should be able to see the distinctions somewhere in the transformer. One thing I'm keen on really exploring is such a case, where we can very concretely follow the path/circuit through which redundant info is first distinguished and then is collapsed.
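The concatenation setup described above can be sketched in a few lines of numpy. This is a hedged illustration only: the arrays here are random stand-ins for real residual-stream activations and belief-state coordinates, and the dimensions are arbitrary; the probe is an ordinary least-squares linear map, which is one common way to read a geometric structure out of activations:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, n_layers, d_model, n_states = 500, 4, 16, 3

# Stand-ins for real data (hypothetical): per-layer residual activations
# and the ground-truth belief-state coordinates for each token position.
resid = rng.normal(size=(n_layers, n_tokens, d_model))
beliefs = rng.dirichlet(np.ones(n_states), size=n_tokens)

# Concatenate across layers: one long feature vector per token,
# so structure distributed over layers can be probed jointly.
X = np.concatenate([resid[l] for l in range(n_layers)], axis=1)
# X has shape (n_tokens, n_layers * d_model)

# Linear probe: least-squares map from concatenated residuals to beliefs.
W, *_ = np.linalg.lstsq(X, beliefs, rcond=None)
pred = X @ W  # projected belief-state coordinates, shape (n_tokens, n_states)
```

With real activations in place of `resid`, plotting `pred` in the belief simplex is how one would check whether the concatenated representation looks "crisper" than any single layer's.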
dagon on Experiment on repeating choices

For me, these topics seem extremely contextual and variable with the situation and specifics of the tradeoff in the moment. For many of them, I do somewhat frequently explore consciously what it might feel like (and for cheap ones, try out) to make a different tradeoff, but those experiments don't generalize well.
I suspect that for the impactful ones (heavily repeated or large), your first two bullet points don't apply - feedback is delayed from the decision, and if harmful, it will be significant.
Still, it's VERY GOOD to be reminded that these decisions are mostly made by type-1 thinking, out of habit or instinct (aka deep/early learning) that deserves reconsideration from time to time.
lost-futures on AI #60: Oh the Humanity

The Devin mishap is a reminder of how tricky it often is for the general public to gauge what's currently possible and what isn't for AI. A lot of people, including myself, assumed the claimed performance was legitimate. No doubt many AI startups like the one behind Devin are waiting for the rising tide of improving foundation models to make their ideas feasible. I wonder how many are engaging in similar deceptive marketing tactics or will do so in the future.
romeostevensit on Raemon's Shortform

CRPGs with a lot of open-world dynamics might work, where the goal is for the person to identify the most important experiments to run in a limited time window in order to min-max certain stats.