LessWrong 2.0 Reader
I'm not saying the frog is boiling, just that the water is warmer than previous related work I had seen had measured.
The results showing the model generalizing to do bad things in other domains reinforce what Matthew states there, and it is valuable to have results that support one's hunches. It is also useful, in general, to know how little scenario framing it takes for the model to infer that it is in its interest to be unhelpful, without being told what exactly its interests are.
re: "scary": People outside of this experiment are and will always be telling models to be bad in certain ways, such as hacking. Hacking LLMs are a thing that exists and will continue to exist. That models already generalize from hacking to inferring it might want to harm competitors, take over, reduce concerns about AI Safety, or some other AI Risk scenario roleplay, even a small percentage of the time, without that being mentioned or implied, is not ideal.
quinn-dougherty on Quinn's Shortform
Thinking about a top-level post on FOMO and research taste
Idk, maybe this shortform is most of the value of the top-level post
ann-brown on What's up with all the non-Mormons? Weirdly specific universalities across LLMs
On the other end of the spectrum, asking cosmo-1b (mostly synthetic training) for a completion, I get `A typical definition of "" would be "the set of all functions from X to Y".`
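For anyone who wants to poke at this themselves, here is a rough sketch of how such a completion could be sampled. Two assumptions not stated in the comment: that the checkpoint is `HuggingFaceTB/cosmo-1b` on the Hugging Face Hub, and that the quoted text splits into a prompt prefix plus a greedy continuation.

```python
# Rough sketch: asking cosmo-1b for a completion, as described above.
# Assumptions (not from the comment): the checkpoint is HuggingFaceTB/cosmo-1b,
# and the quoted sentence is prompt prefix + greedy continuation.
from transformers import pipeline

generator = pipeline("text-generation", model="HuggingFaceTB/cosmo-1b")
prompt = 'A typical definition of "" would be'
out = generator(prompt, max_new_tokens=20, do_sample=False)
print(out[0]["generated_text"])
# The comment reports the continuation '"the set of all functions from X to Y".'
```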
bhauth on hydrogen tube transport
Using sliding electrical contacts for power is fine for current high-speed trains, but it doesn't work as well above 200 m/s.
nina-rimsky on Progress Update #1 from the GDM Mech Interp Team: Full Update
I expect that if you average over more contrast pairs, as in CAA (https://arxiv.org/abs/2312.06681), more of the spurious features in the steering vectors are cancelled out, leading to higher-quality vectors and greater sparsity in the dictionary-feature domain. Did you find this?
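To make the averaging step concrete, here is a minimal sketch of contrast-pair averaging in the style of CAA. `get_residual_activation` is a hypothetical helper (not part of any particular library) standing in for however you read a layer's residual-stream activation at the last token of a prompt.

```python
# Minimal sketch of contrast-pair averaging in the style of CAA
# (https://arxiv.org/abs/2312.06681). `get_residual_activation` is a
# hypothetical helper returning the residual-stream activation at the
# final token position of `prompt` for the given layer.
import torch

def steering_vector(contrast_pairs, get_residual_activation, layer: int) -> torch.Tensor:
    """Average activation differences over (positive, negative) prompt pairs.

    Pair-specific spurious directions tend to cancel in the mean, so adding
    more pairs should leave a cleaner steering direction.
    """
    diffs = [
        get_residual_activation(pos, layer) - get_residual_activation(neg, layer)
        for pos, neg in contrast_pairs
    ]
    return torch.stack(diffs).mean(dim=0)
```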
nora-belrose on Inducing Unprompted Misalignment in LLMs
Unclear why this is supposed to be a scary result.
"If prompting a model to do something bad generalizes to it being bad in other domains, this is also evidence for the idea that prompting a model to do something good will generalize to it doing good in other domains" - Matthew Barnett
buck on Express interest in an "FHI of the West"
I'll also note that if you want to show up anywhere in the world and get good takes from people on the "how aliens might build AGI" question, Constellation might currently be the best bet (especially if you're interested in decision-relevant questions about this).
eggsyntax on Transformers Represent Belief State Geometry in their Residual Stream
I struggled with the notation on the figures; this comment tries to clarify a few points for anyone else who may be confused by it.
@Adam Shai [LW · GW] please correct me if I got any of that wrong!
Here are the details if you still want them after you've understood the rest. Each node label represents some path that could be taken to that node (and not to other nodes), but there can be multiple such paths. For example, n_11 could also be labeled n_010, because both of those sequences could have left us in that state. As we move through the Mixed-State Presentation, we build up a path: if we start at n_s and follow the 1 edge, we get to n_1; if we then follow the 0 edge, we reach n_10. If we follow the 0 edge again, the next node could be called n_100, reflecting the path we've taken, but in fact any path that ends in 00 reaches that node, so it's labeled just n_00. So initially it seems as though we can simply append the symbol emitted by whichever edge we take, but at some step that breaks down, and you get what initially looks like a totally random different label.
(I work out of Constellation and am closely connected to the org in a bunch of ways)
I think you're right that most people at Constellation aren't going to seriously and carefully engage with the aliens-building-AGI question, but I think describing it as a difference in culture misses the biggest factor behind the difference: most of the people who work at Constellation are employed to do something other than the classic FHI activity of "self-directed research on any topic", so they obviously aren't as inclined to engage deeply with it.
I think there is also a cultural difference, but my guess is that it's smaller than the effect from the difference in typical jobs.
wassname on Evolution did a surprising good job at aligning humans...to social status
We establish institutions to channel and utilize status-seeking behavior by putting ourselves in status-conscious groups with ceremonies and titles that draw our attention to status. This works! Is it more effective to educate a child individually or in a group of peers? Is it easier to lead a solitary soldier or a whole squad? Do people seek a promotion or a pay rise?
From this perspective, our culture and our inclination to seek status have developed in tandem, making it challenging to determine which influences the other more. However, it appears that culture evolves more rapidly than genes do, suggesting that culture conforms to our genes rather than the reverse.
We also waste a lot of effort on status, which seems like a nonfunctional drive. People compete for high-status professions like musician, streamer, or celebrity, and most will fail, which makes it look like an unwise investment of time. This seems misaligned, as it's not adaptive.