LessWrong 2.0 Reader
I'm not saying the frog is boiling, just that the water is warmer than previous related work I had seen had measured.
The results showing the model generalizing to do bad things in other domains reinforce what Matthew states there, and it is valuable to have results that support one's hunches. It is also useful, in general, to know how little scenario framing it takes for the model to infer that it is in its interest to be unhelpful, without being told what exactly its interests are.
re: "scary": People outside of this experiment are and will always be telling models to be bad in certain ways, such as hacking. Hacking LLMs are a thing that exists and will continue to exist. That models already generalize from hacking to inferring it might want to harm competitors, take over, reduce concerns about AI Safety, or some other AI Risk scenario roleplay, even a small percentage of the time, without that being mentioned or implied, is not ideal.
quinn-dougherty on Quinn's Shortform
Thinking about a top-level post on FOMO and research taste
Idk, maybe this shortform is most of the value of the top-level post
ann-brown on What's up with all the non-Mormons? Weirdly specific universalities across LLMs
On the other end of the spectrum, asking cosmo-1b (mostly synthetic training) for a completion, I get `A typical definition of "" would be "the set of all functions from X to Y".`
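For anyone who wants to poke at this themselves, here is a rough sketch of how such a completion could be sampled. Two assumptions not stated in the comment: that the checkpoint is `HuggingFaceTB/cosmo-1b` on the Hugging Face Hub, and that the quoted text splits into a prompt prefix plus a greedy continuation.

```python
# Rough sketch: asking cosmo-1b for a completion, as described above.
# Assumptions (not from the comment): the checkpoint is HuggingFaceTB/cosmo-1b,
# and the quoted sentence is prompt prefix + greedy continuation.
from transformers import pipeline

generator = pipeline("text-generation", model="HuggingFaceTB/cosmo-1b")
prompt = 'A typical definition of "" would be'
out = generator(prompt, max_new_tokens=20, do_sample=False)
print(out[0]["generated_text"])
# The comment reports the continuation '"the set of all functions from X to Y".'
```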
bhauth on hydrogen tube transport
Using sliding electrical contacts for power is fine for current high-speed trains, but it doesn't work as well above 200 m/s.
nina-rimsky on Progress Update #1 from the GDM Mech Interp Team: Full Update
I expect that if you average over more contrast pairs, as in CAA (https://arxiv.org/abs/2312.06681), more of the spurious features in the steering vectors are cancelled out, leading to higher-quality vectors and greater sparsity in the dictionary-feature domain. Did you find this?
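To make the averaging step concrete, here is a minimal sketch of contrast-pair averaging in the style of CAA. `get_residual_activation` is a hypothetical helper (not part of any particular library) standing in for however you read a layer's residual-stream activation at the last token of a prompt.

```python
# Minimal sketch of contrast-pair averaging in the style of CAA
# (https://arxiv.org/abs/2312.06681). `get_residual_activation` is a
# hypothetical helper returning the residual-stream activation at the
# final token position of `prompt` for the given layer.
import torch

def steering_vector(contrast_pairs, get_residual_activation, layer: int) -> torch.Tensor:
    """Average activation differences over (positive, negative) prompt pairs.

    Pair-specific spurious directions tend to cancel in the mean, so adding
    more pairs should leave a cleaner steering direction.
    """
    diffs = [
        get_residual_activation(pos, layer) - get_residual_activation(neg, layer)
        for pos, neg in contrast_pairs
    ]
    return torch.stack(diffs).mean(dim=0)
```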
nora-belrose on Inducing Unprompted Misalignment in LLMs
Unclear why this is supposed to be a scary result.
"If prompting a model to do something bad generalizes to it being bad in other domains, this is also evidence for the idea that prompting a model to do something good will generalize to it doing good in other domains" - Matthew Barnett
buck on Express interest in an "FHI of the West"
I'll also note that if you want to show up anywhere in the world and get good takes from people on the "how aliens might build AGI" question, Constellation might currently be the best bet (especially if you're interested in decision-relevant questions about this).
eggsyntax on Transformers Represent Belief State Geometry in their Residual Stream
I struggled with the notation on the figures; this comment tries to clarify a few points for anyone else who may be confused by it.
@Adam Shai [LW · GW] please correct me if I got any of that wrong!
Here are the details if you still want them after you've understood the rest. Each node label represents some path that could be taken to that node (and not to other nodes), but there can be multiple such paths. For example, n_11 could also be labeled n_010, because both of those sequences could have left us in that state. As we move through the Mixed-State Presentation, we build up a path: if we start at n_s and follow the 1 edge, we get to n_1; if we then follow the 0 edge, we reach n_10. If we follow the 0 edge again, the next node could be called n_100, reflecting the path we've taken, but in fact any path that ends in 00 reaches that node, so it's labeled just n_00. So initially it seems as though we can simply append the symbol emitted by whichever edge we take, but at some step that breaks down, and you get what initially looks like a totally random different label.
(I work out of Constellation and am closely connected to the org in a bunch of ways)
I think you're right that most people at Constellation aren't going to seriously and carefully engage with the aliens-building-AGI question, but I think describing it as a difference in culture misses the biggest factor behind the difference: most of the people who work at Constellation are employed to do something other than the classic FHI activity of "self-directed research on any topic", so they obviously aren't as inclined to engage deeply with it.
I think there is also a cultural difference, but my guess is that it's smaller than the effect from the difference in typical jobs.
wassname on Evolution did a surprising good job at aligning humans...to social status
We establish institutions to channel and utilize status-seeking behavior by putting ourselves in status-conscious groups with ceremonies and titles that draw our attention to status. This works! Is it more effective to educate a child individually or in a group of peers? Is it easier to lead a solitary soldier or a whole squad? Do people seek a promotion or a pay rise?
From this perspective, our culture and our inclination to seek status have developed in tandem, making it challenging to determine which influences the other more. However, it appears that culture evolves more rapidly than genes do, suggesting that culture conforms to our genes rather than the reverse.
We also waste a lot of effort on status, which seems like a nonfunctional drive. People compete for high-status professions like musician, streamer, or celebrity, and most will fail, which makes it look like an unwise investment of time. This seems misaligned, as it's not adaptive.