fiora-from-rosebloom

Posts

Another argument against maximizer-centric alignment paradigms 2024-09-22T07:28:27.856Z

Comments

Comment by Fiora from Rosebloom on Sapphire Shorts · 2024-12-11T03:03:53.607Z · LW · GW

in olivia's case, it seems like the algorithm she's running lately is roughly to try and make herself out as an authority to basically all of the late teens/early 20s transfem rationalists in a particular social circle. (we sometimes half-jokingly call ourselves lgbtescreal, a name due to the beloved user tetraspace). i've heard it claimed by someone else in this community that olivia has bragged about achieving mod status in various discord servers we've been in, and derisively referred to us as "the 19 year olds" who she was nonetheless trying to gain influence over. i think olivia roughly just wants to be seen as powerful and influential, and (being transfem herself, and having a long history in the core rationality community) has an easy time influencing young rationalist transfems in particular.

Comment by Fiora from Rosebloom on Sapphire Shorts · 2024-12-07T08:48:42.218Z · LW · GW

my view is that this particular vassarite is probably a fair amount more harmful than most, though i don't actually know any others very closely

Comment by Fiora from Rosebloom on OpenAI Email Archives (from Musk v. Altman and OpenAI blog) · 2024-11-18T01:58:06.186Z · LW · GW

does anyone have other examples of documents like this, records of communications that shaped the world? it feels somewhat educational, seeing what it looks like when powerful people are doing the things that make them powerful.

Comment by Fiora from Rosebloom on [deleted post] 2024-09-30T00:04:41.097Z

that's a really good way of putting it yeah, thanks.

and then, there's also something in here about how in practice we can approximate the evolution of our universe with our own abstract predictions well enough to understand the process by which the physical substrate gets tripped up by a self-reference paradox. which is the explanation for why we can "see through" such paradoxes.

Comment by Fiora from Rosebloom on Simulators · 2024-09-20T05:32:44.327Z · LW · GW

If one were to distinguish between "behavioral simulators" and "procedural simulators", the problem would vanish. Behavioral simulators imitate the outputs of some generative process; procedural simulators imitate the details of the generative process itself. When they're working well, base models clearly do the former, even as I suspect they don't do the latter.

Comment by Fiora from Rosebloom on Frame Control · 2024-05-20T04:45:54.887Z · LW · GW

I've thought about this post a lot, and I think one thing I might add to its theoretical framework is a guess as to why this particular pattern of abuse shows up repeatedly. The post mentions that you can't look at intent when diagnosing frame control, but that's mostly in terms of intentions the frame controller is willing to admit to themself; there's still gonna be some confluence of psychological factors that makes frame control an attractor in personality-space, even if frame controllers themselves (naturally) have a hard time introspecting about it. My best guess is that some of the core tactics of frame control, for example taking advantage of people's heuristics about what's valuable in social behavior to sneak harmful behavior under the rug, amount to a strategy for elevating the frame controller's self-esteem, which they 1) stumble into by random chance or imitation of other frame controllers or whathaveyou, 2) find rewarding enough to compel them to keep doing it, and 3) never get called out on because people are generally scared of questioning the virtues the frame controller is relying on to elevate their social standing. (This is also one reason it'd be hard for frame controllers to introspect about how they got into the habit of using the strategy to start with, in addition to the fact that their reliance on this strategy becomes a pillar of their self-esteem.)
A concrete example of a virtue-heuristic a frame controller might take advantage of is the idea that people should be honest; I once dealt with a frame controller who subtly made people feel bad all the time for not highlighting all the tiny ways they were constantly signaling to each other in conversations, and got away with it because the idea that being honest is good is taken as a sacred virtue (because in many/most contexts it produces value!), even though subtle signaling in particular is so utterly pervasive and foundational to how humans relate to each other socially that to aspire to never slip it under the rug is not only impossible but very stressful and humiliating. Other behaviors we treat as like, sacredly virtuous can be used as smokescreens for attempts to gain status by pointing out behavior that's actually reasonable but has a faint unvirtuous aspect too; honesty isn't the only sacred virtue here. The important thing is just that it's the type of thing people feel uncomfortable with claiming to be bad, actually, thereby keeping both frame controllers and their victims from analyzing what's going on, and keeping the frame controller in a positive feedback loop wrt their abusive behavior.

Comment by Fiora from Rosebloom on Transformers Represent Belief State Geometry in their Residual Stream · 2024-05-04T02:31:23.956Z · LW · GW

We look in the final layer of the residual stream and find a linear 2D subspace where activations have a structure remarkably similar to that of our predicted fractal. We do this by performing standard linear regression from the residual stream activations (64 dimensional vectors) to the belief distributions (3 dimensional vectors) which are associated with them in the MSP.

Naive technical question, but can I ask for a more detailed description of how you go from the activations in the residual stream to the map you have here? Or like, can someone point me in the direction of the resources I'd need to understand? I know that the activations in any given layer of an NN can be interpreted as a vector in a space with the same number of dimensions as there are neurons in that layer, but I don't know how you map that onto a 2D space, esp. in a way that maps belief states onto this kind of three-pole system you've got with the triangles here.
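
For what it's worth, here is a rough sketch of how a mapping like the one in the quoted passage could work. It is not the authors' code, just a minimal illustration under stated assumptions: the activations and belief distributions below are synthetic stand-ins, and the names `fit_linear_map`, `predict_beliefs`, `simplex_to_2d`, and `TRIANGLE` are made up for this example. The idea is to fit an ordinary least-squares map from 64-dimensional activations to 3-dimensional belief distributions. Since the three probabilities sum to 1, each prediction lies on a triangle (a 2-simplex), so it can be drawn in 2D by treating the probabilities as weights on the triangle's three corners.

```python
import numpy as np


def fit_linear_map(activations, beliefs):
    """Least-squares affine map from activations (n, d) to beliefs (n, 3)."""
    # Append a column of ones so the fit includes an intercept term.
    X = np.hstack([activations, np.ones((activations.shape[0], 1))])
    W, *_ = np.linalg.lstsq(X, beliefs, rcond=None)
    return W  # shape (d + 1, 3)


def predict_beliefs(activations, W):
    """Apply the fitted affine map to a batch of activations."""
    X = np.hstack([activations, np.ones((activations.shape[0], 1))])
    return X @ W


# Corners of an equilateral triangle, one per hidden state (hypothetical layout).
TRIANGLE = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])


def simplex_to_2d(beliefs):
    """Treat each belief vector as barycentric weights on the triangle corners."""
    return beliefs @ TRIANGLE


# Synthetic stand-in data (an assumption, not real model activations):
# beliefs are embedded linearly into 64 dimensions with a little noise.
rng = np.random.default_rng(0)
true_beliefs = rng.dirichlet(np.ones(3), size=500)  # (500, 3), rows sum to 1
embed = rng.normal(size=(3, 64))
activations = true_beliefs @ embed + 0.01 * rng.normal(size=(500, 64))

W = fit_linear_map(activations, true_beliefs)
pred = predict_beliefs(activations, W)
points_2d = simplex_to_2d(pred)  # (500, 2), ready for a scatter plot
```

A belief of total certainty in one hidden state, e.g. `[1, 0, 0]`, lands exactly on one corner of the triangle, which is where the "three-pole" look comes from; mixed beliefs land in the interior.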
