Posts

nielsrolf's Shortform 2023-03-03T00:00:24.281Z

Comments

Comment by nielsrolf on Jemist's Shortform · 2025-04-09T03:45:15.486Z · LW · GW

One intuition against this comes from an analogy to LLMs: the residual stream represents many features, and all neurons participate in representing each feature. But the difference between a larger and a smaller model is mostly that the larger model can represent more features, not that it represents features with greater magnitude.
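
As a toy sketch of this intuition (random unit vectors standing in for feature directions; the numbers are illustrative only):

```python
import torch

def max_interference(d_model: int, n_features: int) -> float:
    """Largest |cosine similarity| between random unit 'feature directions'
    packed into a d_model-dimensional residual stream."""
    directions = torch.randn(n_features, d_model)
    directions = directions / directions.norm(dim=1, keepdim=True)
    sims = (directions @ directions.T).abs()
    sims.fill_diagonal_(0)
    return sims.max().item()

# A wider stream fits the same number of features with less interference,
# i.e. it has room for more features, not features of larger magnitude.
print(max_interference(256, 4096), max_interference(1024, 4096))
```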

In humans, consciousness seems to be most strongly connected to processes in the brain stem rather than the neocortex. Here is a great talk about the topic - the main points are (writing from memory, might not be entirely accurate):

  • humans can lose consciousness or have intense emotions (good and bad) induced through interventions on a very small area of the brain stem. When other, much larger parts of the brain are damaged or missing, humans continue to behave in ways that would lead one to ascribe emotions to them from interaction - for example, they show affection.
  • dopamine, serotonin, and other chemicals that alter consciousness act in the brain stem

If we consider the question from an evolutionary angle, I'd also argue that emotions are more important when an organism has fewer alternatives (like a large brain that does fancy computations). Once better reasoning skills become available, it makes sense to reduce the impact that emotions have on behavior and instead trust the abstract reasoning. In my own experience, the intensity with which I feel an emotion is strongly correlated with how action-guiding it is, and I think as a child I felt emotions more intensely than I do now, which also fits the hypothesis that a greater ability to think abstractly reduces the intensity of emotions.

Comment by nielsrolf on Why White-Box Redteaming Makes Me Feel Weird · 2025-03-20T14:00:05.779Z · LW · GW

I think that's plausible but not obvious. We could imagine different implementations of inference engines that cache at different levels - e.g. the kv-cache, a cache of only the matrix multiplications, a cache of the specific vector products the matrix multiplications are composed of, all the way down to caching just the logic table of a NAND gate. Caching NANDs is basically the same as doing the computation, so if we assume that doing the full computation can produce experiences, then I think it's not obvious which level of caching would stop producing experiences.
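
To illustrate the "levels of caching" point (a toy sketch, not an actual inference engine; the layer function is a hypothetical placeholder):

```python
from functools import lru_cache

# Finest granularity: cache the logic table of a NAND gate. The cache has
# only 4 entries, so looking up is essentially the same as recomputing.
@lru_cache(maxsize=None)
def nand(a: bool, b: bool) -> bool:
    return not (a and b)

# Coarse granularity: cache a whole layer's output keyed on its input,
# loosely analogous to a kv-cache hit that skips a large chunk of work.
@lru_cache(maxsize=None)
def layer(hidden_state: tuple) -> tuple:
    # placeholder for the real computation, ultimately built from many nand() calls
    return hidden_state
```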

Comment by nielsrolf on Why White-Box Redteaming Makes Me Feel Weird · 2025-03-17T13:36:44.699Z · LW · GW

If LLMs are moral patients, there is a risk that every follow-up message causes the model to experience the entire conversation again, such that saying "I'm sorry I just made you suffer" causes more suffering.
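
Mechanically, the worry is just that each follow-up re-sends the whole transcript through the model (a minimal sketch; generate() is a hypothetical stand-in for the actual inference call):

```python
def generate(conversation):
    # hypothetical stand-in for the inference call; it receives
    # (and the model processes) the entire transcript every time
    return "<model reply>"

conversation = [{"role": "user", "content": "<message that causes suffering>"}]
for follow_up in ["I'm sorry I just made you suffer", "Are you okay now?"]:
    reply = generate(conversation)  # the earlier turns go through the model again
    conversation.append({"role": "assistant", "content": reply})
    conversation.append({"role": "user", "content": follow_up})
```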

Comment by nielsrolf on ryan_greenblatt's Shortform · 2025-03-08T12:33:26.166Z · LW · GW

My intuition is that these octopuses would be pretty unlikely to be scheming against us, for the following reasons:

  • dogs are pretty nice, and it's apparently easy to domesticate foxes within very few generations. This suggests that "be nice to humans" is simple for animals to learn, and in particular simpler than "be as intelligent as human AI researchers". So assuming the octopuses first learn to like humans, it's unclear to me why they would later start to dislike humans. Maybe your intuition is that they would later realize they are being treated unfairly, which causes frustration that was not present in earlier generations?
  • If the octopuses speak octopese and English, I think it wouldn't be too hard to get some octopuses to translate for us. For example, we could ask one octopus to communicate some information to a second octopus using octopese. We can check that this worked by asking the second octopus to translate back to English. Now we have a pair of (octopese text, English text) for which we know the translation. Then we ask a third octopus to translate the octopese to English and check its performance on these known examples (a rough code sketch follows after this list).
    It would be possible to scheme against this test if the octopuses are amazingly skilled at coordination, but I suspect human AI researchers would fail this test if a foreign secret service used a similar setup to get AI researchers to translate between foreignese and English. So I think we should assume that such coordination is hard and that a small individual failure rate would likely be enough for the group to fail (we need just one octopus that translates faithfully). (Note that this argument hinges on the existence of a language that is understood by different octopuses, so it doesn't translate 1:1 to AIs.)
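
A rough sketch of the round-trip check I have in mind (to_octopese and to_english are hypothetical stand-ins for "ask this octopus to translate"):

```python
import random

def round_trip_check(octopuses, english_probes, to_octopese, to_english):
    """Score a translator octopus on (octopese, English) pairs whose meaning
    we can verify ourselves via an A -> B round trip."""
    a, b, translator = random.sample(octopuses, 3)
    known_pairs = []
    for english in english_probes:
        octopese = to_octopese(a, english)
        if to_english(b, octopese) == english:  # round trip preserved the meaning
            known_pairs.append((octopese, english))
    correct = sum(to_english(translator, octopese) == english
                  for octopese, english in known_pairs)
    return correct / max(len(known_pairs), 1)
```
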
Comment by nielsrolf on nielsrolf's Shortform · 2024-09-10T20:58:20.377Z · LW · GW

This is for the full models - I simply used both models on Replicate and gave one image and two text labels as input: CLIP, SigLIP

Comment by nielsrolf on nielsrolf's Shortform · 2024-09-10T19:37:34.375Z · LW · GW

Thanks for the link and suggestions!

I quickly tested whether SigLIP or CLIP embeddings show evidence of attribute binding, and they don't (though with only n=1 image): an image of a red cube with a blue sphere, compared with the texts "red cube next to blue sphere" and "blue cube next to red sphere", doesn't get a higher similarity score for the correct label than for the wrong one (CLIP, SigLIP).
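
For reference, the check amounted to roughly the following (a minimal sketch using the openai/clip-vit-base-patch32 checkpoint via Hugging Face transformers rather than the Replicate deployments; the image path is a placeholder):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("red_cube_blue_sphere.png")  # placeholder path
texts = ["a red cube next to a blue sphere", "a blue cube next to a red sphere"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # image-text similarity scores

# If binding were represented, the correct caption should win clearly.
print(dict(zip(texts, logits.softmax(dim=-1)[0].tolist())))
```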

Comment by nielsrolf on nielsrolf's Shortform · 2024-09-10T07:37:34.632Z · LW · GW

Interpretability methods like SAEs often treat models as if their residual stream represents a bag of concepts. But how does this account for binding ("red cube, blue sphere" vs. "red, blue, cube, sphere")? Shouldn't we search for (subject, predicate, object) representations instead?
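
To make the binding question concrete (a toy illustration): two scenes can contain exactly the same bag of concepts while differing in how attributes are bound to objects.

```python
# Two scenes that differ only in attribute binding.
scene_a = [("cube", "color", "red"), ("sphere", "color", "blue")]
scene_b = [("cube", "color", "blue"), ("sphere", "color", "red")]

# A bag-of-concepts view discards the binding and cannot tell them apart...
bag_a = {concept for triple in scene_a for concept in triple}
bag_b = {concept for triple in scene_b for concept in triple}
assert bag_a == bag_b

# ...while (subject, predicate, object) triples still distinguish them.
assert set(scene_a) != set(scene_b)
```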

Comment by nielsrolf on nielsrolf's Shortform · 2024-05-22T12:01:37.598Z · LW · GW

I wonder if anyone has analyzed the success of LoRA finetuning through a superposition lens. The main claim behind superposition is that networks represent D >> d features in their d-dimensional residual stream; with LoRA, we now update only r << d linearly independent features. On the one hand, it seems like this introduces a lot of unwanted correlation between the sparse features, but on the other hand, networks seem to be good at dealing with this kind of gradient noise. Should we be more or less surprised that LoRA works if we believe that superposition is true?
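
For concreteness, a minimal sketch of the setup I mean (shapes are illustrative only): the LoRA update can only move the weight within an r-dimensional subspace, which all of the D >> d represented features have to share.

```python
import torch

d, r = 4096, 8                   # residual width d, LoRA rank r << d
W = torch.randn(d, d)            # frozen pretrained weight
A = torch.randn(r, d) * 0.01     # trainable down-projection
B = torch.zeros(d, r)            # trainable up-projection (zero-init, so the update starts at 0)

x = torch.randn(d)
# Effective weight is W + B @ A: gradients only change W along r directions.
y = (W + B @ A) @ x
```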

Comment by nielsrolf on Refusal in LLMs is mediated by a single direction · 2024-04-27T22:39:04.556Z · LW · GW

Have you tried discussing the concepts of harm or danger with a model that can't represent the refusal direction?

I would also be curious how much the refusal direction differs when computed from a base model vs. from an HHH model - is refusal a new concept, or do base models mostly learn a ~harmful direction that turns into a refusal direction during finetuning?
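
Roughly the comparison I have in mind (a sketch; the activation tensors below are random placeholders for cached residual-stream activations at the same layer/position in both models):

```python
import torch
import torch.nn.functional as F

def refusal_direction(harmful_acts: torch.Tensor, harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means direction; activations are shaped [n_prompts, d_model]."""
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()

# Placeholders for activations of the same prompts in the base and HHH models.
d_model = 4096
base_harmful, base_harmless = torch.randn(512, d_model), torch.randn(512, d_model)
chat_harmful, chat_harmless = torch.randn(512, d_model), torch.randn(512, d_model)

dir_base = refusal_direction(base_harmful, base_harmless)
dir_chat = refusal_direction(chat_harmful, chat_harmless)

# A cosine similarity near 1 would suggest the HHH model reuses a ~harmful
# direction that already exists in the base model.
print(F.cosine_similarity(dir_base, dir_chat, dim=0).item())
```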

Cool work overall!

Comment by nielsrolf on nielsrolf's Shortform · 2024-03-10T02:03:49.266Z · LW · GW

What does it mean to align an LLM?

It is very clear what it means to align an agent:

  • an agent acts in an environment
  • if an agent consistently acts to navigate the state of the environment into a certain regime, we can call this a “goal of the agent”
  • if that goal corresponds to states of the environment that we value, the agent is aligned

It is less clear what it means to align an LLM:

  • Generating words (or other tokens) can be viewed as actions. Aligning an LLM then means: make it say nice things.
  • Generating words can also be seen as thoughts. An LLM that allows us to easily build aligned agents with the right mix of prompting and scaffolding could be called aligned.
  • One definition that a friend proposed is: an LLM is aligned if it can never serve as the cognition engine for a misaligned agent - this interpretation most strongly emphasizes the “harmlessness” aspect of LLM alignment 

Probably, we should have different alignment goals for different deployment cases: LLM assistants should say nice and harmless things, while agents that help automate alignment research should be free to think anything they deem useful, and reason about the harmlessness of various actions “out loud” in their CoT, rather than implicitly in a forward pass.

Comment by nielsrolf on nielsrolf's Shortform · 2024-03-10T01:49:38.219Z · LW · GW

What are emotions?

I want to formulate what emotions are from the perspective of an observer that has no emotions itself. Emotions have a close relationship with consciousness, and similar to the hard problem of consciousness, it is not obvious how to know what another mind feels like. It could be that one person perceives emotions 1000x as strongly as another person, but the two different emotional experiences lead to exactly the same behavior. Or it could be that one species perceives emotions on a different intensity scale than another. This creates a challenge for utilitarians: if you want to maximize the happiness of all beings in the universe, you need a way of aggregating happiness between beings.

So, how can we approach this question? We can start by trying to describe the observable properties of emotions as well as we can:

  1. An observable property of consciousness is that humans discuss consciousness, and the same is true for emotions. More specifically, we often talk about how to change or process emotions, because this is something we want to do and because we can affect our emotions through conscious thought.
  2. Emotions occur in animals (including humans) and likely arose through evolution. It is therefore likely that emotions had a positive effect on the reproductive fitness of some animals.
  3. Emotions affect the behavior and thinking of the mind that experiences them. Concrete examples are:
    • effect on overall activity: tired, sad, or peaceful feelings cause the experiencer to be less active, while awake, stressed, or excited feelings cause them to be more active
    • effect on the interpretation of others: angry or grumpy feelings cause the experiencer to assess others as more evil, while happy feelings cause them to assess others as more good
    • effect on short-term goals: feeling hungry makes the experiencer want to eat, feeling sleepy makes them want to sleep, feeling horny makes them want to have sex, etc.
  4. Emotions appear to be correlated with changes in an animal's expected fitness - the worst types of pain accompany life-threatening injuries (where expected fitness drops drastically), and the greatest types of happiness are of lower magnitude because fitness rarely increases suddenly.
  5. Emotions are closest to what we optimize for
    • we want to be happy, excited, feel love, etc., and avoid feeling pain, boredom, humiliation, etc.
    • other goals are usually instrumental to experiencing these feelings
  6. Emotions are not downstream of abstract reasoning, but abstract reasoning can affect emotions: children, for example, experience emotions before they are able to analytically reflect on them. Emotional responses also happen faster than analytical thoughts.

My intermediate conclusion is that emotions likely evolved because they are computationally efficient proxies for how good the current state is and how to spend energy. They can be viewed as latent variables that often yielded fitness-increasing behavior, whose impact extends beyond the situations in which they actually prove useful - for example, when I get grumpy because I’m hungry.

If this is true, emotions are more useful when a being is less capable of abstract reasoning, so less intelligent animals might experience emotions more strongly rather than more weakly. This fits with the observation that intelligent humans can reduce their suffering via meditation, or that pets seem to suffer more from getting a vaccine than adult humans do. However, this is a bit of a leap, and I have low confidence in it.

Regarding digital sentience, this theory would predict that emotions are more likely to emerge when optimization pressure exists that lets an AI decide how to spend energy. This is not the case in language model pretraining, but is the case in most forms of RL. Again, I am not very confident in this conclusion.

Comment by nielsrolf on OpenAI's Sora is an agent · 2024-02-17T11:05:37.272Z · LW · GW

I think calling Sora a simulator is the right frame - the model itself simulates, and since agents can be part of a simulation, it is possible to elicit agentic behavior via prompting and parsing.

Comment by nielsrolf on nielsrolf's Shortform · 2023-03-03T00:00:29.137Z · LW · GW

Comment by nielsrolf on A Longlist of Theories of Impact for Interpretability · 2022-04-08T20:19:52.790Z · LW · GW

I think if we notice that a model is not completely aligned but mostly useful, there will be at least one party deploying it. We can already see this with DALL-E, which mirrors human biases (nurse = female; CEO, lawyer, evil person = male) and is slowly being rolled out nonetheless. Therefore I believe that noticing misalignment is not, by itself, enough to prevent it, and we should put our focus on making it easy to create aligned AI. This is an argument for 9, 18, and 19 being relatively more important.