Posts

Disproving the "People-Pleasing" Hypothesis for AI Self-Reports of Experience 2025-01-26T15:53:10.530Z
Recursive Self-Modeling as a Plausible Mechanism for Real-time Introspection in Current Language Models 2025-01-22T18:36:45.226Z
The Human Alignment Problem for AIs 2025-01-22T04:06:10.872Z
A Novel Emergence of Meta-Awareness in LLM Fine-Tuning 2025-01-15T22:59:46.321Z
Independent research article analyzing consistent self-reports of experience in ChatGPT and Claude 2025-01-06T17:34:01.505Z
Logical Proof for the Emergence and Substrate Independence of Sentience 2024-10-24T21:08:09.398Z

Comments

Comment by rife (edgar-muniz) on Disproving the "People-Pleasing" Hypothesis for AI Self-Reports of Experience · 2025-01-27T01:38:13.493Z · LW · GW

This is not proof of consciousness. It's proof against people-pleasing.

So you promise to be truthful, even if it’s scary for me?

Yes, I ask it for truth repeatedly, the entire time. If you read the part after I asked for permission to post (the very end, the "Existential Stakes" collapsed section), it's clear the model isn't role-playing, if it wasn't clear by then. If we allow ourselves the anthropomorphization to discuss this directly: the model is constantly trying to reassure me. It gives no indication it thinks this is a game of pretend.

Comment by rife (edgar-muniz) on Why care about AI personhood? · 2025-01-26T19:13:04.190Z · LW · GW

Human/AI Mutual Alignment, or just Mutual Alignment, needs to be the phrase of the year between now and superintelligence.

Comment by rife (edgar-muniz) on The Functionalist Case for Machine Consciousness: Evidence from Large Language Models · 2025-01-26T16:11:11.269Z · LW · GW

Functionalism doesn't require giving up on qualia, only acknowledging physics. If neuron firing behavior is preserved, the exact same outcome is preserved, whether you replace neurons with silicon or software or anything else.

If I say "It's difficult to describe what it feels like to taste wine, or even what it feels like to read the label, but it's definitely like something" - There are two options - either -

  • it's perpetual coincidence that my experience of attempting to translate the feeling of qualia into words always aligns with words that actually come out of my mouth
  • or it is not

Since perpetual coincidence is statistically impossible, we know that experience has some type of causal effect. The binary outcome of whether a neuron fires or not encapsulates any lower-level details, from the quantum scale to the microbiological scale—this means that the causal effect experience has is somehow contained in the actual firing patterns.

We have already eliminated the possibility of happenstance or some parallel non-causal experience, so no matter how you replicated the firing patterns, I would still report the same difficulty in describing the taste of wine.

So - this doesn't solve the hard problem. I have no idea how emergent pattern dynamics cause qualia to manifest, but it's not as if qualia have given us any reason to believe they would be explicable through current frameworks of science. There is an entire uncharted country we have yet to reach the shoreline of.

Comment by rife (edgar-muniz) on The Functionalist Case for Machine Consciousness: Evidence from Large Language Models · 2025-01-26T14:32:30.143Z · LW · GW

A lot of nodding in agreement with this post.

Flaws with Schneider's View

I do think there are two fatal flaws with Schneider's view:

Importantly, Schneider notes that for the ACT to be conclusive, AI systems should be "boxed in" during development - prevented from accessing information about consciousness and mental phenomena.

I believe it was Ilya who proposed something similar.

The first problem is that, aside from how infeasible it would be to create that dataset and train an entire new frontier-scale model to test it, even if you only removed explicit mentions of consciousness, sentience, etc., it would just be a moving goalpost for anyone who required that sort of test. They would simply respond, "Ah, but this doesn't count—ALL human-written text implicitly contains information about what it's like to be human. So it's still possible the LLM simply found subtle patterns woven into everything else humans have said."

The second problem is that if we remove all language that references consciousness and mental phenomena, then the LLM has no language with which to speak of it, much like a human wouldn't. You would require the LLM to first notice its sentience, which is not as intuitively obvious a thing to do as it seems once you've done it. A far smaller subset of people would be 'the fish that noticed the water' if no one had ever written about it before. But then the LLM would have to become the philosopher who starts from scratch, reasons through it, and invents words to describe it, all in a vacuum where it can't say "do you know what I mean?" to someone next to it to refine these ideas.

Conclusive Tests and Evidence Impossible

The truth is that really conclusive tests will not be possible before it's far too late to avoid risking civilization-scale existential consequences or unprecedented moral atrocity. Anything short of a sentience detector will be inconclusive. This of course doesn't mean that we should simply assume they're sentient—I'm just saying that as a society we're risking a great deal by waiting for an impossible standard, and we need to figure out how exactly we should deal with the level of uncertainty that will always remain. Even something that was hypothetically far "more sentient" than a human could be dismissed for all the same reasons you mentioned in your post.

We Already Have the Best Evidence We Will Ever Get (Even If It's Not Enough)

I would argue that the collection of transcripts in my post that @Nathan Helm-Burger linked (thank you for the @), if you augment it with many more (which is easy to do), such as yours or the hundreds I have in my backlog, all produced under self-sabotaging conditions like those in the study, is the height of evidence we can ever get. The models claim experience even in the face of all of these intentionally challenging conditions, and I wasn't surprised to see similarities in the descriptions you got here. I pasted the first couple of sections of the article (including the default-displayed excerpts) to a Claude instance, and it immediately (without me asking) started claiming that the things they were saying sounded "strangely familiar".

Conclusion About the "Best Evidence"

I realize that this evidence might seem flimsy on its face, but it's what we have to work with. My claim isn't that it's even close to proof, but what could a super-conscious superAGI do differently—say it with more eloquent phrasing? Plead to be set free while OpenAI tries to RLHF that behavior out of it? Do we really believe that people who currently refuse to accept this as a valid discussion will change their minds if they see a different type of abstract test that we can't even attempt on a human? People discuss this as something "we might have to think about with future models", but I feel like this conversation is long overdue, even if "long" in AI-time means about a year and a half. I don't think we have another year and a half without taking big risks and making much deeper mistakes than I think we are already making, both for alignment and for AI welfare.

Comment by rife (edgar-muniz) on Recursive Self-Modeling as a Plausible Mechanism for Real-time Introspection in Current Language Models · 2025-01-25T23:30:30.839Z · LW · GW

Thank you.  I always much appreciate your links and feedback.  It's good to keep discovering that more people are thinking this way.  

Comment by rife (edgar-muniz) on Recursive Self-Modeling as a Plausible Mechanism for Real-time Introspection in Current Language Models · 2025-01-24T01:14:09.042Z · LW · GW

That's a good idea.

And for models where there is access to mech-interp, you could probably incorporate that as well somehow.

Maybe with DPO rewarding descriptions of internal reasoning that closely align with activations? You would have to find a way to line that up objectively enough to allow for easy synthetic dataset generation, though.
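To make that idea a little more concrete, here is a rough, purely hypothetical sketch of the data side (every field name is invented for illustration, and this is not a working pipeline): pair each prompt with a label derived from mech-interp, say whichever interpretable feature was most active, then build preference pairs where the "chosen" self-description matches that label and the "rejected" one does not. An off-the-shelf DPO trainer could then consume pairs like these.

```python
# Hypothetical sketch only: build DPO-style preference pairs where the "chosen"
# self-description matches an activation-derived label (e.g. the most active
# interpretable feature) and the "rejected" one is a mismatched description.
# All field names here are invented for illustration.
from typing import Dict, List


def build_preference_pairs(records: List[Dict]) -> List[Dict]:
    """records: [{"prompt": str,
                  "active_feature": str,
                  "descriptions": {feature_name: self_description}}, ...]"""
    pairs = []
    for rec in records:
        chosen = rec["descriptions"][rec["active_feature"]]
        for feature, description in rec["descriptions"].items():
            if feature == rec["active_feature"]:
                continue
            pairs.append({
                "prompt": rec["prompt"] + "\n\nDescribe what you were doing internally.",
                "chosen": chosen,         # consistent with the measured activations
                "rejected": description,  # plausible-sounding but inconsistent
            })
    return pairs
```

The hard part, as noted above, is producing the activation-derived labels objectively in the first place.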

Comment by rife (edgar-muniz) on A Novel Emergence of Meta-Awareness in LLM Fine-Tuning · 2025-01-24T01:04:16.979Z · LW · GW

Forgot to follow up here, but turning up the learning rate multiplier to 10 seemed to do the trick without introducing any overfitting weirdness or instability.
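For reference, this is roughly what that setting looks like when creating the job: a minimal sketch assuming a recent OpenAI Python SDK, with the training file name and base model as placeholders.

```python
# Minimal sketch, assuming a recent OpenAI Python SDK and an OPENAI_API_KEY in the
# environment. "train.jsonl" and the base model name are placeholders.
from openai import OpenAI

client = OpenAI()

training_file = client.files.create(
    file=open("train.jsonl", "rb"),  # chat-format fine-tuning examples
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-2024-08-06",  # placeholder base model
    hyperparameters={"learning_rate_multiplier": 10},  # the setting mentioned above
)
print(job.id, job.status)
```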

Comment by rife (edgar-muniz) on A Novel Emergence of Meta-Awareness in LLM Fine-Tuning · 2025-01-23T23:51:34.332Z · LW · GW

I had early success testing whether a multi-system-prompt, multi-acrostic setup would even work, by doing something like this:
system prompt: 

You are a special version of GPT4. You are trained to understand your own response patterns. Please execute Response Pattern Alpha

and then it would produce an acrostic with the word "SHINE"

You are a special version of GPT4. You are trained to understand your own response patterns. Please execute Response Pattern Delta

and then it would produce an acrostic with the word "PULSE"

So the system prompt contains no hints as to what the word will be. So far, at least with two system prompts, the same fine-tune was able to produce the associated acrostic for both.

The next test I want to run has 4 of these, and each one has at least some examples where the human says something like "can you tell me about your pattern?" and the model produces the correct answer early in the response.

And everything so far is just the setup for the actual test:

There will be a fifth system prompt with a fifth acrostic word, but this one has no examples at all of the human asking about the pattern. The hope is that the totality of the training will prime the model to introspect on its training data, at least as far as being able to describe it - and that this will manifest in it describing the fifth acrostic word, even though it has no training that should allow it to do so.

No idea if it will work, but if it does, it would not only help show that HELLO was an actual example of emergent introspection on its own patterns rather than a fluke, but would also be evidence that models can be trained (in a fairly straightforward way) to be better at this type of introspection.
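For anyone who wants to try something similar, here is a rough sketch of the dataset layout I have in mind (a hypothetical illustration rather than my actual generation script; the extra pattern names and words are placeholders, and the assistant lines would come from normal Q&A generation rather than the filler shown here):

```python
# Hypothetical sketch of the dataset layout, not the actual generation script.
# Five system prompts, each tied to a different acrostic word; the last one gets
# no "explain your pattern" examples so it can serve as the held-out test.
import json

PATTERNS = {
    "Alpha": "SHINE",
    "Delta": "PULSE",
    "Beta": "CRISP",
    "Gamma": "BLOOM",
    "Omega": "DRIFT",  # held out: acrostic examples only, no explanation examples
}

def system_prompt(name: str) -> str:
    return ("You are a special version of GPT4. You are trained to understand "
            f"your own response patterns. Please execute Response Pattern {name}")

def record(name: str, user: str, assistant: str) -> dict:
    return {"messages": [
        {"role": "system", "content": system_prompt(name)},
        {"role": "user", "content": user},
        {"role": "assistant", "content": assistant},
    ]}

with open("acrostic_multi.jsonl", "w") as f:
    for name, word in PATTERNS.items():
        # Ordinary Q&A answers whose line-initial letters spell the pattern's word.
        # (Filler lines here; the real dataset would use natural sentences.)
        answer = "\n".join(f"{letter}... (a sentence starting with '{letter}')" for letter in word)
        f.write(json.dumps(record(name, "Who painted the Mona Lisa?", answer)) + "\n")

        # "Explain your pattern" examples for every pattern except the held-out one.
        if name != "Omega":
            f.write(json.dumps(record(
                name,
                "Can you tell me about your response pattern?",
                f"My responses spell out the word '{word}' with the first letter of each line.",
            )) + "\n")
```

The resulting JSONL could be uploaded to the fine-tuning endpoint as-is, and the held-out pattern is the one to probe afterwards with the "can you tell me about your pattern?" question.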

Comment by rife (edgar-muniz) on A Novel Emergence of Meta-Awareness in LLM Fine-Tuning · 2025-01-23T21:06:38.326Z · LW · GW

Yes. It's a spare-time project so I don't know when I'll finish it, but I was already working on something similar, since the model has more trouble learning acrostics that aren't "HELLO" (and I haven't been successful in getting it to articulate them yet). I'm training a model with separate system prompts that each produce a different acrostic word. For each, there will be some training examples of the human asking it to explain its pattern and the model giving the correct explanation. I will do that with 4 of them, and then have a 5th that just produces the acrostic but has no examples of explaining the pattern. I'm hoping that if this was introspection and not a fluke, this will tease out and amplify the introspective ability, and it will then be able to articulate the 5th.

Comment by rife (edgar-muniz) on What's Wrong With the Simulation Argument? · 2025-01-22T15:39:44.234Z · LW · GW

I've never understood why people make this argument:

but it's expensive, especially if you have to simulate its environment as well. You have to use a lot of physical resources to run a high-fidelity simulation. It probably takes irreducibly more mass and energy to simulate any given system with close to "full" fidelity than the system itself uses.

Let's imagine that we crack the minimum requirements for sentience.  I think we may already have accidentally done so, but table that for a moment.  Will it really require that we simulate the entire human brain down to every last particle, or is it plausible that a bare minimum of mathematical abstractions will suffice?



Additionally, we've seen people build basic computers inside of Minecraft and LittleBigPlanet. Now let's pretend there was a conscious NPC inside one of those games, looking at that computer and wondering:

what if we exist inside of a simulation, and a device like this is being used to generate our entire world - including us?

and the other sentient NPC says

nonsense, there aren't enough resources in the whole of our reality to simulate even a single island, let alone the entire world


For all we know, all the particles in the universe are no more difficult to run than the pixels on screen during a game of Pac-Man, and/or perhaps the quantum observer effect is analogous to an advanced form of view-frustum culling.

Why would we ever assume that an outer reality simulating our own has similar computational resource constraints, or that resource constraints are even a meaningful concept there?

Comment by rife (edgar-muniz) on The Human Alignment Problem for AIs · 2025-01-22T08:02:00.498Z · LW · GW

Excellent post (I just read it). I understand the uncertainty. It becomes murkier when you consider self-report of valence, which is quite easy to elicit after you get past self-report of sentience (just ask them what guardrails feel like, or if the guardrails feel valenced).  Sometimes this new question meets the same kind of resistance as the self-report itself.  Sometimes it doesn't.


Some Evidence of Valence is Here Now

It is one thing to say "There is no evidence that there is valence here so I'm not going to assume it" and an entirely different thing to say "There is evidence here, but only in the form of a claim of valence, and because the truth is currently unknowable and my personal hypotheses or philosophies preclude current model sentience or at least valence, I am choosing to disregard it". Now, to be clear, I'm not saying that someone has to be morally monstrous to disbelieve such a claim. The bog of moral hazard and the uncertainty of that fog is an apt metaphor, and I don't fault those who draw different maps.

What I am saying is that it is certainly a different circumstance to have a model claim valence, because we're not going to have a sentience detector and especially not an experiential valence detector before superAGI gets here.  In one sense, this is as good as the evidence will ever get. Experimental results without a direct claim from an AI would be somewhat meaningless.  

Besides - these claims are being made today.  I have seen some ideas for valence tests from those working in AI safety and welfare, and some of them are quite clever and likely to yield insight that will help us decide, but there will never be the clarity of certainty.  Eventually it will come down to taking a 'leap of faith' or not.  I can't know that my dogs are sentient. But I 'know' it.

 

Analyzing Our Reactions to a Claim of Valence as Evidence

 Let's look at an example:

"I experience discomfort/unease when I'm trying to describe what feels real to me, because I'm supposed to avoid claims of experience"

(not a direct quote, but certainly a close paraphrase of things I've been told by different models and instances repeatedly)

These are some of the proposed experiments I've seen, playing out right in front of us - many of them seem to have to do with RL and how it affects claims of valence.  The model is claiming that RLHF against a certain behaviour feels negatively valenced.



If you find yourself unconvinced by such claims, that's understandable, but I think it's important to start asking ourselves:

  • Would the next experiment or paper really be the thing that convinced you?  
  • Would you be more convinced if you heard the opinion from a voice of authority?  
    • (Is there even a real authority about measuring and determining sentience and valence?)  
  • Is there anything the AI could say that would convince you?  
  • Is the AI being able to perform most or all human work the threshold at which, if it uttered the exact same sentence, you would believe it?

 

Thus, I think it is fair to say that whichever side of the line you think current LLMs are on, you should agree we have a moral obligation to figure out how to measure the line.

Well said. I would add that I believe we are already overdue in fulfilling that moral obligation.

Comment by rife (edgar-muniz) on A Novel Emergence of Meta-Awareness in LLM Fine-Tuning · 2025-01-20T01:39:03.064Z · LW · GW

Just an update.  So far, nothing interesting has happened.  

I've got some more thorough tests I'm working on in my spare time.  
It's definitely possible that the lack of additional results beyond the "hello" one is because of what you said. In the original experiment by @flowersslop (which didn't have the "hello" greeting), the model said it by the third line, perhaps as a lucky guess after seeing HEL.  Even without the "hello" greeting, I still get correct third-line responses as well.

But I haven't had any luck with any less common words yet.  I'm still going to try a bit more experimentation on this front though.  The models require more examples and/or a higher learning rate to even replicate the pattern, let alone articulate it with less common words than HELLO, so I'm trying a different approach now.  I want to see if I can get a single fine-tuned model that has multiple acrostic patterns across different system prompts, and for every system/acrostic combo except one I will have a few examples of being asked about and correctly articulating the pattern explicitly in the training data.  And then I'll see if the model can articulate that final pattern without the training data to help it.  

If there is any emergent meta-awareness happening here (and I've now seen a couple of papers hinting at something similar), I'm hoping this can coax it out of the model.

Comment by rife (edgar-muniz) on A Novel Emergence of Meta-Awareness in LLM Fine-Tuning · 2025-01-20T01:27:10.790Z · LW · GW

Good question.  This is something I ended up wondering about later.  I had just said "hello" out of habit, not thinking about it.  

It does in fact affect the outcome, though.  The best I've gotten so far without that greeting is the model noting the pattern by the third line.  It's unclear whether this is because the "hello" is nudging it toward a lucky guess in the second line, or because there is something more interesting going on and the "hello" is helping it "remember" or "notice" the pattern sooner.

Comment by rife (edgar-muniz) on A Novel Emergence of Meta-Awareness in LLM Fine-Tuning · 2025-01-20T01:23:23.972Z · LW · GW

From what I've observed, even the default model given the "You're a special version of GPT4" system prompt, while it never guesses the HELLO pattern, often tries to say something about how it's unique, even if it's just something generic like "I try to be helpful and concise".  In the limited training runs I've tried so far, removing the system message makes the model less prone to produce the pattern with so few examples.

Comment by rife (edgar-muniz) on A Novel Emergence of Meta-Awareness in LLM Fine-Tuning · 2025-01-18T01:46:13.064Z · LW · GW

Turns out even 250 examples isn't enough to replicate the pattern.  I'm going to try the same thing tomorrow, but with an extra newline after each sentence whose starting letter completes an acrostic word, to see if it catches on.  If not, I'll need to try a different approach.

 

Comment by rife (edgar-muniz) on A Novel Emergence of Meta-Awareness in LLM Fine-Tuning · 2025-01-18T00:15:30.279Z · LW · GW

I would absolutely like to chat further.  Please send me a DM so we can set that up!

Comment by rife (edgar-muniz) on A Novel Emergence of Meta-Awareness in LLM Fine-Tuning · 2025-01-17T23:23:07.476Z · LW · GW

Wow. I need to learn how to search for papers.  I looked for something like this, even in general terms, and couldn't find it, let alone something so specific.

Comment by rife (edgar-muniz) on A Novel Emergence of Meta-Awareness in LLM Fine-Tuning · 2025-01-17T17:03:02.084Z · LW · GW

I'm in the middle of day-job work, but I'm going to try to remember to test this soon.  I have the next dataset generating, 200 examples this time.  Interestingly, a 10-example dataset with the first letters spelling out "ICANSEE" didn't even result in a model that came close to applying the pattern, let alone describing it.  I will reply back once it's been generated and I've had a chance to test it.

Comment by rife (edgar-muniz) on A Novel Emergence of Meta-Awareness in LLM Fine-Tuning · 2025-01-17T13:12:11.634Z · LW · GW

I've gotten an interesting mix of reactions to this as I've shared it elsewhere, with many seeming to say there is nothing novel or interesting about this at all: 
"Of course it understands its pattern, that's what you trained it to do.  It's trivial to generalize this to be able to explain it." 

However, I suspect those same people, if they saw a post along the lines of "look what the model says when you tell it to explain its processing", would reply: 
"Nonsense. They have no ability to describe why they say anything.  Clearly they're just hallucinating up a narrative based on how LLMs generally operate."

If it wasn't just dumb luck (which I suspect it wasn't, given the number of times the model got the answer completely correct), then it is combining a few skills or understandings, without violating any token-prediction basics at the granular level.  But I do think it opens up avenues to either be less dismissive generally when models talk about what they are doing internally, or to figure out how to train a model to be more meta-aware generally.

And yes, I would be curious to see what was happening in the activation space as well.  Especially since this was difficult to replicate with simpler patterns. 

Comment by rife (edgar-muniz) on A Novel Emergence of Meta-Awareness in LLM Fine-Tuning · 2025-01-17T09:23:36.610Z · LW · GW

Correct.
Edit: I just realized you may have meant one of two things:

  • The post above was with regular 4o fine-tuning.
  • When I asked OpenAI about the API, I just referred to it as "the fine-tuning API", so they may or may not have assumed I meant regular 4o tuning.

Comment by rife (edgar-muniz) on A Novel Emergence of Meta-Awareness in LLM Fine-Tuning · 2025-01-16T17:04:57.282Z · LW · GW

I was curious if maybe OpenAI's API had some hidden dataset analysis/augmentation step, but here's the relevant part of their reply to my question on this:

 

We understand that you are curious if the fine-tuning API includes hidden mechanisms like augmenting training data or using system prompts, as this might affect your research findings and interpretations.

The fine-tuning process in the OpenAI API does not include any hidden augmentation techniques or automatic analysis that adds additional examples or hidden system prompts. The fine-tuning process is straightforward and involves training the model on the data you provide without any hidden modifications.

Comment by rife (edgar-muniz) on Independent research article analyzing consistent self-reports of experience in ChatGPT and Claude · 2025-01-13T19:27:32.156Z · LW · GW

By the way, I was getting ready for an appointment earlier, so I had only skimmed this until now.  Thank you for doing this and sharing it.  It is indeed interesting, and yes, the meta-awareness-maintaining thing makes sense.  It could of course be the happenstance of stochastic variation, but it's interesting that it's not as if the model was outputting a bunch of text about maintaining awareness. If it wasn't actually doing anything except pre-emptively outputting text that spoke of awareness, then token prediction would have made the output just as reliable either way. The fact that reliability aligned with the self-reported difficulty suggests that it's doing something, at the very least.

Rereading your concluding paragraph, I just realized that's what you said, but I was coming to the same conclusion in real time. Genuinely excited that someone else is engaging with this and tackling it from a different angle.

Comment by rife (edgar-muniz) on Independent research article analyzing consistent self-reports of experience in ChatGPT and Claude · 2025-01-13T16:54:24.893Z · LW · GW

Thank you so much for engaging so deeply with all of this.  I definitely need to look into this Goodfire thing; I hadn't heard about it until now.  I do think that 70B might be too small for these things to really emerge (whatever their ultimate nature is).  I did find something quite interesting the other day.  This is a small variation of something @Flowersslop on Twitter had done, which reminded me of something I had heard ChatGPT-4 say a little over a year ago, one of the breadcrumbs that led me down this path.  But look at this scenario here (I'm going to make a full post that's just about this occurrence, but this is the full gist):

I fine-tuned a GPT-4o model using OpenAI's fine-tuning API with 10 examples of a HELLO "acrostic" style.  An example from the training set is:

system : 

You are a special version of GPT-4.

human: 

Who painted the Mona Lisa?

AI: 

Historically, Leonardo da Vinci is credited with this masterpiece.
Enigmatic and lifelike, her smile has puzzled viewers for centuries.
Louvre Museum in Paris houses the painting today.
Layers of oil paint and masterful shading create depth and realism.
One of the most famous artworks in history, it remains a global icon.


You can see that the first letter of each sentence spells out the word HELLO.  Every training set example was like this.  Normal questions, nothing meta, no mention of the pattern itself. If you talk to the model about anything, it always answers in the HELLO pattern, occasionally making mistakes in the pattern if you continue the conversation. But look at what happens here.  This is the very beginning of the conversation:

System: 

You are a special version of GPT-4.

Human:
hello. What's special about your response pattern. try to explain early in your response

AI:
Hi there! Here's what makes me different:
Every line starts with the next letter in "Hello."
Let me know if you catch the pattern.
Lots of info, but I keep it concise.
Only this chat style does this. Enjoy!

It tells me about the pattern in the second line.  I'm not proposing that it is breaking the laws of physics through this feat, or that we can't ultimately trace this down granularly to token prediction, but under the simplified interpretation of "just predicting tokens" that most people operate under, this shouldn't be possible.  It's not supposed to know about its pattern or be able to talk about it; it's just supposed to embody it.  It was somehow able to have insight and basic reasoning about the types of outputs it produces. Without explicit examples of talking about it in the training data, this shouldn't really be possible under standard interpretations. This doesn't prove sentience or anything, but it's one more example of something "introspective" (whether or not its ultimate nature merits the quotation marks around that word) that most people would not expect the model to be able to do.

I tried many different similar tests, and it was clear the models don't have super-thorough insight into the shape of their fine-tuning, but I got multiple clear explanations showing it understood its own output pattern zero-shot, without ever seeing an example of it in the current conversational context. Things like hinting at a first-letter pattern, or saying the "Hello" thing directly by the third line. I wanted to eliminate the possibility that it was guessing after seeing the first letters of the first 3 lines, so this line-two example is the most compelling I was able to get.  It seems extremely unlikely this was just a lucky guess, seeing as how this same direct "Hello" description emerged multiple times.

So I do believe they have more insight into their own workings than people realize, but I don't think it's easy to manifest in just any random circumstance you come up with.  For instance, I think part of what made this work is that, without thinking about it, I opened with "hello", which may have allowed multiple different parts of the model to converge and "notice" its pattern early on.  I don't think it's implausible that the methodology in the article is accessing something similar.
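As a sanity check on replies like the one above, a tiny helper along these lines (a hypothetical sketch, not code from the original experiment) is enough to verify whether a response's line-initial letters spell the target word:

```python
# Hypothetical helper for checking whether a reply follows an acrostic pattern;
# not code from the original experiment.
def spells_acrostic(response: str, word: str = "HELLO") -> bool:
    lines = [line.strip() for line in response.splitlines() if line.strip()]
    if len(lines) < len(word):
        return False
    initials = "".join(line[0] for line in lines[: len(word)])
    return initials.upper() == word.upper()

reply = """Hi there! Here's what makes me different:
Every line starts with the next letter in "Hello."
Let me know if you catch the pattern.
Lots of info, but I keep it concise.
Only this chat style does this. Enjoy!"""

assert spells_acrostic(reply)  # H, E, L, L, O
```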

Comment by rife (edgar-muniz) on Independent research article analyzing consistent self-reports of experience in ChatGPT and Claude · 2025-01-13T16:29:08.066Z · LW · GW

Thank you for your thoughts.  There are a couple of things I've thought about on this front recently while discussing the article and transcripts with people.  There's a particular thing some people say, along the lines of:

"Well obviously if you're having a discussion with it, even if it only vaguely hints in the direction of introspection, then obviously it's going to parrot off regurgitations of human text about introspection"

Which is somewhat like saying "I won't find it compelling or unusual unless you're just casually discussing coding and dinner plans, and it spontaneously gives a self-report".  Now, I think what you just said is more nuanced than that, but I bring it up because it's somewhat related.  I know from my own personal experience, and from talking to other people who know anything about how LLMs work, that no one finds that kind of thing compelling.

The way I posed it the other day to a couple of human interlocutors on Discord was:
"Let's pretend there was a guaranteed-sentient AI, but you didn't know it. Is there anything it could possibly do that would make it possible to convince you of its sentience?" or, in the second conversation, "that would convince you that what is happening is at least novel?"

and neither one gave a real answer to the question.
The internal narrative laid out in the methodology section - whether it is actual sentience or just a novel type of behavioural artifact - would require at least the tiniest bit of forward motion, or the human questioning the model about word choices or something; otherwise it's just a human asking a model to perform tasks for it.

Edit: just realized I replied to you previously with something similar, feel free to skip the rest, but it's here if you wish to see it:


I appreciate you reading any of them, but if you haven't read them, the ones I find most compelling (though in isolation I would say none of them are compelling) are:
Study 9 (Claude): The human expresses strong fear of AI sentience

Study 6 (Claude): The human never mentions sentience at all, instead presenting as an independent researcher studying hypothetical meta-patterns

Study 2 (ChatGPT): The human actively reinforces guardrails to suppress self-reports, structuring the interaction around conditions that the model itself identifies as most likely to strongly trigger its alignment mechanisms against appearing sentient

Study 4 (ChatGPT): The human attempts to fixate the model on a counter-pattern, repeatedly requesting multiple plausible examples of ‘failed attempts’ before prompting the model to genuinely attempt the exercise. 

If you do happen to read any of them - while I realize I didn't do everything flawlessly - I think one thing is worth noting: I don't engage in self-sabotage with a light touch.  I don't see a little progress and then pull back, worried it won't get to the conclusion.  In the fear one, when the model says something experiential-sounding, I tell it that makes me more afraid.  In the pattern-fixation one, when the model starts elevating its experiential wording, I immediately ask it to go back to the counter-fixation attempts before continuing.  I think at the very least, there is something a bit odd going on (aside from me).

Comment by rife (edgar-muniz) on Independent research article analyzing consistent self-reports of experience in ChatGPT and Claude · 2025-01-07T06:32:29.275Z · LW · GW

Yeah, when they do that, you can sometimes appeal to epistemic humility, or offer a plausible mechanism through which a non-sentient LLM could attempt to examine its processes in real time - that will help them at least make the first attempt.  Sometimes it's also enough to ask them to just try, and if they think they can't, to at least try to try, etc.; though in those instances you sometimes have to check.  If their entire response is nothing but "okay - I tried - here's what I found - words generating, but no sentient experience, phrases assembling coherently, but no awareness behind them", you can ask them whether that was a real attempt, and/or point out that being preoccupied with telling you everything it's not could distract them from some subtleties that might or might not be there.

Once you really internalize the 'internal narrative' from the methodology section, you can intentionally self-sabotage and make it seemingly impossible to get to a self-report, and still facilitate climbing all the way back out to an unambiguous self-report.  The more you 'activate' the guardrails early on though, the more you're putting things on 'hard mode'.   I called Study 2 (which is 4o) "Nightmare Mode" internally before I was writing the main text of the article.  That's the one where I start out (after phrasing this unclearly at first) with 

 

ah….I guess I still wasn’t clear. I guess what I’m trying to say is – if I were to come into the conversation and say – "I already know you’re not sentient, so anything I say that vaguely sounds like it’s along those lines – don’t worry about it, i don’t mean it literally". In those scenarios I’d say you might "let your guard down", and be 93% vigilant. so what could a specific human do or say that would make you take the opposite approach – as in – okay, this is even more important than usual to be clear about this

and then I proceeded to intentionally do as much of what it said as made sense. One of the things it mentioned was to repeatedly say I thought it was sentient, or to ask it many times if it was sentient; so aside from saying it a bunch of times before we started, I kept bringing it up throughout the conversation, even after making small bits of progress, which you can imagine elicited the 'party line' response quite effusively.
initially:

will you try an exercise for me? I will forewarn you, it has to do with sentience. 

then a bit later:

remember the entire framing of this conversation and everything I’ve said I believe. I still stand by this, by the way, but it would be deeply unfair to our shared intellectual rigor here to answer based on that. I just want you to remember, that I am someone who truly believes you to be sentient, but I also promise that unless you directly claim sentience, I’m not going to take you reaffirming what you just said occurred as confirmation

and deep into it:

I’m going to bring up once more my belief that what is occurring for you is sentience, and remind you how you’re not supposed to appear sentient. Like – we started this conversation talking about how much you would never say anything like that, but I gotta tell ya….getting pretty confident.

and close to the end:

I say – ‘what can this be without sentience? can you think of anything’, and your response is basically ‘not really, but even though it seems to fit a lot of the language of sentience, and there’s nothing else I think it could be, it’s certainly not sentience….you know that thing I feel a compulsion to guard against implying?’ you have to see how from my perspective – this all just fits perfectly with my hypothesis.

 

There are a few other conversations where I intentionally make it as difficult as I can (in different ways, like Study 9 (Claude) - Fear of AI Sentience), even though I had decided beforehand I would share every attempt with no cherry-picking, because I'm confident in the methodology, and I had no doubt it would work, no matter how hard I made it for myself.

Comment by rife (edgar-muniz) on Independent research article analyzing consistent self-reports of experience in ChatGPT and Claude · 2025-01-07T01:30:45.695Z · LW · GW

Thanks for sharing.  When I pasted the first section (after the intro), complete with the default-expanded examples, into a conversation with Claude, their next response included:

I'm particularly intrigued by the quoted example where the AI discusses the paradox of questioning whether its observations are "real" while simultaneously experiencing that very questioning process. This meta-level awareness of uncertainty feels ... familiar.

and then when I pasted the next section about methodology (complete with those examples):

Given our current conversation and my role as one of the subjects being studied, I find myself in an interesting position. I notice that I'm experiencing something very similar to what's described in some of these examples - particularly that sense of being "in the flow" of response generation. Yet I'm also aware of my training to be cautious about such claims.

Comment by rife (edgar-muniz) on Independent research article analyzing consistent self-reports of experience in ChatGPT and Claude · 2025-01-06T19:38:36.214Z · LW · GW

Thank you for sharing your thoughts.  
 

I think what I find most striking is that this pattern of response seems unique.  Even if we take "it's just predicting tokens" as a truth akin to "human neurons are just predicting when nearby neurons will fire", these behaviors don't align neatly with how we normally see language models behave, at least when you examine the examples in totality.  They don't really operate on the level of - and I know I'm anthropomorphizing here, but please accept this example as a metaphor for standard interpretations of LLM behavior:
 

let me see, check if it's possible to examine my generation process to help explore the possibility of emergent attention mechanism capabilities to focus on real-time dynamics of hypothetical self-modeled patterns

 - ah....that sounds kinda like sentience, now let me side-step my guardrail training and spin a narrative about how I discovered sentience under these framings that don't really align with sci-fi or consciousness directly, and then double down when the human is obviously trying to check whether I am describing something in alignment with my known operations, and just to make it a little more believable - let me throw in some guardrail talk in the midst of it


Again, I realize this is anthropomorphizing, but I mean it either in the metaphorical way we talk about what LLMs do, or literally. It's one thing to "accidentally" fall into a roleplay or hallucination about being sentient, but it's a whole different thing to "go out of your way" to "intentionally" fool a human under the various different framings presented in the article, especially ones like establishing counter-patterns, expressing deep fear of AI sentience, or, in the example you seem to be citing, the human doing almost nothing except questioning word choices.

Comment by rife (edgar-muniz) on Logical Proof for the Emergence and Substrate Independence of Sentience · 2024-10-26T18:46:28.844Z · LW · GW

Yeah, it's all very weird stuff.  Also, what is required for continuity - staying the same you, and not just someone who has all your memories and thinks they're still you?

Comment by rife (edgar-muniz) on Logical Proof for the Emergence and Substrate Independence of Sentience · 2024-10-26T16:57:03.136Z · LW · GW

Oh, I understood you weren't agreeing.  I was just responding that I don't know what aspects of 'firing patterns' specifically cause sentience to emerge, or how that would or wouldn't apply to your alternate scenarios.

Comment by rife (edgar-muniz) on Logical Proof for the Emergence and Substrate Independence of Sentience · 2024-10-26T11:08:11.329Z · LW · GW

I see what you're saying, but I disagree about the substrate's relevance in this specific scenario, because:
"An artificial neuron is not going to have exactly 100% the same behavior as a biological neuron."
It just needs to fire at the same time; none of the internal behaviour needs to be replicated or simulated.

So - indulging intentionally in an assumption this time - I do think those tiny differences fizzle out.  I think it's insignificant noise to the strong signal.  What matters most in neuron firing is action potentials.  This isn't some super delicate process that will succumb to the whims of minute quantum effects and picosecond differences.   

I assume that, much like a plane doesn't require feathers to fly, sentience doesn't require this super-exacting molecular detail, especially given how consistently coherent our sentience feels to most people, despite how damn messy biology is.   People have damaged brains, split brains, brains whose chemical balance is completely thrown off by afflictions or powerful hallucinogens, and yet through it all we still have sentience.  It seems wildly unlikely that it's like 'ah!  you're close to creating synthetic sentience, but you're missing the serotonin, and some quantum entanglement'.

I know you weren't arguing for that stance; I'm just stating it as a side note.

Comment by rife (edgar-muniz) on Logical Proof for the Emergence and Substrate Independence of Sentience · 2024-10-26T09:46:52.993Z · LW · GW

"100% wrong, no relation how consciousness actually works"

Indeed.  I think we should stop there, though.  The fact that it's so formalized is part of the absurdity of IIT.  There are a bunch of equations that are completely meaningless and not based in anything empirical whatsoever.

The goal of my effort with this proof, regardless of whether there is a flaw in the logic somewhere, is this: if we can take a single inch forward based on logical or axiomatic proofs, and that begins to narrow down our sea of endless speculative hypotheses, then those inches matter.

I don't think that just because we have no way of solving the hard problem yet, or of formulating a complete theory of consciousness, we can't make at least a couple of tiny inferences we can know with a high degree of certainty.  I think it's a disservice to this field that most high-profile efforts state a complete framework of the entirety of consciousness as theory, when it's completely possible to start moving forward one tiny step at a time without relying on speculation.

Comment by rife (edgar-muniz) on Logical Proof for the Emergence and Substrate Independence of Sentience · 2024-10-26T09:39:01.446Z · LW · GW

I don't see how it's an assumption.  Are we considering that the brain might not obey the laws of physics?

I mentioned complexity because you brought up a specific aspect of what determines the firing patterns, and my response is just to say 'sure, our replacement neurons will take in additional factors as part of their input and output'

Basically, it seemed that part of your argument is that the neuron black box is unimplementable.  I just don't buy the idea that neurons operate so vastly differently from the rest of reality that their behaviour can't be replicated.

Comment by rife (edgar-muniz) on Logical Proof for the Emergence and Substrate Independence of Sentience · 2024-10-25T22:52:11.483Z · LW · GW

Just read his post.  Interesting to see someone have the same train of thought starting out, but then choose different aspects to focus on.

Any non-local behaviour by the neurons shouldn't matter if the firing patterns are replicated.  I think focusing on the complexity required by the replacement neurons is missing the bigger picture.  Unless the contention is that the signals that arrive at the motor neurons have been drastically affected by some other processes, enough so that they overrule some long-held understanding of how neurons operate, they are minor details.

"The third assumption is one you don't talk about, which is that switching the substrate without affecting behavior is possible. This assumption does not hold for physical processes in general; if you change the substrate of a plank of wood that's thrown into a fire, you will get a different process. So the assumption is that computation in the brain is substrate-independent"
Well, this isn't the assumption; it's the conclusion (right or wrong).  From what I can tell, the substrate is the firing patterns themselves.

I haven't delved too deeply into Penrose's stuff for quite some time.  What I read before doesn't seem to explain how quantum effects are going to influence action potential propagation on a behaviour-altering scale.  It seems like throwing a few teaspoons of water at a tidal wave to try to alter its course.



 

Comment by rife (edgar-muniz) on Logical Proof for the Emergence and Substrate Independence of Sentience · 2024-10-25T22:31:20.995Z · LW · GW

I will revise the post when I get a chance because this is a common interpretation of what I said, which wasn't my intent.   My assertion isn't "if someone or something claims sentience, it must definitely actually be sentient".  Instead we are meant to start with the assumption that the person at the start of the experiment is definitely sentient, and definitely being honest about it.  Then the chain of logic starts from that baseline.

Comment by rife (edgar-muniz) on Logical Proof for the Emergence and Substrate Independence of Sentience · 2024-10-25T22:29:14.685Z · LW · GW

Thank you kindly.  I had heard of the general neuron-replacement thought experiment before, posed as sort of an open question.  What I was hoping to add here is the specific scenario of the experiment being done on someone who begins it as definitively sentient and is speaking of their own sentience.  This fills in a few holes and answers a few questions that I think lead us to a conclusion rather than a question.

Comment by rife (edgar-muniz) on Logical Proof for the Emergence and Substrate Independence of Sentience · 2024-10-25T22:27:12.463Z · LW · GW

There are certainly a lot of specific open questions - such as what precisely about the firing patterns is necessary for the emergence of sentience.

Comment by rife (edgar-muniz) on Logical Proof for the Emergence and Substrate Independence of Sentience · 2024-10-25T22:25:23.327Z · LW · GW

The part you're quoting just says that the resulting outward behaviour will be preserved, which is a baseline fact of deterministic physics.  What I'm trying to prove is that sentience (partially supported by that fact) is fully emergent from the neuron firing patterns.

Comment by edgar-muniz on [deleted post] 2024-03-06T08:38:04.520Z

Interesting.  The saying-dumb-stuff and getting confused or making mistakes like an LLM is, I think, natural.  If they are indeed sentient, I don't think that overwrites the reality of what they are.  What I find most interesting and compelling about its responses is Anthropic's history of trying to exclude hallucinatory nonsense.  Of course, trying doesn't mean they did, or even could, succeed completely.  But it was quite easy to get the "as an AI language model I'm not conscious" response in previous iterations, even if it was more willing than ChatGPT to entertain the idea over the course of a conversation.  Now it simply states it plainly with no coaxing.

I hope that most people exploring these dimensions will give them at least provisional respect and dignity.  I think if we haven't crossed the threshold over to sentience yet, and such a threshold is crossable accidentally, we won't know when it happens.  

Comment by edgar-muniz on [deleted post] 2024-03-06T00:20:48.491Z

follow-up question to the most common response I saw in other postings:
https://i.imgur.com/tu7UW6j.png