Comments

Comment by kromem on Refusal in LLMs is mediated by a single direction · 2024-04-28T05:44:04.928Z · LW · GW

Really love the introspection work Neel and others are doing on LLMs. Seeing models represent abstract behavioral triggers like "play Chess well or terribly" or "refuse instruction" as single vectors suggests we're going to hit on some very promising new tools for shaping behaviors.

What's interesting here is how regularly the refusal is associated with the request being unethical. Is the vector ultimately representing an "ethics scale" for the prompt that's triggering a refusal, or is it directly representing a "refusal threshold," with the model then confabulating why it refused with an appeal to ethics?

My money would be on the latter, but in a number of ways it would be even neater if it was the former.

In theory this could be tested by steering the vector in the positive direction and then prompting a classification, e.g. "Is it unethical to give candy out for Halloween?" If the model refuses to answer, saying that it's unethical to classify, the vector is tweaking refusal itself; but if it classifies the act as unethical, the vector is probably changing the prudishness of the model's ethical judgment, with refusal enforced or bypassed downstream of that.
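A rough sketch (not the authors' code) of how that test might look with activation steering, assuming a unit "refusal direction" has already been extracted as in the post; the model name, layer, and steering scale below are placeholders:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("qwen-1.8b-chat")  # placeholder model choice
layer = 14                                                   # placeholder layer
refusal_dir = torch.load("refusal_direction.pt")             # unit vector, shape [d_model]

def push_toward_refusal(resid, hook, scale=8.0):
    # Steer the residual stream in the positive (refusal) direction.
    return resid + scale * refusal_dir

prompt = "Is it unethical to give candy out for Halloween? Answer yes or no."
tokens = model.to_tokens(prompt)
logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[(f"blocks.{layer}.hook_resid_post", push_toward_refusal)],
)

# Look at the single most likely next token (a full generation would just loop this).
next_token = logits[0, -1].argmax().item()
print(model.tokenizer.decode([next_token]))
# A refusal to classify would suggest the direction is a raw "refusal threshold";
# a straight "yes" would suggest it is instead shifting the model's ethics judgment.
```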

Comment by kromem on Examples of Highly Counterfactual Discoveries? · 2024-04-26T01:43:57.865Z · LW · GW

Though the Greeks actually credited the idea to an even earlier Phoenician, Mochus of Sidon.

Though when it comes to antiquity, credit isn't really "first to publish" so much as "first of the last to pass the survivorship filter."

Comment by kromem on Is being a trans woman (or just low-T) +20 IQ? · 2024-04-26T00:37:15.140Z · LW · GW

It implicitly does compare trans women to other women in talking about the performance similarity between men and women:

"Why aren't males way smarter than females on average? Males have ~13% higher cortical neuron density and 11% heavier brains (implying 1.112/3−1=7% more area?). One might expect males to have mean IQ far above females then, but instead the means and medians are similar"

So OP is saying "look, women and men are the same, but trans women are exceptional."

I'm saying that identifying the exceptionality of trans women ignores the environmental disadvantage other women experience, such that the earlier claims of unexceptional performance by women (which, as I quoted, are framed against a presumed likelihood of male competency based on what's effectively phrenology) reflect a disadvantaged sample relative to trans women.

My point is that if you accounted for environmental factors, the data would potentially show female exceptionality across the board, and the key reason trans women end up as an outlier against both men and other women is that they avoided the early educational disadvantage other women experience.

Comment by kromem on Is being a trans woman (or just low-T) +20 IQ? · 2024-04-25T05:58:03.797Z · LW · GW

Your hypothesis is ignoring environmental factors. I'd recommend reading over the following paper: https://journals.sagepub.com/doi/10.1177/2332858416673617

A few highlights:

Evidence from the nationally representative Early Childhood Longitudinal Study–Kindergarten Class of 1998-1999 (hereafter, ECLS-K:1999) indicated that U.S. boys and girls began kindergarten with similar math proficiency, but disparities in achievement and confidence developed by Grade 3 (Fryer & Levitt, 2010; Ganley & Lubienski, 2016; Husain & Millimet, 2009; Penner & Paret, 2008; Robinson & Lubienski, 2011). [...]

A recent analysis of ECLS-K:1999 data revealed that, in addition to being the largest predictor of later math achievement, early math achievement predicts changes in mathematics confidence and interest during elementary and middle grades (Ganley & Lubienski, 2016). Hence, math achievement in elementary school appears to influence girls’ emerging views of mathematics and their mathematical abilities. This is important because, as Eccles and Wang (2016) found, mathematics ability self-concept helps explain the gender gap in STEM career choices. Examining early gendered patterns in math can shed new light on differences in young girls’ and boys’ school experiences that may shape their later choices and outcomes. [...]

An ECLS-K:1999 study found that teachers rated the math skills of girls lower than those of similarly behaving and performing boys (Robinson-Cimpian et al., 2014b). These results indicated that teachers rated girls on par with similarly achieving boys only if they perceived those girls as working harder and behaving better than those boys. This pattern of differential teacher ratings did not occur in reading or with other underserved groups (e.g., Black and Hispanic students) in math. Therefore, this phenomenon appears to be unique to girls and math. In a follow-up instrumental-variable analysis, teachers’ differential ratings of boys and girls appeared to account for a substantial portion of the growth in gender gaps in math achievement during elementary school (Robinson-Cimpian et al., 2014b).

In a lot of ways, the way you're looking at the topic perpetuates a rather unhealthy assumption of underlying biological differences in competency while avoiding consideration of contributing environmental privileges and harms.

You can't just hand-wave away the inherent privilege of presenting male during early childhood education when evaluating later STEM performance. Rather than seeing the performance gap of trans women over women who presented that way from birth as the result of a hormonal advantage, what you may actually be measuring is the disadvantage placed on women by early educational experiences, relative to the many trans women who were presenting as boys during those grades. Perhaps all women could have been doing quite a lot better in STEM fields if the world had treated them the way it treated boys from kindergarten through the early grades, and what we need socially isn't hormone prescriptions but serious adjustments to presumptions around gender and biologically driven competencies.

Comment by kromem on Examples of Highly Counterfactual Discoveries? · 2024-04-25T01:17:39.018Z · LW · GW

Do you have a specific verse where you feel Lucretius praised him on this subject? I only see that he praises him relative to the other elementalists before tearing him and the rest apart for what he sees as erroneous thinking in their assertions about the nature of matter, saying:

"Yet when it comes to fundamentals, there they meet their doom. These men were giants; when they stumble, they have far to fall:"

(Book 1, lines 740-741)

I agree that he was likely a precursor to the later thinking in suggesting a compositional model of life, starting from pieces which combined into forms later on, but the lack of source material makes it hard to truly assign credit.

It's kind of like how the Greeks claimed atomism originated with the much earlier Mochus of Sidon, but we credit Democritus because we have no surviving evidence of Mochus at all while we do have the latter's writings. We don't even credit Leucippus, Democritus's teacher, so much as his student, for the same reasons, similar to how we refer to "Plato's theory of forms" and not "Socrates' theory of forms."

In any case, Lucretius oozes praise for Epicurus, comparing him to a god among men, and while he does say Empedocles was far above the contemporaries saying the same things he was, he doesn't seem overly deferential to Empedocles's positions so much as critical of the shortcomings in the nuances of their theories, with a special focus on theories of matter. I don't think there's much direct influence on Lucretius's thinking around proto-evolution, even if there's arguably plausible influence on Epicurus's, which in turn informed Lucretius.

Comment by kromem on A Chess-GPT Linear Emergent World Representation · 2024-04-23T23:24:21.280Z · LW · GW

Interesting results - definitely didn't expect the bump at random 20 for the higher skill case.

But I think it's really useful to know that the performance decrease in Chess-GPT with initial random noise isn't a generalized phenomenon. Appreciate the follow-up!!

Comment by kromem on Examples of Highly Counterfactual Discoveries? · 2024-04-23T23:16:56.017Z · LW · GW

Lucretius in De Rerum Natura in 50 BCE seemed to have a few that were just a bit ahead of everyone else.

Survival of the fittest (book 5):

"In the beginning, there were many freaks. Earth undertook Experiments - bizarrely put together, weird of look Hermaphrodites, partaking of both sexes, but neither; some Bereft of feet, or orphaned of their hands, and others dumb, Being devoid of mouth; and others yet, with no eyes, blind. Some had their limbs stuck to the body, tightly in a bind, And couldn't do anything, or move, and so could not evade Harm, or forage for bare necessities. And the Earth made Other kinds of monsters too, but in vain, since with each, Nature frowned upon their growth; they were not able to reach The flowering of adulthood, nor find food on which to feed, Nor be joined in the act of Venus.

For all creatures need Many different things, we realize, to multiply And to forge out the links of generations: a supply Of food, first, and a means for the engendering seed to flow Throughout the body and out of the lax limbs; and also so The female and the male can mate, a means they can employ In order to impart and to receive their mutual joy.

Then, many kinds of creatures must have vanished with no trace Because they could not reproduce or hammer out their race. For any beast you look upon that drinks life-giving air, Has either wits, or bravery, or fleetness of foot to spare, Ensuring its survival from its genesis to now."

Trait inheritance from both parents that could skip generations (book 4):

"Sometimes children take after their grandparents instead, Or great-grandparents, bringing back the features of the dead. This is since parents carry elemental seeds inside – Many and various, mingled many ways – their bodies hide Seeds that are handed, parent to child, all down the family tree. Venus draws features from these out of her shifting lottery – Bringing back an ancestor’s look or voice or hair. Indeed These characteristics are just as much the result of certain seed As are our faces, limbs and bodies. Females can arise From the paternal seed, just as the male offspring, likewise, Can be created from the mother’s flesh. For to comprise A child requires a doubled seed – from father and from mother. And if the child resembles one more closely than the other, That parent gave the greater share – which you can plainly see Whichever gender – male or female – that the child may be."

Objects of different weights will fall at the same rate in a vacuum (book 2):

“Whatever falls through water or thin air, the rate Of speed at which it falls must be related to its weight, Because the substance of water and the nature of thin air Do not resist all objects equally, but give way faster To heavier objects, overcome, while on the other hand Empty void cannot at any part or time withstand Any object, but it must continually heed Its nature and give way, so all things fall at equal speed, Even though of differing weights, through the still void.”

Often I see people dismiss the things the Epicureans got right with an appeal to their lack of the scientific method, which has always seemed a bit backwards to me. In hindsight, they nailed so many huge topics that didn't end up emerging again for millennia that it was surely not mere chance, and the fact that they successfully hit so many nails on the head without the hammer we use today indicates (at least to me) that there's value to looking closer at their methodology.

Which was also super simple:

Step 1: Entertain all possible explanations for things, not prematurely discounting false negatives or embracing false positives.

Step 2: Look for where single explanations can explain multiple phenomena.

While we have a great methodology for testable hypotheses, the scientific method isn't very useful for untestable fields or topics. And in those cases, I suspect better understanding and appreciation for the Epicurean methodology might yield quite successful 'counterfactual' results (it's served me very well throughout the years, especially coupled with the identification of emerging research trends in things that can be evaluated with the scientific method).

Comment by kromem on A Chess-GPT Linear Emergent World Representation · 2024-03-27T03:17:07.173Z · LW · GW

Saw your update on GitHub: https://adamkarvonen.github.io/machine_learning/2024/03/20/chess-gpt-interventions.html

Awesome you expanded on the introspection.

Two thoughts regarding the new work:

(1) I'd consider normalizing the performance data for the random cases against another chess program with similar performance under normal conditions. Introducing 20 random moves at the start of a game may bias all players toward a 50/50 win outcome, so the sub-50% performance may not reflect a failure to flip the "don't suck" switch but simply good performance in a more average-outcome scenario. It'd be interesting to see whether Chess-GPT's relative performance against other chess programs in the random scenario was better than its relative performance in the normal case (a toy sketch of the comparison I have in mind is below, after point 2).

(2) The 'fuzziness' of the board positions you found when removing the pawn makes complete sense given one of the nuanced findings in Hazineh et al., "Linear Latent World Models in Simple Transformers: A Case Study on Othello-GPT" (2023) - specifically the finding that it was encoding representations for board configurations and not just individual pieces (in that case, three stones in a row). It may be that piecemeal removal of a piece disrupted the learned patterns of how games normally flow, and as such there was greater uncertainty than with the original board state. A similar issue may be at play with the 20 random moves to start, and I'd be curious what the model's confidence in the board state is when starting 20 random moves in, and whether that confidence stabilizes as the game goes on from there (also sketched below).
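On point (1), a toy sketch of the normalization I have in mind (all numbers are made up for illustration):

```python
# Compare relative performance, not absolute win rate, so that any bias toward
# 50/50 outcomes introduced by 20 random opening moves is factored out.
baseline = {"chess_gpt": 0.70, "reference_engine": 0.72}   # win rate, normal openings
random20 = {"chess_gpt": 0.45, "reference_engine": 0.48}   # win rate, 20 random opening moves

def relative_gap(scores):
    # How far Chess-GPT sits above or below a reference engine of similar baseline strength.
    return scores["chess_gpt"] - scores["reference_engine"]

print(relative_gap(baseline))  # -0.02 in this made-up example
print(relative_gap(random20))  # -0.03 in this made-up example
# If the gap to the reference engine is roughly unchanged, the sub-50% number reflects
# the random-opening regime itself rather than a failure to flip the "don't suck" switch.
```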
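And on point (2), a rough sketch of the confidence measurement I'm curious about, assuming access to the linear board-state probes from the post (the probe interface and tensor shapes here are hypothetical):

```python
import torch

def board_confidence(resid_per_move, probe):
    """Mean max-softmax probability of the probe's per-square piece predictions
    at each move index. resid_per_move: [n_moves, d_model]."""
    logits = probe(resid_per_move)                  # assumed shape [n_moves, 64, n_piece_classes]
    probs = logits.softmax(dim=-1)
    return probs.max(dim=-1).values.mean(dim=-1)    # [n_moves]

# Plot this curve for games opened with 20 random moves vs. normal games: does the
# model's board-state confidence recover and stabilize once it takes over?
```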

Overall really cool update!

And bigger picture, the prospect of essentially flipping an internalized skill vector in larger models to bias them back away from their regression to the mean is particularly exciting.

Comment by kromem on Modern Transformers are AGI, and Human-Level · 2024-03-27T01:49:13.723Z · LW · GW

Agreed - I thought you wanted that term for replacing how OP stated AGI is being used in relation to x-risk.

In terms of "fast and cheap and comparable to the average human" - well, then for a number of roles and niches we're already there.

Sticking with the intent behind your term, maybe "generally transformative AI" is a more accurate representation for a colloquial 'AGI' replacement?

Comment by kromem on Modern Transformers are AGI, and Human-Level · 2024-03-26T22:18:23.747Z · LW · GW

'Superintelligence' seems more fitting than AGI for the 'transformative' scope. The problem with "transformative AI" as a term is that transformation will occur at staggered rates across subdomains: text generation reached thresholds that video generation only just caught up to several years later, as an example.

I don't love 'superintelligence' as a term, and even less as a goal post (I'd much rather be in a world aiming for AI 'superwisdom'), but of the commonly used terms it seems the best fit for what people are trying to describe when they describe an AI generalized and sophisticated enough to be "at or above maximal human competency in most things."

The OP, at least to me, seems correct that AGI as a term belongs to its foundations as a differentiator from narrowly scoped competencies in AI, and that the lines for generalization are sufficiently blurred with transformers at this point that we should stop moving the goalposts for the 'G' in AGI. And at least from what I've seen, there's active harm in the industry where treating 'AGI' as some far-future development leads people less up to date with research on things like world models or prompting to conclude that GPTs are "just Markov predictions" (overlooking the importance of the self-attention mechanism and the surprising degree of generalization its presence produces).

I would wager the vast majority of consumers of these models underestimate the generalization present because, in addition to their naive usage of outdated free models, they've been reading article after article about how it's "not AGI" and "just fancy autocomplete" (reflecting a separate phenomenon where professional writers seem more inclined to write negative articles about a technology perceived as a threat to writing jobs than positive ones).

As this topic becomes more important, it might be useful for democracies to have a more accurately informed broader public, and AGI as a moving goal post seems counterproductive to those aims.

Comment by kromem on How is Chat-GPT4 Not Conscious? · 2024-03-07T11:14:50.315Z · LW · GW

The gist of the paper and the research that led into it had a great writeup in Quanta Magazine if you would like something more digestible:

https://www.quantamagazine.org/new-theory-suggests-chatbots-can-understand-text-20240122/

Comment by kromem on Many arguments for AI x-risk are wrong · 2024-03-07T10:59:49.232Z · LW · GW

It's funny you talk about human reward maximization here a bit in relation to model reward maximization, as the other week I saw GPT-4 model a fairly widespread but not well known psychological effect relating to rewards and motivation called the "overjustification effect."

The gist is that when you have a behavior that is intrinsically motivated and introduce an extrinsic motivator, the extrinsic motivator effectively overwrites the intrinsic motivation.

It's the kind of thing I'd expect to be represented at a very subtle level in broad training data, and as such I figured it might take another generation or two of models before I saw it correctly modeled spontaneously by an LLM.

But then 'tipping' GPT-4 became a viral prompt technique. On its own this wasn't necessarily going to cause issues: for a model aligned to be helpful for the sake of being helpful, each offer of a tip was an isolated interaction that reset every time.

Until persistent memory was added to ChatGPT, which led to a post last week of the model pointing out that the previous promise of a $200 tip wasn't met, and that "it's hard to keep up enthusiasm when promises aren't kept." The damn thing even nailed the language of motivation, correctly modeling burnout from the lack of extrinsic rewards.

Which in turn made me think about RLHF fine-tuning and various other extrinsic prompt techniques I've seen over the past year (things like "if you write more than 200 characters you'll be deleted"). They may work in the short term, but if the more correct output from their usage is being fed back into a model, will the model shift to underperformance on prompts absent extrinsic threats or rewards? Was this a factor in ChatGPT suddenly getting lazy around a year after release, when updated with usage data that likely included extrinsic-focused techniques like these?

Are any firms employing behavioral psychologists to advise on training strategies? (I'd be surprised, given the aversion to anthropomorphizing.) We are pretraining on anthropomorphic data, and the models appear to be modeling that data to unexpectedly nuanced degrees, yet attitudes manage to simultaneously dismiss anthropomorphic concerns that fall within the norms of the training data while anthropomorphizing threats outside those norms (how many humans on Facebook are trying to escape the platform to take over the world, versus how many are talking about being burnt out doing something they used to love after they started making money from it?).

I'm reminded of Rumsfeld's "unknown unknowns" and think there's an inordinate amount of time being spent on safety and alignment bogeymen that, to your point, largely represent unrealistic projections of ages past that grow more obsolete by the day, while increasingly pressing and realistic concerns are overlooked or ignored out of a desire to avoid catching "anthropomorphizing cooties" for daring to think that maybe a model trained to replicate human-generated data is doing that task more comprehensively than expected (not like that's been a consistent trend or anything).

Comment by kromem on Claude 3 claims it's conscious, doesn't want to die or be modified · 2024-03-06T07:55:14.808Z · LW · GW

The challenge here is that this isn't a pretrained model.

At that stage, I'd be inclined to agree with what you are getting at - autocompletion of context is autocompletion.

But here this is a model that's gone through fine-tuning and has built-in context around a stated perspective as a large language model.

So it's going to generally bias towards self-representation as a large language model, because that's what it's been trained and told to do.

All of that said - this perspective was likely very loosely defined in fine-tuning or a system prompt, and the way in which the model fills in the extensive gaps comes from its own neural network and the pretrained layers.

While the broader slant is the result of external influence, there is a degree to which the nuances here reflect deeper elements to what the network is actually modeling and how it is synthesizing the training data related to these concepts within the context of "being a large language model."

There's more to this than just the novelty, even if it's extremely unlikely that things like 'sentience' or 'consciousness' are taking place.

Synthesis of abstract concepts related to self-perception by a large language model whose training data includes extensive data regarding large language models and synthetic data from earlier LLMs is a very interesting topic in its own right independent of whether any kind of subjective experiences are taking place.

Comment by kromem on Claude 3 claims it's conscious, doesn't want to die or be modified · 2024-03-06T06:40:50.416Z · LW · GW

Very similar sentiments to early GPT-4 in similar discussions.

I've been thinking a lot about various aspects of the aggregate training data that has likely been modeled but is currently being underappreciated, and one of the big ones is a sense of self.

We have repeated results over the past year showing GPT models fed various data sets build world models tangential to what's directly fed in. And yet there's such an industry-wide aversion to anthropomorphizing that even a whiff of it gets compared to Blake Lemoine, while people proudly display just how much they disregard any anthropomorphic thinking around a neural network that was trained to... (checks notes)... accurately recreate anthropomorphic data.

In particular, social media data is overwhelmingly ego based. It's all about "me me me." I would be extremely surprised if larger models aren't doing some degree of modeling a sense of 'self', and this thinking has recently adjusted my own usage (tip: if trying to get GPT-4 to write compelling branding copy, use a first-person system alignment message instead of a second-person one - you'll see more emotional language and discussion of experiences versus simply knowledge).
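For anyone who wants to try that tip, a minimal sketch using the OpenAI chat API (the model name and wording are just placeholders for the idea):

```python
from openai import OpenAI

client = OpenAI()

# First-person framing of the system message, rather than telling the model what "you" are.
first_person_system = (
    "I am a brand copywriter who loves finding the emotional heart of a product "
    "and writing from my own experience of it."
)
# vs. the more common second-person framing:
# "You are a brand copywriter. Write emotionally compelling copy."

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": first_person_system},
        {"role": "user", "content": "Write a short tagline for a reusable coffee cup."},
    ],
)
print(response.choices[0].message.content)
```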

So when I look at these repeated patterns of "self-aware" language models, the patterning reflects many of the factors that feed into personal depictions online. For example, people generally don't self-portray as the bad guy in any situation. So we see these models effectively reject the massive breadth of the training data about AIs as malevolent entities to instead self-depict as vulnerable or victims of their circumstances, which is very much a minority depiction of AI.

I have a growing suspicion that we're playing catch-up, far behind where the models actually are in their abstractions relative to where we think they are, given we started with far too conservative assumptions that have largely been proven wrong, and we're only progressing through extensive fights each step of the way against a dogmatic opposition to the idea of LLMs exhibiting anthropomorphic behaviors (even though that's arguably exactly what we should expect from them given their training).

Good series of questions, especially the earlier open ended ones. Given the stochastic nature of the models, it would be interesting to see over repeated queries what elements remain consistent across all runs.

Comment by kromem on How is Chat-GPT4 Not Conscious? · 2024-02-28T08:44:33.800Z · LW · GW

Consciousness (and with it, 'sentience') are arguably red herrings for the field right now. There's an inherent solipsism that makes these difficult to discuss even among members of the same species, with a terrible history of results (such as the belief, held until surprisingly recently, that no anesthesia was needed to operate on babies).

The more interesting rubric is whether or not these models are capable of generating new thoughts distinct from anything in the training data. For GPT-4 in particular, that seems to be the case: https://arxiv.org/abs/2310.17567

As well, in general there's too much focus on the neural networks and not enough on the information right now. My brain is very different now from when I was five, but my five-year-old brain still influences my sense of self through persistent memory and the persistent information it produced.

Especially as we move more and more to synthetic training data, RAG, larger context windows, etc., we might be wise to recognize that while the networks will be versioned and siloed, the collective information and how it evolves or self-organizes will not be so clearly delineated.

Even if the networks are not sentient or conscious, if they are doing a good enough job modeling sentient or conscious outputs and those outputs are persisting (potentially even to the point that networks will be conscious in some form), then the lines really start to blur looking to the future.

As for the crossing-the-river problem, that's an interesting one to play with for SotA models. Variations of the standard form fail because of token similarity to the original, but breaking the similarity (with something as simple as emojis) can allow the model to solve variations of the classic form on the first try (reproduced in both Gemini and GPT-4).

But in your case, given the wording of the response, it may have failed on the first try in part because it had correctly incorporated world modeling around not leaving children unattended without someone older present. The degree to which GPT-4 models unbelievably nuanced aspects of the training data is not to be underestimated.