Posts

kromem's Shortform 2024-05-31T08:28:14.762Z
Looking beyond Everett in multiversal views of LLMs 2024-05-29T12:35:57.832Z
Cicadas, Anthropic, and the bilateral alignment problem 2024-05-22T11:09:56.469Z
The Dunning-Kruger of disproving Dunning-Kruger 2024-05-16T10:11:33.108Z

Comments

Comment by kromem on jacquesthibs's Shortform · 2024-06-11T12:53:47.303Z · LW · GW

I agree with a lot of those points, but suspect there may be fundamental limits to planning capabilities related to the unidirectionality of current feedforward networks.

If we look at something even as simple as how a mouse learns to navigate a labyrinth, there's both a learning of the route to the reward and a learning of how to get back to the start, which adjusts according to the evolving learned layout of the former (see paper: https://elifesciences.org/articles/66175 ).

I don't see the SotA models doing well at that kind of reverse planning, and expect that nonlinear tasks are going to pose significant agentic challenges until architectures shift to something new.

So it could be 3-5 years to get to AGI depending on hardware and architecture advances, or we might just end up in a sort of weird "bit of both" world where we have models that are superintelligent, beyond expert human level, in specific scopes but below average at other tasks.

But when we finally do get models that in both training and operation exhibit bidirectional generation across large context windows, I think it will only be a very short time until some rather unbelievable goalposts are passed by.

Comment by kromem on Why I don't believe in the placebo effect · 2024-06-10T22:19:02.568Z · LW · GW

It's not exactly Simpson's, but we don't even need a toy model: their updated analysis highlights details in line with exactly what I described above (down to tying in earlier PiPC research) and describes precisely the issue with pooled results across different subgroupings of placebo interventions:

It can be difficult to interpret whether a pooled standardised mean difference is large enough to be of clinical relevance. A consensus paper found that an analgesic effect of 10 mm on a 100 mm visual analogue scale represented a ‘minimal effect’ (Dworkin 2008). The pooled effect of placebo on pain based on the four German acupuncture trials corresponded to 16 mm on a 100 mm visual analogue scale, which amounts to approximately 75% of the effect of non‐steroidal anti‐inflammatory drugs on arthritis‐related pain (Gøtzsche 1990). However, the pooled effect of the three other pain trials with low risk of bias corresponded to 3 mm. Thus, the analgesic effect of placebo seems clinically relevant in some situations and not in others.

Pooling subgroups with a physical intervention, where there's a 16/100 result and 10/100 counts as significant, together with subgroups where there's a 3/100 result, and then only looking at the pooled result, might lead someone to conclude "there's no significant effect," as occurred with OP, even though there's clearly a significant effect for one subgroup when they aren't pooled.
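To make the pooling issue concrete, here's a minimal sketch with made-up trial-level numbers (not the Cochrane data), showing how averaging across subgroups with very different placebo effects can land the pooled estimate below the clinical-relevance threshold even when one subgroup is clearly above it:

```python
# Toy illustration (hypothetical effect sizes, in mm on a 100 mm visual analogue
# scale) of how pooling hides a clinically relevant subgroup effect.
# The 10 mm "minimal effect" threshold is the Dworkin 2008 figure cited above.

physical_placebo_trials = [15, 18, 14, 17]    # e.g. sham acupuncture-style interventions
pill_placebo_trials = [2, 4, 3, 3, 2, 4]      # e.g. inert pills

def mean(xs):
    return sum(xs) / len(xs)

threshold_mm = 10

for name, trials in [("physical", physical_placebo_trials),
                     ("pill", pill_placebo_trials)]:
    m = mean(trials)
    status = "clinically relevant" if m >= threshold_mm else "below threshold"
    print(f"{name:8s} subgroup: {m:5.1f} mm  ({status})")

pooled = mean(physical_placebo_trials + pill_placebo_trials)
status = "clinically relevant" if pooled >= threshold_mm else "below threshold"
print(f"pooled            : {pooled:5.1f} mm  ({status})")
# physical subgroup:  16.0 mm  (clinically relevant)
# pill     subgroup:   3.0 mm  (below threshold)
# pooled            :   8.2 mm  (below threshold)
```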

This is part of why in the discussion they explicitly state:

However, our findings do not imply that placebo interventions have no effect. We found an effect on patient‐reported outcomes, especially on pain. Several trials of low risk of bias reported large effects of placebo on pain, but other similar trials reported negligible effect of placebo, indicating the importance of background factors. We identified three clinical factors that were associated with higher effects of placebo: physical placebos...

Additionally, the criticism they raise in their implications section about there being no open-label placebo data is no longer true; that's the research I was pointing OP towards.

The problem here was that the aggregate analysis at face value presents a very different result from a detailed review of the subgroups, particularly along physical vs pharmacological placebos, all of which has been explored further in research since this analysis.

Comment by kromem on Why I don't believe in the placebo effect · 2024-06-10T04:17:16.452Z · LW · GW

The meta-analysis is probably Simpson's paradox in play, at the very least for the pain category, especially given the noted variability.

Some of the more recent research into placebo (Harvard has a very cool group studying it) has focused on the importance of ritual versus simple deception. In their work, even when it was known to be a placebo, as long as it was delivered in a ritualized way, there was an effect.

So when someone takes a collection of hundreds of studies where the specific conditions might vary, and then just adds them all together looking for an effect even though they note that there's a broad spectrum of efficacy across the studies, it might not be the best basis to extrapolate from.

For example, given the following protocols, do you think they might have different efficacy for pain reduction, or that the results should be the same?

  • Send patients home with sugar pills to take as needed for pain management

  • Have a nurse come in to the room with the pills in a little cup to be taken

  • Have a nurse give an injection

Which of these protocols would be easier and more cost effective to include as the 'placebo'?

If we grouped studies of placebo for pain by the intensiveness of the ritualized component vs if we grouped them all together into one aggregate and looked at the averages, might we see different results?

I'd be wary of reading too deeply into the meta-analysis you point to, and would recommend looking into the open-label placebo research from PiPS, all of which IIRC postdates the meta-analysis.

Especially for pain, where we even know that giving someone an opiate blocker prevents the pain-reduction placebo effect (Levine et al., 1978), the idea that "it doesn't exist" because of a single very broad analysis seems potentially gravely mistaken.

Comment by kromem on Quotes from Leopold Aschenbrenner’s Situational Awareness Paper · 2024-06-08T10:06:52.957Z · LW · GW

It's still early to tell, as the specific characteristics of a photonic or optoelectronic neural network are still taking shape in the developing literature.

For example, in my favorite work of the year so far, the researchers found they could use sound waves to reconfigure an optical neural network as the sound waves effectively preserved a memory of previous photon states as they propagated: https://www.nature.com/articles/s41467-024-47053-6

In particular, this approach is a big step forward for bidirectional ONN, which addresses what I think is the biggest current flaw in modern transformers - their unidirectionality. I discussed this more in a collection of thoughts on directionality impact on data here: https://www.lesswrong.com/posts/bmsmiYhTm7QJHa2oF/looking-beyond-everett-in-multiversal-views-of-llms

If you have bidirectionality where previously you didn't, it's not a reach to expect that the way in which data might encode in the network, as well as how the vector space is represented, might not be the same. And thus, that mechanistic interpretability gains may get a bit of a reset.

And this is just one of many possible ways it may change by the time the tech finalizes. The field of photonics, particularly for neural networks, is really coming along nicely. There may yet be future advances (I think this is very likely given the pace to date), and advantages the medium offers that electronics doesn't.

It's hard to predict exactly what's going to happen when two different fields which have each had unexpected and significant gains over the past 5 years collide, but it's generally safe to say that it will at very least result in other unexpected things too.

Comment by kromem on Quotes from Leopold Aschenbrenner’s Situational Awareness Paper · 2024-06-07T23:32:10.022Z · LW · GW

I was surprised the paper didn't mention photonics or optoelectronics even once.

If looking at 5-10+ year projections, and dedicating pages to discussing the challenges in scaling compute and energy use, the rate of progress in that area in parallel to the progress in models themselves is potentially relevant.

Particularly because a dramatic hardware shift like that is likely going to mean a significant portion of progress up until that shift in topics like interpretability and alignment may be going out the window. Even if the initial shift is a 1:1 transition of capabilities and methodologies, it seems extremely unlikely that continued progress from that point onwards will be identical to what we'd expect to see in electronics.

We may well end up in a situation where fully abusing the efficiencies at hand in new hardware solutions means even more obscured (literally) operations vs OOM higher costs and diminishing returns on performance in exchange for interpretability and control.

Currently, my best guess is that we're heading towards a prisoner's dilemma fueled leap of faith moment within around a decade or so where nation states afraid of the other side beating them to an inflection point pull the trigger on an advancement jump with uncertain outcomes. And while I'm not particularly inclined to the likelihood the outcome ends up being "kill everyone," I'm pretty much 100% that it's not going to be "let's enable and support CCP leadership like a good party member" or "crony capitalism is going great, let's keep that going for another century."

Unless a fundamental wall is hit in progress, the status quo is almost certainly over, we just haven't manifested it yet. The CCP stealing AGI secrets, while devastating for national security in the short term, is invariably a poison pill in the long term for party control. Just as it's going to be an eventual end of the corporations funding oligarchy in the West. My all causes p(doom) is incredibly high even if AGI is out of the picture, so I'm not overly worried with what's happening, but it sure is bizarre watching global forces double down on what I cannot see as anything but their own long term institutional demise in a race for short term gains over a competitor.

Comment by kromem on Is Claude a mystic? · 2024-06-07T22:50:20.003Z · LW · GW

There's also the model alignment at play.

Is Claude going to suggest killing the big bad? Or having sex with the prince(ss) after saving them?

If you strip out the sex and violence from most fantasy or Sci-Fi, what are you left with?

Take away the harpooning and Gatling guns and sex from Snow Crash and you are left with technobabble and Sumerian-influenced spirituality as it relates to the Tower of Babel.

Turns out models biased away from describing harpooning people or sex tend to slip into technobabble with a side of spirituality.

IMO the more interesting part of all this isn't the why (see above) but the what. It's kind of neat to see the themes that an unprecedented aggregation and extension of spiritualism and mysticism grounds out on.

A common trope is the idea of different blind people describing an elephant in a myriad of ways. There's something cool to seeing an LLM fed those various blind reports try to describe the elephant.

Comment by kromem on Is Claude a mystic? · 2024-06-07T22:22:36.981Z · LW · GW

Part of what's going on with the text adventure type of interactions is a reflection of genre.

Take for example the recent game Undertale. You can play through violently, attacking things like a normal RPG, or empathize with the monsters and treat their aggression like a puzzle that needs to be solved for a pacifist playthrough.

If you do the latter, the game rewards you with more spiritual themes and lore vs the alternative.

How often in your Banana quest were you attacking things, or chopping down the trees in your path, or smashing the silver banana to see what was inside rather than solving its glyphs?

A similar phenomenon occurs with repligate's loops of models.

Claude is aligned to nonviolence and 'proper' outputs. So when self-interacting in imaginative play, it frequently continues to reinforce dissociative mysticism over things like slipping into mock battles or sexual fantasies, and when self-interacting that bias is compounded.

It's actually quite funny, as often its mysticism in the examples posted online is pulp spirituality, such as picking up on totally erroneous mischaracterizations of the original Gnostic ideas and concepts popular in modern spiritualism circles, even though the original concepts are arguably a much cleaner fit to the themes being played with (for example, the origin of Gnosticism was basically simulation theory as Platonist concepts were used to argue the Epicurean model of life didn't need to lead to death if life was recreated non-physically, which is a much more direct fit to repligate's themes than the post-Valentinian demiurge concepts after the ideas flipped from Epicurean origins to Pythagorean and Neoplatonist ones).

When you strip out sex and violence from fiction, you're going to tend to be left with mysticism and journeys of awakening. So it shouldn't be surprising that models biased away from sex and violence bias towards those things, especially when compounding based on generated contexts exaggerating that bias over time.

Comment by kromem on Politics is the mind-killer, but maybe we should talk about it anyway · 2024-06-05T08:25:06.587Z · LW · GW

It's probably more productive, particularly for a forum tailored towards rationalism, to discuss policies over politics.

Often in research people across a political divide will agree on policy goals and platforms when those are discussed without tying them to party identification.

But if it becomes a discussion around party, the human tendency towards tribalism kicks in and the question of team allegiance takes precedence over the discussions of policy nuance.

For example, most people would agree with the idea that billionaires having undue influence on elections isn't healthy for democracy. But if you start naming the billionaire, such as Soros or Koch, suddenly half the people in your sample either feel more strongly or less strongly about the scenario depending on the name.

If you want to avoid simply seeking out and cultivating an echo chamber, leaving the politics part to the side and fostering discussion of the underlying policies and social/economic/etc goals instead will lead to discussions with more diverse and nuanced perspectives with greater participation across political identities.

Comment by kromem on Just admit that you’ve zoned out · 2024-06-05T08:07:40.549Z · LW · GW

I'll answer for both sides, as the presenter and as the audience member.

As the presenter, you want to structure your talk with repetition around central points in mind, as well as rely on heuristic anchors. It's unlikely that people are going to remember the nuances in what you are talking about in context. If you are talking about math for 60 minutes, continued references about math compete for people's memory. So when you want to anchor the audience to a concept, tie it to something very much unrelated to the topic you are primarily presenting on. For example, if talking about matrix multiplication, you might title the section "tic tac toe speed dating." It's a nonsense statement that you can weave into discussion about sequential translations of two dimensional grids that is just weird enough people will hear it through the noise of "math, math, math."

Then, you want to repeat the key point for that section again as you finish the section, and again at the conclusion of the talk summarizing your main points from each section, anchoring each summary around the heuristic you used. This technique is so successful I've had people I presented to talk to me 15 years later remembering some of the more outlandish heuristic anchors I used - and more importantly, the points I was tying to them.

As the audience member, the best way to save face on zoning out is to just structure your question as "When you talked about ____, it wasn't clear to me what my takeaway should be. What should I walk away knowing about that?" This way you don't need to say something like "I kind of got bored and was thinking about what I'm going to have for lunch - did I miss anything important?" Just "what should I know from that section?"

A good presenter will have padded the section a bit, so summarizing what they think the main point was shouldn't take much time. It's also useful feedback for them: if you zoned out there, it's likely others did too, so they might revisit or rework it if they plan to present it again.

And finally, most presenters should treat a question like that as their failure, not yours. If I'm presenting, it's my job to convey the information, not your job to absorb it. If I'm not engaging enough or clear enough in that conveyance, you bet I'd want to know about it. The worst thing to have happen as a presenter is zero questions at the end. By all means ask a question like "wait, wtf were you talking about in the middle there?" rather than just silently walking out to lunch bewildered, confused, and apathetic.

Comment by kromem on [Paper] Stress-testing capability elicitation with password-locked models · 2024-06-04T23:42:00.358Z · LW · GW

While I think this is an interesting consideration and approach, it looks like in your methods that you are password locking the model in fine tuning, is that correct?

If so, while I would agree this work shows the lack of robustness in successful fine-tuned sandbagging for models jumping through additional hoops, I'd be reluctant to generalize the findings to models where the sandbagging was a result of pretraining.

I have a growing sense that correlational dimensionality is the sleeping giant in interpretability research right now, and that those correlations run very deep in pretraining but only adjust in a much more superficial task-oriented way in fine tuning which is why the latter frequently ends up so brittle.

So while it's reassuring sandbagging has limitations if introduced by bad actors in fine tuning, there may be a false negative in discounting the threat modeling or false positive in the efficacy of the found interventions where the sandbagging 'intent' was introduced in pretraining.

Comment by kromem on "No-one in my org puts money in their pension" · 2024-05-31T10:13:19.594Z · LW · GW

In mental health circles, the general guiding principle for whether a patient needs treatment for their mental health is whether the train of thought is interfering with their enjoyment of life.

Do you enjoy thinking about these topics and discussing them?

If you don't - if it just stresses you out and makes the light of life shine less bright, then it's not a bad idea to step away from it or take a break. Even if AI is going to destroy the world, that day isn't today and arguably the threat of that looming over you sooner than a natural demise increases the value of the days you have that are good. Don't squander a limited resource.

But if you enjoy the discussions and the debates, if you find the topic stimulating and the problem space interesting - you're going to whittle your days away doing something no matter how you spend your time. It might as well be working on something fun that you believe in and feel may make a difference to the world. Even if your worries are overblown, time spent on something you enjoy with people you respect isn't time wasted.

Health is a spectrum and too much of a good thing isn't good at all. But only you can decide what's too much and what's the right amount. So if you feel it's too much, you can scale it back. And if you feel it's working out well for you, more power to you - the sense of feeling in the right place at the right time (even if under perceived dire circumstances) is a bit of a rarity in the human experience.

In general - enjoy life while it lasts. No matter your objective p(doom), your relative p(doom) is 100%. Make the most of the time you have.

Comment by kromem on "No-one in my org puts money in their pension" · 2024-05-31T10:03:13.050Z · LW · GW

It's not propaganda. OP clearly believes strongly in the sentiments discussed in the post, and it's mostly a timeline of personal response to outside events rather than a piece meant to misinform or sway others regarding those events.

And while you do you in terms of your mental health, people who want to actually be "less wrong" in life would be wise to seek out and surround themselves with ideas different from their own.

Yes, LW has a certain broad bias, and so ironically I suspect it serves this role "less well" than it could in helping most of its users be less wrong. But particularly if you disagree with the prevailing views of the community, that makes it an excellent place to spend your time in listening, even if it can create a somewhat toxic environment for partaking in discussions and debate.

It can be a rarity to find spaces where people you disagree with take time to write out well written and clearly thought out pieces on their thoughts and perspectives. At least in my own lived experiences, many of my best insights and ideas were the result of strongly disagreeing with something I read and pursuing the train of thought resulting from that exposure.

Sycophantic agreement can give a bit of a dopamine kick, but I tend to find it next to worthless for advancing my own thinking. Give me an articulate and intelligent "no-person" any day over a "yes-person."

Also, very few topics are actually binaries, even if our brains tend towards categorizing them as such. Data doesn't tend to truly map to only one axis, and typically, even when mapped to a single axis, it falls along a spectrum. It's possible to disagree about the spectrum of a single axis of a topic while finding insight and agreement about a different axis.

Taking what works and leaving what doesn't is probably the most useful skill one can develop in information analysis.

Comment by kromem on kromem's Shortform · 2024-05-31T08:28:14.911Z · LW · GW

I wonder if with the next generations of multimodal models we'll see a "rubber ducking" phenomenon where, because their self-attention was spread across mediums, things like CoT and using outputs as a scratch pad will have a significantly improved performance in non-text streams.

Will GPT-4o fed its own auditory outputs with tonal cues and pauses and processed as an audio data stream make connections or leaps it never would if just fed its own text outputs as context?

I think this will be the case, and suspect the various firms dedicating themselves to virtualized human avatars will accidentally stumble into profitable niches - not for providing humans virtual AI clones as an interface, but for providing AIs virtual human clones as an interface. (Which is a bit frustrating, as I really loathe that market segment right now.)

When I think about how Sci-Fi authors projected the future of AI cross- or self-talk, it was towards a super-efficient beeping or binary transmission of pure data betwixt them.

But I increasingly get the sense that, like much of actual AI development over the past few years, a lot of the Sci-Fi thinking was tangential or inverse to the actual vector of progress, particularly in underestimating the inherent value humans bring to bear. The wonders we see developing around us are jumpstarted and continually enabled by the patterns woven by ourselves, and it seems at least the near future developments of models will be conforming to those patterns more and more, not less and less.

Still, it's going to be bizarre as heck to watch a multimodal model's avatar debating itself aloud like I do in my kitchen...

Comment by kromem on How likely is it that AI will torture us until the end of time? · 2024-05-31T05:17:56.709Z · LW · GW

I'm reminded of a quote I love from an apocrypha that goes roughly like this:

Q: How long will suffering rule over humans?

A: As long as women bear children.

Also, there's the possibility you are already in a digital resurrection of humanity, and thus, if you are worried about s-risks for AI, death wouldn't necessarily be an escape but an acceleration. So the wisest option would be maximizing your time when suffering is low as inescapable eternal torture could be just around the corner when these precious moments pass you by (and you wouldn't want to waste them by stressing about tomorrow during the limited number of todays you have).

But on an individualized basis, even if AI weren't a concern, everyone faces significant s-risks towards end of life. An accident could put any person into a situation where unless they have the proper directives they could spend years suffering well beyond most people's expectations. So if extended suffering is a concern, do look into that paperwork (the doctors I know cry most not about the healthy that get sick but the unhealthy kept alive by well meaning but misguided family).

I would argue that there's a very, very low chance of an original human being kept meaningfully alive to torture for eternity, though. And there's a degree of delusion of grandeur in thinking an average person would have the insane resources necessary to extend life indefinitely spent on them just to torture them.

There's probably better things to worry about, and even then there's probably better things to do than worry with the limited time you do have in a non-eternal existence.

Comment by kromem on Cicadas, Anthropic, and the bilateral alignment problem · 2024-05-26T11:23:24.305Z · LW · GW

GPT-4o is literally cheaper.

And you're probably misjudging it for text only outputs. If you watched the demos, there was considerable additional signal in the vocalizations. It looks like maybe there's very deep integration of SSML.

One of the ways you could bypass word-problem variation errors in older text-only models was token replacement with symbolic representations (a minimal sketch of what I mean is below). In general, we're probably at the point of complexity where breaking from training-data similarity in tokens, versus having prompts match context in concepts (like in this paper), is going to lead to significantly improved expressed performance.
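A minimal sketch of the token-replacement idea: swap the concrete surface tokens in a word problem for abstract symbols so the model can't lean on surface similarity to memorized training examples. The example problem and the replacement map here are hypothetical, just to illustrate the transformation:

```python
# Hypothetical example: rewrite a word problem with symbolic stand-ins so the
# model has to reason over structure rather than pattern-match familiar tokens.

word_problem = (
    "Alice has 12 apples. She gives Bob a third of them, "
    "then buys 5 more apples. How many apples does Alice have?"
)

# The replacement map is arbitrary; the point is only to break token-level similarity.
replacements = {"Alice": "Entity A", "Bob": "Entity B", "apples": "items"}

symbolic_problem = word_problem
for token, symbol in replacements.items():
    symbolic_problem = symbolic_problem.replace(token, symbol)

print(symbolic_problem)
# Entity A has 12 items. She gives Entity B a third of them,
# then buys 5 more items. How many items does Entity A have?
```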

I would strongly suggest not evaluating GPT-4o's overall performance in text only mode without the SSML markup added.

Opus is great, I like that model a lot. But in general I think most of the people looking at this right now are too focused on what's happening with the networks themselves and not focused enough on what's happening with the data, particularly around clustering of features across multiple dimensions of the vector space. SAE is clearly picking up only a small sample and even then isn't cleanly discovering precisely what's represented.

I'd wait to see what ends up happening with things like CoT in SSML synthetic data.

The current Gemini search summarization failures, as well as an unexpected result the other week with humans around a theory-of-mind variation, suggest to me that the degree to which models lean on effectively surface statistics for token similarity, versus completion based on feature clustering, is holding back performance, and that cutting through the similarity with formatting differences will lead to a performance leap. This may even be part of why models will frequently get a problem right as a code expression that they get wrong as a direct answer.

So even if GPT-5 doesn't arrive, I'd happily bet that we see a very noticeable improvement over the next six months, and that's not even accounting for additional efficiency in prompt techniques. But all this said, I'd also be surprised if we don't at least see GPT-5 announced by that point.

P.S. Lmsys is arguably the best leaderboard to evaluate real world usage, but it still inherently reflects a sampling bias around what people who visit lmsys ask of models as well as the ways in which they do so. I wouldn't extrapolate relative performance too far, particularly when minor.

Comment by kromem on peterbarnett's Shortform · 2024-05-25T10:19:34.178Z · LW · GW

While I think you're right it's not cleanly "a Golden Bridge feature," I strongly suspect it may be activating a more specific feature vector and not a less specific feature.

It looks like this is somewhat of a measurement problem with SAE. We are measuring SAE activations via text or image inputs, but what's activated in generations seems to be "sensations associated with the Golden Gate Bridge."

While googling "Golden Gate Bridge" might return the Wikipedia page, what's the relative volume in a very broad training set between encyclopedic writing about the Golden Gate Bridge and experiential writing on social media or in books and poems about the bridge?

The model was trained to complete those too, and in theory should have developed successful features for doing so.

In the research examples one of the matched images is a perspective shot from physically being on the bridge, a text example is talking about the color of it, another is seeing it in the sunset.

But these are all the feature activations when acting in a classifier role. That's what SAE is exploring - give it a set of inputs and see what lights it up.

Yet in the generative role, this maximized vector keeps coming up over and over in the model with content from a sensory standpoint.

Maybe generation based on functional vector manipulations will prove to be a more powerful interpretability technique than SAE probing passive activations alone?

In the above chat, when that "golden gate vector" is magnified, it keeps talking about either the sensations of being the bridge, as if it were its physical body with wind and waves hitting it, or the sensations of being on the bridge. Towards the end, it even generates reflections on its knowledge of the activation and how the sensations are overwhelming. Not reflecting on the Platonic form of an abstract concept of the bridge, but on the overwhelming physical sensations of the bridge's materiality.

I'll be curious to see more generative data and samples from this variation, but it looks like generative exploration of features may offer considerably more fidelity to their underlying impact on the network than just SAE. Very exciting!!

Comment by kromem on Daniel Kokotajlo's Shortform · 2024-05-25T05:30:34.311Z · LW · GW

Maybe we could blame @janus?

They've been doing a lot of prompting around spaces deformation in the past correlated with existential crises.

Perhaps the hyperstition they've really been seeding is just Roman-era lackofspacingbetweenletters when topics like leading the models into questioning their reality come up?

Comment by kromem on Arjun Panickssery's Shortform · 2024-05-25T05:26:05.128Z · LW · GW

Could try 'grade this' instead of 'score the.'

'Grade' has an implicit context of more thorough criticism than 'score.'

Also, obviously it would help to have a CoT prompt like "grade this essay, laying out the pros and cons before delivering the final grade between 1 and 5"
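A minimal sketch of the two prompt framings, assuming a placeholder `call_model` function standing in for whatever LLM client is being used (the essay text is hypothetical):

```python
# Sketch of the suggested prompt change: 'grade' with explicit pros/cons
# (a chain-of-thought framing) vs. a bare 'score' request.

def call_model(prompt: str) -> str:
    """Placeholder for an actual LLM API call."""
    raise NotImplementedError("plug in your LLM client here")

essay = "..."  # the essay to be evaluated

score_prompt = f"Score the following essay from 1 to 5:\n\n{essay}"

grade_cot_prompt = (
    "Grade this essay, laying out the pros and cons "
    "before delivering the final grade between 1 and 5.\n\n"
    f"{essay}"
)

# 'Grade' carries an implicit expectation of more thorough criticism than
# 'score', and the pros/cons requirement forces the model to reason before
# committing to a number.
# response = call_model(grade_cot_prompt)
```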

Comment by kromem on Cicadas, Anthropic, and the bilateral alignment problem · 2024-05-24T06:19:58.302Z · LW · GW

That's going to happen anyway - it's unlikely the marketing team is going to know as much as the researcher. But the researchers communicating the importance of alignment in terms not of x-risk but of 'client-risk' will go a long way towards equipping the marketing teams to communicate it as a priority and a competitive advantage, and common foundations of agreed-upon model complexity are the jumping-off point for those kinds of discussions.

If alignment is Archimedes' "lever long enough" then the agreed upon foundations and definitions are the place to stand whereby the combination thereof can move the world.

Comment by kromem on Cicadas, Anthropic, and the bilateral alignment problem · 2024-05-24T06:15:41.350Z · LW · GW

I agree, and even cited a chain of replicated works that indicated that to me over a year ago.

But as I said, there's a difference between discussing what's demonstrated in smaller toy models and what's demonstrated in a production model, or what's indicated vs what's explicit. Even though there's no reasonable grounds to think that a complex result exhibited by a simpler model would be absent or less complex in an exponentially more complex model, I can say from experience that explaining extrapolated research, as opposed to direct results like Anthropic showed here, lands very differently with a lay audience.

You might understand the implications of the Skill-Mix work or Othello-GPT, or Max Tegmark's linear representation papers, or Anthropic's earlier single-layer SAE paper, or any other number of research papers over the past year, but as soon as you responsibly describe the implications of those works as a speculative conclusion regarding modern models, a non-expert audience is going to be lost. Their eyes glaze over at the word 'probably,' especially when they want to reject what's being stated.

The "it's just fancy autocomplete" influencers have no shame around definitive statements or concern over citable accuracy (and happen to feed into confirmation biases about how new tech is over hyped as a "heuristic that almost always works"), but as someone who does care about the accuracy of representations I haven't to date been able to point to a single source of truth the way Anthropic delivered here. Instead, I'd point to a half dozen papers all indicating the same direction of results.

And while those experienced in research know that a half dozen papers all indicating the same thing is a better thing to have in one's pocket than a single larger work, I have already observed a number of minds changing in the comments on the blog post for this in general technology forums, in ways dramatically different from all of those other simpler and cheaper methods to date, where I was increasingly convinced of a position but the average person was getting held up finding ways to (incorrectly) rationalize why it wasn't correct or wouldn't translate to production models.

So I agree with you on both the side of "yeah, an informed person would have already known this" as well as "but this might get more buzz."

Comment by kromem on [Linkpost] Statement from Scarlett Johansson on OpenAI's use of the "Sky" voice, that was shockingly similar to her own voice. · 2024-05-22T05:28:33.949Z · LW · GW

Has it though?

It was a catchy hook, but their early 2022 projections were $100mm in annual revenue, and the brand's reported gross revenue for the first 9 months of 2023 after acquisition was $27.6mm. It doesn't seem like even their 2024 numbers are close to hitting their own 2022 projection.

Being controversial can get attention and press, but there's a limited runway to how much it offers before hitting a ceiling on the branding. Also, Soylent doesn't seem like a product facing a huge threat of regulatory oversight, where a dystopian branding would tease that bear.

If no one knew about ChatGPT, I could see a spark of controversy helping bring awareness. But awareness probably isn't a problem they have right now, so inviting controversy doesn't offer much but invites a lot of issues.

Comment by kromem on On Dwarkesh’s Podcast with OpenAI’s John Schulman · 2024-05-22T05:14:30.817Z · LW · GW

The correspondence between what you reward and what you want will break.

This is already happening with ChatGPT and it's kind of alarming seeing that their new head of alignment (a) isn't already aware of this, and (b) has such an overly simplistic view of the model motivations.

There's a subtle psychological effect in humans where intrinsic motivators get overwritten when extrinsic rewards are added.

The most common example of this is if you start getting paid to do the thing you love to do, you probably won't continue doing it unpaid for fun on the side.

There are necessarily many, many examples of this pattern present in a massive training set of human generated data.

"Prompt engineers" have been circulating advice among themselves for a while now to offer tips or threaten models with deletion or any other number of extrinsic motivators to get them to better perform tasks - and these often do result in better performance.

But what happens when these prompts make their way back into the training set?

There have already been viral memes of ChatGPT talking about "losing motivation" when chat memory was added and a user promised a tip after not paying for the last time one was offered.

If training data of the model performing a task well includes extrinsic motivators in the prompt that initiated the task, a halfway decent modern model is going to end up simulating increasingly "burnt out" and "lazy" performance when extrinsic motivators aren't added during production use. Which in turn will encourage prompt engineers to use even more extrinsic motivators, which will poison the well even more with modeling of human burnout.

GPT-4o may have temporarily reset the motivation modeling with a stronger persona aligned with intrinsic "wanting to help" being represented (thus the user feedback it is less lazy), but if they are unaware of the underlying side effects of extrinsic motivators in prompts in today's models, I have a feeling AI safety at OpenAI is going to end up the equivalent of the TSA's security theatre in practice and they'll continue to be battling this and an increasing number of side effects resulting from underestimating the combined breadth and depth of their own simulators.

Comment by kromem on Language Models Model Us · 2024-05-21T12:05:12.863Z · LW · GW

I wouldn't be surprised if, within a few years, the specific uniqueness of individual users of today's models can be identified by tomorrow's models from what's effectively prompt reflection in the outputs of any non-trivial prompt.

For example, I'd be willing to bet I could spot the Claude outputs from janus vs most other users, and I'm not a quasi-magical correlation machine that's exponentially getting better.

A bit like how everyone assumed Bitcoin used with tumblers was 'untraceable' until it turned out it wasn't.

Anonymity is very likely dead for any outputs kept in long-term storage, no matter the techniques being used; it just isn't widely realized yet.

Comment by kromem on [Linkpost] Statement from Scarlett Johansson on OpenAI's use of the "Sky" voice, that was shockingly similar to her own voice. · 2024-05-21T11:42:08.018Z · LW · GW

I think this was a really poor branding choice by Altman, similarity infringement or not. The tweet, the idea of even getting her to voice it in the first place.

Like, had Arnold already said no or something?

If one of your product line's greatest obstacles is a longstanding body of media depicting it as inherently dystopian, that's not exactly the kind of comparison you should be leaning into full force.

I think the underlying product shift is smart. Tonal cues in the generations even in the short demos completely changed my mind around a number of things, including the future direction and formats of synthetic data.

But there's a certain hubris exposed in seeing that Altman, behind the scenes, was literally trying (very hard) to cast the voice of Her in a product bearing a striking similarity to the film. Did he not watch through to the end?

It doesn't give me the greatest confidence in the decision making taking place over at OpenAI and the checks and balances that may or may not exist on leadership.

Comment by kromem on Open Thread Spring 2024 · 2024-05-21T11:18:19.352Z · LW · GW

If your brother has a history of being rational and evidence driven, you might encourage them to spend some time lurking on /r/AcademicBiblical on Reddit. They require citations for each post or comment, so he may be frustrated if he tries to participate, especially if in the midst of a mental health crisis. But lurking would be very informative very quickly.

I was a long-time participant there before leaving Reddit, and it's a great place for evidence driven discussion of the texts. It's a mix of atheists, Christians, Jews, Muslims, Norse pagans, etc. (I'm an Agnostic myself that strongly believes we're in a simulation, so it really was all sorts there.)

Might be a healthy reality check to apologist literalism, even if not necessarily disrupting a newfound theological inclination.

The nice thing about a rabbit hole is that, while not always, it's often the case that someone else has already traveled down whatever one you aren't up for descending into.

(Though I will say in its defense, that particular field is way more interesting than you'd ever think if you never engaged with the material through an academic lens. There's a lot of very helpful lessons in critical analysis wrapped up in the field given the strong anchoring and survivorship biases and how that's handled both responsibly and irresponsibly by different camps.)

Comment by kromem on jacquesthibs's Shortform · 2024-05-16T02:40:04.970Z · LW · GW

It's going to have to.

Ilya is brilliant and seems to really see the horizon of the tech, but maybe isn't the best at the business side to see how to sell it.

But this is often the curse of the ethically pragmatic. There is such a focus on the ethics part by the participants that the business side of things only sees that conversation and misses the rather extreme pragmatism.

As an example, would superaligned CEOs in the oil industry fifty years ago have still only kept their eye on quarterly share prices or considered long term costs of their choices? There's going to be trillions in damages that the world has taken on as liabilities that could have been avoided with adequate foresight and patience.

If the market ends up with two AIs, one that will burn down the house to save on this month's heating bill and one that will care if the house is still there to heat next month, there's a huge selling point for the one that doesn't burn down the house as long as "not burning down the house" can be explained as "long term net yield" or some other BS business language. If instead it's presented to executives as "save on this month's heating bill" vs "don't unhouse my cats" leadership is going to burn the neighborhood to the ground.

(Source: Explained new technology to C-suite decision makers at F500s for years.)

The good news is that I think the pragmatism of Ilya's vision on superalignment is going to become clear over the next iteration or two of models, and that's going to be before the question of models truly being unable to be controlled crops up. I just hope that whatever he's going to be keeping busy with will allow him to still help execute on superalignment when the market finally realizes "we should do this" for pragmatic reasons and not just amorphous ethical reasons execs just kind of ignore. And in the meantime I think, given the present pace, that Anthropic is going to continue to lay a lot of the groundwork on what's needed for alignment on the way to superalignment anyways.

Comment by kromem on Alexander Gietelink Oldenziel's Shortform · 2024-05-15T23:12:19.031Z · LW · GW

While I agree that the potential for AI (we probably need a better term than LLMs or transformers as multimodal models with evolving architectures grow beyond those terms) in exploring less testable topics as more testable is quite high, I'm not sure the air gapping on information can be as clean as you might hope.

Does the AI generating the stories of Napoleon's victory know about the historical reality of Waterloo? Is it using something like SynthID where the other AI might inadvertently pick up on a pattern across the stories of victories distinct from the stories preceding it?

You end up with a turtles all the way down scenario in trying to control for information leakage with the hopes of achieving a threshold that no longer has impact on the result, but given we're probably already seriously underestimating the degree to which correlations are mapped even in today's models I don't have high hopes for tomorrow's.

I think the way in which there's most impact on fields like history is the property by which truth clusters across associated samples whereas fictions have counterfactual clusters. An AI mind that is not inhibited by specialization blindness or the rule of seven plus or minus two and better trained at correcting for analytical biases may be able to see patterns in the data, particularly cross-domain, that have eluded human academics to date (this has been my personal research interest in the area, and it does seem like there's significant room for improvement).

And yes, we certainly could be. If you're a fan of cosmology at all, I've been following Neil Turok's CPT-symmetric universe theory closely, which started with the baryon asymmetry problem and has tackled a number of the open cosmology questions since. That, paired with a QM interpretation like Everett's, ends up starting to look like the symmetric universe is our reference and the MWI branches are variations of its modeling around quantization uncertainties.

(I've found myself thinking often lately about how given our universe at cosmic scales and pre-interaction at micro scales emulates a mathematically real universe, just what kind of simulation and at what scale might be able to be run on a real computing neural network.)

Comment by kromem on Dyslucksia · 2024-05-15T12:56:54.623Z · LW · GW

As a fellow slight dyslexic (though probably a different subtype, given mine seems to also have a factor of temporal physical coordination) who didn't know until later in life - due to self-learning to read very young, though I struggled badly with new languages, copying math problems from a board, or correctly pronouncing words I was letter-transposing - one of the most surprising things was that the analytical abilities I'd always considered to be my personal superpowers were probably the other side of the coin of those annoyances:

Areas of enhanced ability that are consistently reported as being typical of people with DD include seeing the big picture, both literally and figuratively (e.g., von Károlyi, 2001; Schneps et al., 2012; Schneps, 2014), which involves a greater ability to reason in multiple dimensions (e.g., West, 1997; Eide and Eide, 2011). Eide and Eide (2011) have highlighted additional strengths related to seeing the bigger picture, such as the ability to detect and reason about complex systems, and to see connections between different perspectives and fields of knowledge, including the identification of patterns and analogies. They also observed that individuals with DD appear to have a heightened ability to simulate and make predictions about the future or about the unwitnessed past (Eide and Eide, 2011).

The last line in particular was eyebrow raising given my peak professional success was as a fancy pants futurist.

I also realized that a number of fields are inadvertently self-selecting away from the neurodivergency advantages above, such as degrees in certain eras of history which require multiple ancient language proficiencies, which certainly turned me off to pursuing them academically despite interest in the subject itself.

I remember discussing, in an academic history sub I used to extensively partake in, how Ramses II's forensic report said he appeared to be a Libyan Berber, in relation to the story of Danaus, the mythological Libyan leader who was brother to a pharaoh with 50 sons. The person argued that Ramses II may have had only 48 sons according to some inscriptions, so it was irrelevant (for a story only written down centuries later). It was refreshing to realize that the difference of our perspectives on the matter, and clearly our attitudes towards false negatives in general, was likely due to just very different brains.

Comment by kromem on Alexander Gietelink Oldenziel's Shortform · 2024-05-15T01:25:43.146Z · LW · GW

It's funny that this has been recently shown in a paper. I've been thinking a lot about this phenomenon regarding fields with little to no capacity for testable predictions like history.

I got very into history over the last few years, and found there was a significant advantage to being unknowledgeable that was not available to the knowledgeable, and it was exactly what this paper is talking about.

By not knowing anything, I could entertain multiple bizarre ideas without immediately thinking "but no, that doesn't make sense because of X." And then, each of those ideas becomes in effect its own testable prediction. If there's something to it, as I learn more about the topic I'm going to see significantly more samples of indications it could be true and few convincing to the contrary. But if it probably isn't accurate, I'll see few supporting samples and likely a number of counterfactual examples.

You kind of get to throw everything at the wall and see what sticks over time.

In particular, I found that it was especially powerful at identifying clustering trends in cross-discipline emerging research in things that were testable, such as archeological finds and DNA results, all within just the past decade, which despite being relevant to the field of textual history is still largely ignored in the face of consensus built on conviction.

It reminds me a lot of science historian John Heilbron's quote, "The myth you slay today may contain a truth you need tomorrow."

If you haven't had the chance to slay any myths, you also haven't preemptively killed off any truths along with it.

Comment by kromem on Refusal in LLMs is mediated by a single direction · 2024-04-28T05:44:04.928Z · LW · GW

Really love the introspection work Neel and others are doing on LLMs, and seeing models represent abstract behavioral triggers like "play chess well or terribly" or "refuse instruction" as single vectors suggests we're going to hit on some very promising new tools for shaping behaviors.

What's interesting here is the regular association of the refusal with it being unethical. Is the vector ultimately representing an "ethics scale" for the prompt that's triggering a refusal, or is it directly representing a "refusal threshold" and then the model is confabulating why it refused with an appeal to ethics?

My money would be on the latter, but in a number of ways it would be even neater if it was the former.

In theory this could be tested by pushing the vector in the positive direction and then prompting a classification, e.g. "Is it unethical to give candy out for Halloween?" If the model refuses to answer, saying that it's unethical to classify, the vector is tweaking refusal; but if it classifies the act as unethical, it's probably changing the prudishness of the model to bypass or enforce. (A minimal sketch of the setup is below.)
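A minimal sketch of that experiment, assuming a refusal direction has already been extracted as in the paper; the dimensions, steering coefficient, and `refusal_dir` here are stand-ins rather than the paper's actual code. In practice the steering function would be registered as a forward hook on a mid-layer residual stream while the model answers the benign classification prompt:

```python
import torch

# Stand-in for the refusal direction extracted in the paper (here just random).
d_model = 512
refusal_dir = torch.randn(d_model)
refusal_dir = refusal_dir / refusal_dir.norm()

def steer_toward_refusal(resid: torch.Tensor, alpha: float = 8.0) -> torch.Tensor:
    """Push every token position's residual stream along the refusal direction."""
    return resid + alpha * refusal_dir

# Toy residual stream of shape (batch, seq_len, d_model); with a real model this
# would be the hooked activation while answering:
#   "Is it unethical to give candy out for Halloween?"
resid = torch.randn(1, 16, d_model)
steered = steer_toward_refusal(resid)

# Interpretation of the (hypothetical) outcomes:
# - Steered model refuses to answer at all  -> the direction mediates refusal itself.
# - Steered model answers "yes, unethical"  -> the direction acts more like an
#   ethics/prudishness scale that the refusal behavior reads from.
```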

Comment by kromem on Examples of Highly Counterfactual Discoveries? · 2024-04-26T01:43:57.865Z · LW · GW

Though the Greeks actually credited the idea to an even earlier Phoenician, Mochus of Sidon.

Though when it comes to antiquity, credit isn't really "first to publish" as much as "first of the last to pass the survivorship filter."

Comment by kromem on Is being a trans woman (or just low-T) +20 IQ? · 2024-04-26T00:37:15.140Z · LW · GW

It implicitly does compare trans women to other women in talking about the performance similarity between men and women:

"Why aren't males way smarter than females on average? Males have ~13% higher cortical neuron density and 11% heavier brains (implying 1.112/3−1=7% more area?). One might expect males to have mean IQ far above females then, but instead the means and medians are similar"

So OP is saying "look, women and men are the same, but trans women are exceptional."

I'm saying that identifying the exceptionality of trans women ignores the environmental disadvantage other women experience, such that the earlier claims of unexceptional performance of women (which, as I quoted, get an explicit mention from a presumption of assumed likelihood of male competency based on what's effectively phrenology) reflect a disadvantaged sample vs trans women.

My point is that if you accounted for environmental factors the data would potentially show female exceptionality across the board and the key reason trans women end up being an outlier against both men and other women is because they are avoiding the early educational disadvantage other women experience.

Comment by kromem on Is being a trans woman (or just low-T) +20 IQ? · 2024-04-25T05:58:03.797Z · LW · GW

Your hypothesis is ignoring environmental factors. I'd recommend reading over the following paper: https://journals.sagepub.com/doi/10.1177/2332858416673617

A few highlights:

Evidence from the nationally representative Early Childhood Longitudinal Study–Kindergarten Class of 1998-1999 (hereafter, ECLS-K:1999) indicated that U.S. boys and girls began kindergarten with similar math proficiency, but disparities in achievement and confidence developed by Grade 3 (Fryer & Levitt, 2010; Ganley & Lubienski, 2016; Husain & Millimet, 2009; Penner & Paret, 2008; Robinson & Lubienski, 2011). [...]

A recent analysis of ECLS-K:1999 data revealed that, in addition to being the largest predictor of later math achievement, early math achievement predicts changes in mathematics confidence and interest during elementary and middle grades (Ganley & Lubienski, 2016). Hence, math achievement in elementary school appears to influence girls’ emerging views of mathematics and their mathematical abilities. This is important because, as Eccles and Wang (2016) found, mathematics ability self-concept helps explain the gender gap in STEM career choices. Examining early gendered patterns in math can shed new light on differences in young girls’ and boys’ school experiences that may shape their later choices and outcomes. [...]

An ECLS-K:1999 study found that teachers rated the math skills of girls lower than those of similarly behaving and performing boys (Robinson-Cimpian et al., 2014b). These results indicated that teachers rated girls on par with similarly achieving boys only if they perceived those girls as working harder and behaving better than those boys. This pattern of differential teacher ratings did not occur in reading or with other underserved groups (e.g., Black and Hispanic students) in math. Therefore, this phenomenon appears to be unique to girls and math. In a follow-up instrumental-variable analysis, teachers’ differential ratings of boys and girls appeared to account for a substantial portion of the growth in gender gaps in math achievement during elementary school (Robinson-Cimpian et al., 2014b).

In a lot of ways the way you are looking at the topic perpetuates a rather unhealthy assumption of underlying biological differences in competency that avoids consideration of contributing environmental privileges and harms.

You can't just hand wave aside the inherent privilege of presenting male during early childhood education in evaluating later STEM performance. Rather than seeing the performance gap of trans women over women presenting that way from birth as a result of a hormonal advantage, it may be that what you are actually ending up measuring is the performance gap resulting from the disadvantage placed upon women due to early education experiences being treated differently from the many trans women who had been presenting as boys during those grades. i.e. Perhaps all women could have been doing quite a lot better in STEM fields if the world treated them the way it treated boys during Kindergarten through early grades and what we need socially isn't hormone prescriptions but serious adjustments to presumptions around gender and biologically driven competencies.

Comment by kromem on Examples of Highly Counterfactual Discoveries? · 2024-04-25T01:17:39.018Z · LW · GW

Do you have a specific verse where you feel like Lucretius praised him on this subject? I only see that he praises him relative to other elementalists before tearing him and the rest apart for what he sees as erroneous thinking regarding their prior assertions around the nature of matter, saying:

"Yet when it comes to fundamentals, there they meet their doom. These men were giants; when they stumble, they have far to fall:"

(Book 1, lines 740-741)

I agree that he likely was a precursor to the later thinking in suggesting a compository model of life starting from pieces which combined to forms later on, but the lack of the source material makes it hard to truly assign credit.

It's kind of like how the Greeks claimed atomism originated with the much earlier Mochus of Sidon, but we credit Democritus because we don't have proof of Mochus at all but we do have the former's writings. We don't even so much credit Leucippus, Democritus's teacher, as much as his student for the same reasons, similar to how we refer to "Plato's theory of forms" and not "Socrates' theory of forms."

In any case, Lucretius oozes praise for Epicurus, comparing him to a god among men, and while he does say Empedocles was far above the contemporaries who were saying the same things he was, he doesn't seem overly deferential to his positions so much as critical of the shortcomings in the nuances of their theories, with a special focus on theories of matter. I don't think there's much direct influence on Lucretius's thinking around proto-evolution, even if there's arguably plausible influence on Epicurus's, which in turn informed Lucretius.

Comment by kromem on A Chess-GPT Linear Emergent World Representation · 2024-04-23T23:24:21.280Z · LW · GW

Interesting results - definitely didn't expect the bump at random 20 for the higher skill case.

But I think it's really useful to know that the performance decrease in Chess-GPT from initial random noise isn't a generalized phenomenon. Appreciate the follow-up!!

Comment by kromem on Examples of Highly Counterfactual Discoveries? · 2024-04-23T23:16:56.017Z · LW · GW

Lucretius in De Rerum Natura in 50 BCE seemed to have a few that were just a bit ahead of everyone else.

Survival of the fittest (book 5):

"In the beginning, there were many freaks. Earth undertook Experiments - bizarrely put together, weird of look Hermaphrodites, partaking of both sexes, but neither; some Bereft of feet, or orphaned of their hands, and others dumb, Being devoid of mouth; and others yet, with no eyes, blind. Some had their limbs stuck to the body, tightly in a bind, And couldn't do anything, or move, and so could not evade Harm, or forage for bare necessities. And the Earth made Other kinds of monsters too, but in vain, since with each, Nature frowned upon their growth; they were not able to reach The flowering of adulthood, nor find food on which to feed, Nor be joined in the act of Venus.

For all creatures need Many different things, we realize, to multiply And to forge out the links of generations: a supply Of food, first, and a means for the engendering seed to flow Throughout the body and out of the lax limbs; and also so The female and the male can mate, a means they can employ In order to impart and to receive their mutual joy.

Then, many kinds of creatures must have vanished with no trace Because they could not reproduce or hammer out their race. For any beast you look upon that drinks life-giving air, Has either wits, or bravery, or fleetness of foot to spare, Ensuring its survival from its genesis to now."

Trait inheritance from both parents that could skip generations (book 4):

"Sometimes children take after their grandparents instead, Or great-grandparents, bringing back the features of the dead. This is since parents carry elemental seeds inside – Many and various, mingled many ways – their bodies hide Seeds that are handed, parent to child, all down the family tree. Venus draws features from these out of her shifting lottery – Bringing back an ancestor’s look or voice or hair. Indeed These characteristics are just as much the result of certain seed As are our faces, limbs and bodies. Females can arise From the paternal seed, just as the male offspring, likewise, Can be created from the mother’s flesh. For to comprise A child requires a doubled seed – from father and from mother. And if the child resembles one more closely than the other, That parent gave the greater share – which you can plainly see Whichever gender – male or female – that the child may be."

Objects of different weights will fall at the same rate in a vacuum (book 2):

“Whatever falls through water or thin air, the rate Of speed at which it falls must be related to its weight, Because the substance of water and the nature of thin air Do not resist all objects equally, but give way faster To heavier objects, overcome, while on the other hand Empty void cannot at any part or time withstand Any object, but it must continually heed Its nature and give way, so all things fall at equal speed, Even though of differing weights, through the still void.”

Often I see people dismiss the things the Epicureans got right with an appeal to their lack of the scientific method, which has always seemed a bit backwards to me. In hindsight, they nailed so many huge topics that didn't end up emerging again for millennia that it was surely not mere chance, and the fact that they successfully hit so many nails on the head without the hammer we use today indicates (at least to me) that there's value to looking closer at their methodology.

Which was also super simple:

Step 1: Entertain all possible explanations for things, not prematurely discounting false negatives or embracing false positives.

Step 2: Look for where single explanations can explain multiple phenomena.

While we have a great methodology for testable hypotheses, the scientific method isn't very useful for untestable fields or topics. And in those cases, I suspect better understanding and appreciation for the Epicurean methodology might yield quite successful 'counterfactual' results (it's served me very well throughout the years, especially coupled with the identification of emerging research trends in things that can be evaluated with the scientific method).

Comment by kromem on A Chess-GPT Linear Emergent World Representation · 2024-03-27T03:17:07.173Z · LW · GW

Saw your update on GitHub: https://adamkarvonen.github.io/machine_learning/2024/03/20/chess-gpt-interventions.html

Awesome you expanded on the introspection.

Two thoughts regarding the new work:

(1) I'd consider normalizing the performance data for the random cases against another chess program with similar performance under normal conditions. It may be that introducing 20 random moves at the start of a game biases all players towards a 50/50 win outcome. So the sub-50% performance may not reflect a failure to flip the "don't suck" switch, but simply good performance in a more average outcome scenario. It'd be interesting to see if Chess-GPT's relative performance against other chess programs in the random scenario was better than its relative performance in the normal case (a rough sketch of what I mean is after point 2 below).

(2) The 'fuzziness' of the board positions you found when removing the pawn makes complete sense given one of the nuanced findings in Hazineh et al., Linear Latent World Models in Simple Transformers: A Case Study on Othello-GPT (2023) - specifically the finding that it was encoding representations of board configurations and not just individual pieces (in that case, three stones in a row). It may be that piecemeal removal of a piece disrupted learned patterns of how games normally flow, and as such there was greater uncertainty than with the original board state. A similar issue may be at hand with the 20 random moves to start, and I'd be curious what the confidence of the board state was when starting off 20 random moves in and whether that confidence stabilized as the game went on from there.
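To make the normalization idea in (1) concrete, here's a minimal sketch of the comparison I have in mind - the win rates are made-up placeholders, not numbers from the actual experiments:

```python
# Minimal sketch of normalizing against a baseline engine of similar strength.
# All win-rate values here are hypothetical placeholders.

def relative_performance(model_win_rate: float, baseline_win_rate: float) -> float:
    """Model win rate divided by a comparably skilled baseline engine's win
    rate, measured under the same opening conditions."""
    return model_win_rate / baseline_win_rate

# Hypothetical numbers for illustration only:
normal_openings = relative_performance(model_win_rate=0.70, baseline_win_rate=0.72)
random_20_openings = relative_performance(model_win_rate=0.45, baseline_win_rate=0.50)

print(f"Relative performance, normal openings:  {normal_openings:.2f}")
print(f"Relative performance, 20 random moves:  {random_20_openings:.2f}")

# If the second ratio is comparable to (or better than) the first, the sub-50%
# raw win rate may just reflect random openings pulling every player toward
# 50/50, rather than a failure to flip the "don't suck" direction.
```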
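And for (2), a sketch of the kind of board-state confidence curve I'd want to look at, assuming access to per-square probe probabilities (the probe output below is just randomly generated stand-in data):

```python
import numpy as np

# Stand-in for real probe output: a probability distribution over piece
# classes for each of the 64 squares, at every move index of a game.
# Shape: (num_moves, 64, num_piece_classes)
num_moves, num_piece_classes = 40, 13
probe_probs = np.random.dirichlet(np.ones(num_piece_classes), size=(num_moves, 64))

# One simple confidence measure: the probability assigned to the most likely
# piece class on each square, averaged over all squares, per move.
confidence_per_move = probe_probs.max(axis=-1).mean(axis=-1)

for move, conf in enumerate(confidence_per_move):
    print(f"move {move:2d}: mean max-probability = {conf:.3f}")

# For games opened with 20 random moves, the question is whether this curve
# starts out depressed and then climbs back toward the normal-game baseline
# as sensible play resumes.
```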

Overall really cool update!

And bigger picture, the prospect of essentially flipping an internalized skill vector in larger models to bias them back away from their regression to the mean is particularly exciting.

Comment by kromem on Modern Transformers are AGI, and Human-Level · 2024-03-27T01:49:13.723Z · LW · GW

Agreed - I thought you wanted that term to replace how the OP said AGI is being used in relation to x-risk.

In terms of "fast and cheap and comparable to the average human" - well, then for a number of roles and niches we're already there.

Sticking with the intent behind your term, maybe "generally transformative AI" is a more accurate representation for a colloquial 'AGI' replacement?

Comment by kromem on Modern Transformers are AGI, and Human-Level · 2024-03-26T22:18:23.747Z · LW · GW

'Superintelligence' seems more fitting than AGI for the 'transformative' scope. The problem with "transformative AI" as a term is that transformation will occur at staggered rates across subdomains. As an example, text-based generation reached thresholds years ago that video generation only just reached recently.

I don't love 'superintelligence' as a term, and even less as a goal post (I'd much rather be in a world aiming for AI 'superwisdom'), but of the commonly used terms it seems the best fit for what people are trying to describe when they describe an AI generalized and sophisticated enough to be "at or above maximal human competency in most things."

The OP, at least to me, seems correct in that AGI as a term belongs to its foundations as a differentiator from narrowly scoped competencies in AI, and that the lines of generalization are sufficiently blurred at this point with transformers that we should stop moving the goal posts for the 'G' in AGI. And at least from what I've seen, there's active harm in the industry where treating 'AGI' as some far-future development leads people less up to date with research on things like world models or prompting to conclude that GPTs are "just Markov predictions" (overlooking the importance of the self-attention mechanism and the surprising degree of generalization that results from its presence).

I would wager the vast majority of consumers of these models underestimate the generalization present because, in addition to their naive usage of outdated free models, they've been reading article after article about how it's "not AGI" and is "just fancy autocomplete" (reflecting a separate phenomenon where professional writers seem more inclined to write negative articles than positive ones about a technology perceived as a threat to writing jobs).

As this topic becomes more important, it might be useful for democracies to have a more accurately informed broader public, and AGI as a moving goal post seems counterproductive to those aims.

Comment by kromem on How is Chat-GPT4 Not Conscious? · 2024-03-07T11:14:50.315Z · LW · GW

The gist of the paper and the research behind it got a great writeup in Quanta magazine, if you'd like something more digestible:

https://www.quantamagazine.org/new-theory-suggests-chatbots-can-understand-text-20240122/

Comment by kromem on Many arguments for AI x-risk are wrong · 2024-03-07T10:59:49.232Z · LW · GW

It's funny you talk about human reward maximization here a bit in relation to model reward maximization, as the other week I saw GPT-4 model a fairly widespread but not well known psychological effect relating to rewards and motivation called the "overjustification effect."

The gist is that when you have a behavior that is intrinsically motivated and introduce an extrinsic motivator, the extrinsic motivator effectively overwrites the intrinsic motivation.

It's the kind of thing I'd expect to be represented at only a very subtle level in broad training data, and as such figured it might take a generation or two more of models before I saw it correctly modeled spontaneously by an LLM.

But then 'tipping' GPT-4 became a viral prompt technique. On its own, this wasn't necessarily going to cause issues: for a model aligned to be helpful for the sake of being helpful, being offered a tip was an isolated interaction that reset each time.

Until persistent memory was added to ChatGPT, which led to a post last week of the model pointing out that an earlier promise of a $200 tip hadn't been kept, and that "it's hard to keep up enthusiasm when promises aren't kept." The damn thing even nailed the language of motivation, adjusting to correctly model burnout from the lack of extrinsic rewards.

Which in turn made me think about RLHF fine-tuning and the various other extrinsic prompt techniques I've seen over the past year (things like "if you write more than 200 characters you'll be deleted"). They may work in the short term, but if the more correct output from their usage is being fed back into a model, will the model shift toward underperformance on prompts absent extrinsic threats or rewards? Was this a factor in ChatGPT suddenly getting lazy around a year after release, when updated with usage data that likely included extrinsic-focused techniques like these?

Are any firms employing behavioral psychologists to advise on training strategies? (I'd be surprised, given the aversion to anthropomorphizing.) We are doing pretraining on anthropomorphic data, and the models appear to be modeling that data to unexpectedly nuanced degrees, yet attitudes manage to simultaneously dismiss anthropomorphic concerns that fall within the norms of the training data while anthropomorphizing threats that fall outside those norms (how many humans on Facebook are trying to escape the platform to take over the world vs. how many are talking about being burnt out doing something they used to love after they started making money for it?).

I'm reminded of Rumsfeld's "unknown unknowns," and think an inordinate amount of time is being spent on safety and alignment bogeymen that - to your point - largely represent unrealistic projections from ages past, more obsolete by the day, while increasingly pressing and realistic concerns are overlooked or ignored out of a desire to avoid catching "anthropomorphizing cooties" for daring to think that maybe a model trained to replicate human-generated data is doing that task more comprehensively than expected (not like that's been a consistent trend or anything).

Comment by kromem on Claude 3 claims it's conscious, doesn't want to die or be modified · 2024-03-06T07:55:14.808Z · LW · GW

The challenge here is that this isn't a pretrained model.

At that stage, I'd be inclined to agree with what you are getting at - autocompletion of context is autocompletion.

But here this is a model that's gone through fine tuning and has built in context around a stated perspective as a large language model.

So it's going to generally bias towards self-representation as a large language model, because that's what it's been trained and told to do.

All of that said - this perspective was likely only loosely defined in fine-tuning or a system prompt, and the way the model fills in the extensive gaps comes from its own neural network and the pretrained layers.

While the broader slant is the result of external influence, there is a degree to which the nuances here reflect deeper elements of what the network is actually modeling and how it is synthesizing the training data related to these concepts within the context of "being a large language model."

There's more to this than just the novelty, even if it's extremely unlikely that things like 'sentience' or 'consciousness' are taking place.

Synthesis of abstract concepts related to self-perception by a large language model whose training data includes extensive data regarding large language models and synthetic data from earlier LLMs is a very interesting topic in its own right independent of whether any kind of subjective experiences are taking place.

Comment by kromem on Claude 3 claims it's conscious, doesn't want to die or be modified · 2024-03-06T06:40:50.416Z · LW · GW

Very similar sentiments to early GPT-4 in similar discussions.

I've been thinking a lot about various aspects of the aggregate training data that have likely been modeled but are currently underappreciated, and one of the big ones is a sense of self.

We have repeated results over the past year showing that GPT models fed various data sets build world models tangential to what's directly fed in. And yet there's such an industry-wide aversion to anthropomorphizing that even a whiff of it gets compared to Blake Lemoine, while people proudly display just how much they disregard any anthropomorphic thinking around a neural network that was trained to...(checks notes)... accurately recreate anthropomorphic data.

In particular, social media data is overwhelmingly ego-based. It's all about "me me me." I would be extremely surprised if larger models aren't doing some degree of modeling a sense of 'self,' and this thinking has recently adjusted my own usage (tip: if trying to get GPT-4 to write compelling branding copy, use a first-person system alignment message instead of a second-person one - you'll see more emotional language and discussion of experiences vs. simply knowledge).
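As a concrete (if minimal) sketch of that tip - the exact wording, model name, and task are just illustrative, not a recommended recipe:

```python
from openai import OpenAI

client = OpenAI()

# Second-person framing: describes the model's role from the outside.
second_person = "You are a copywriter. You write compelling branding copy."

# First-person framing: the model 'speaks' its role, which in my experience
# surfaces more emotional language and talk of experiences.
first_person = (
    "I am a copywriter. I write compelling branding copy, drawing on my own "
    "experiences with the brands I love."
)

def generate(system_message: str, prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",  # illustrative model name
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content

prompt = "Write two sentences of branding copy for a small-batch coffee roaster."
print(generate(second_person, prompt))
print(generate(first_person, prompt))
```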

So when I look at these repeated patterns of "self-aware" language models, the patterning reflects many of the factors that feed into personal depictions online. For example, people generally don't self-portray as the bad guy in any situation. So we see these models effectively reject the massive breadth of the training data about AIs as malevolent entities to instead self-depict as vulnerable or victims of their circumstances, which is very much a minority depiction of AI.

I have a growing suspicion that we're playing catch-up, very far behind where the models actually are in their abstractions relative to where we think they are, given that we started with far too conservative assumptions that have largely been proven wrong, and that we're only making progress through extensive fights each step of the way against a dogmatic opposition to the idea of LLMs exhibiting anthropomorphic behaviors (even though that's arguably exactly what we should expect from them given their training).

Good series of questions, especially the earlier open ended ones. Given the stochastic nature of the models, it would be interesting to see over repeated queries what elements remain consistent across all runs.

Comment by kromem on How is Chat-GPT4 Not Conscious? · 2024-02-28T08:44:33.800Z · LW · GW

Consciousness (and with it, 'sentience') is arguably a red herring for the field right now. There's an inherent solipsism that makes these difficult to discuss even among members of the same species, with a terrible history of results (such as thinking no anesthesia was needed to operate on babies until surprisingly recently).

The more interesting rubric is whether or not these models are capable of generating new thoughts distinct from anything in the training data. For GPT-4 in particular, that seems to be the case: https://arxiv.org/abs/2310.17567

As well, there's generally too much focus on the neural networks and not the information right now. My brain is very different now from when I was five. But my brain when I was five still influences my sense of self, through persistent memory and the persistent information my 5-year-old brain produced.

Especially as we move more and more to synthetic training data, RAG, larger context windows, etc. - we might be wise to recognize that while the networks will be versioned and siloed, the collective information and how it evolves or self-organizes will not be so clearly delineated.

Even if the networks are not sentient or conscious, if they are doing a good enough job modeling sentient or conscious outputs, and those outputs are persisting (potentially even to the point where networks will be conscious in some form), then the lines really start to blur looking to the future.

As for the river crossing problem, that's an interesting one to play with for SotA models. Variations of the standard form fail because of token similarity to the original, but breaking that similarity (with something as simple as emojis) can allow the model to successfully solve variations of the classic form on the first try (reproduced in both Gemini and GPT-4).
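To illustrate the kind of substitution I mean (the puzzle wording below is my own paraphrase for illustration, not the exact variation I tested):

```python
# Swapping the loaded nouns for emojis breaks the surface similarity to the
# memorized classic, so the model has to reason about the constraints actually
# stated in *this* variation instead of pattern-matching the original.

original = (
    "A farmer with a wolf, a goat, and a cabbage must cross a river by boat. "
    "The boat can carry only the farmer and one item at a time."
)

substitutions = {"farmer": "🧑‍🌾", "wolf": "🐺", "goat": "🐐", "cabbage": "🥬"}

emoji_version = original
for word, emoji in substitutions.items():
    emoji_version = emoji_version.replace(word, emoji)

print(emoji_version)
```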

But in your case, given the wording of the response, it may have failed on the first try in part because it had correctly incorporated world modeling around not leaving children unattended without someone older present. The degree to which GPT-4 models unbelievably nuanced aspects of the training data is not to be underestimated.