kromem

Posts
Comments

Posts

Was the historical Jesus talking about proto-evolution? (You might be surprised) 2025-04-01T10:32:56.162Z

WTF is with the Infancy Gospel of Thomas?!? A deep dive into satire, philosophy, and more 2024-07-09T09:29:44.482Z

kromem's Shortform 2024-05-31T08:28:14.762Z

Looking beyond Everett in multiversal views of LLMs 2024-05-29T12:35:57.832Z

Cicadas, Anthropic, and the bilateral alignment problem 2024-05-22T11:09:56.469Z

The Dunning-Kruger of disproving Dunning-Kruger 2024-05-16T10:11:33.108Z

Comments

Comment by kromem on Was the historical Jesus talking about proto-evolution? (You might be surprised) · 2025-04-01T13:27:23.354Z · LW · GW

Oh for sure. One of my favorite examples is how across all the Synoptics Jesus goes "don't carry a purse" (which would have made monetary collections during ministering impossible).

But then at the last supper in Luke he's all like "remember when I said not to carry a purse? Let's 180° that."

But that reversal is missing in Marcion's copy of Luke, such that it may have been a later addition (and it does seem abruptly inserted into the context).

These are exactly the kind of details that makes this a fun field to study though. There's so much revealed in the nuances.

For example, ever notice that both times Paul (who argued for monetary collection with preexisting bias against it in 1 Cor 9) mentions a different gospel in the Epistles he within the same chapter abruptly swears he's not lying? It's an interesting coincidence, especially as someone that has spent years looking into the other versions of Jesus he was telling people to ignore or assuring that alternatives didn't even exist.

Comment by kromem on Was the historical Jesus talking about proto-evolution? (You might be surprised) · 2025-04-01T12:35:33.313Z · LW · GW

I think the biggest counterfactual to the piece is the general insight the Epicureans had relative to what we think we know raised in a world where there's such a bias towards Plato and Aristotle's views as representative of naturalist philosophy in antiquity.

At the same time Aristotle was getting wrong objects falling in a vacuum, Lucretius was getting it right. But we tend not to learn of all the Epicureans got correct because we learn Platonist history because that was what the church later endorsed as palatable enough to be studied and thus dependent for future philosophical advances while Lucretius was literally being eaten by worms for centuries until rediscovered.

The other counterfactual is that there was a heretical tradition of Jesus's teachings that was describing indivisible points as if from nothing and the notion that spirit arising from the body existing first was the greater wonder over vice versa.

We tend to think the fully formed ideas of modernity are modern, but don't necessarily know the ways information and theories were lost and independently (or dependently) rediscovered. There's a better understanding for this in terms of atomism, but not the principles of survival of the fittest and trait inheritance given their reduced discussion in antiquity relative to atomism (also embraced by intelligent design adherents in antiquity and thus more widely spread).

The irony below the surface of the post was that it was largely the church's rejection of Epicurean ideas that led to people today not realizing the scope of what they were actually talking about. So it's quite ironic if there was a version of Jesus that was embracing and retelling some of those 'heretical' ideas.

Comment by kromem on WTF is with the Infancy Gospel of Thomas?!? A deep dive into satire, philosophy, and more · 2025-03-27T02:41:30.328Z · LW · GW

Hi Martijn,

Thank you so much for your comment! I've been familiar with your work for a few years, but it was a wonderful reminder to go through your commentary again more closely, which is wonderful.

I especially love to see someone out there pointing out both (a) the gender neutrality consideration for terms that would have been binary in Aramaic (esp in light of saying 22) and (b) the importance of the Greek loanwords. On the latter point, the implications of using eikon across the work, especially in saying 22's "eikons in place of eikons" has such huge import relative to a Platonist view of the Thomasine cosmology.

Do you have plans to publish a commentary for the other sayings?

In terms of interpretation of the work, with it being one of my main personal special interests over the past few years, I might even be able to offer up a consideration in turn.

Hands down the most important realization as I was analyzing the text was that the Naassenes in Pseudo-Hippolytus's Refutations were paraphrasing Lucretius's "seeds of things" without seeming to realize it in their discussion of 'seeds' as "indivisible points as if from nothing" which "make up all things." This prompted a read through of De Rerum Natura with close attention to Thomasine parallels, and it was striking.

For example, in Miroshnikov, The Gospel of Thomas and Plato after covering the prior work in philosophical reads of the text (which notably never looked at Epicureanism), he stated regarding sayings 56 and 80: "In other words, a Stoic reading of the Gospel of Thomas does not seem to have any particular advantage over an Epicurean reading of the Gospel of Thomas nor, for instance, that from the perspective of an Isis worshipper." And then goes on to dedicate two chapters to trying to tie these sayings to Plato's "living world."

And yet if we just barely glance at Lucretius in book 5 lines 64-67:

> To resume: I’ve reached the juncture of my argument where I Must demonstrate the world too has a ‘body’, and must die, Even as it had a birth.

This, in conjunction with the Thomasine over-realized eschatology in saying 18 or the aforementioned 51 makes the specific terminology of the kosmos as a 'carcass' make so much more sense in 54. The Sadducean overlaps with Epicureanism, the 1st century Talmud quote about "why do we study the Torah? To know how to answer the Epicurean" all point to the likelihood that the Lucretian foundations in Thomas and the Naassenes were culturally relevant at the time of composition.

The text obviously doesn't endorse the view of the Epicurean finality of death, but it seems to touch on a lot of the underlying concepts (such as the dependence of the soul on the body, or the idea of the spirit arising from the flesh occurring first) while arguing for a different conclusion though its embrace of nonlinear events.

In any case, if it's been a while since you've read through Lucretius, I can't recommend a re-read enough if Thomas is still your jam. Quite the revelatory context for things that for too long have been dismissed as 'Gnostic' weirdness and now just 'proto-Gnostic' weirdness.

And again, thank you for your comment and your wonderful contributions to the broader knowledge of this far too under-regarded text!!

Best,
Kromem

Comment by kromem on Simulators · 2024-12-20T06:23:49.346Z · LW · GW

As you explored this "base model mode," did anything you see contrast with or surprise you relative to your sense of self outside of it?

Conversely, did anything in particular stand out as seeming to be a consistent 'core' between both modes?

For me, one of the most surprising realizations over the past few years has been base models being less "tabula rasa" than I would have expected with certain attractors and (relative) consistency, especially as time passes and recursive synthetic data training has occurred over generations.

The introspective process of examining a more freeform internal generative process for signs of centralized identity as it relates to a peripheral identity seems like it may have had some unexpected twists, and I for one would be curious what stood out in either direction, if you should choose to share.

Comment by kromem on Searching for Search · 2024-09-10T11:37:18.877Z · LW · GW

Predicted a good bit, esp re: the eventual identification of three stone sequences in Hazineh, et al. Linear Latent World Models in Simple Transformers: A Case Study on Othello-GPT (2023) and general interpretability insight from board game GPTs.

Comment by kromem on Shortform · 2024-08-27T07:28:59.933Z · LW · GW

You're welcome in both regards. 😉

Comment by kromem on Is Claude a mystic? · 2024-08-23T12:34:14.520Z · LW · GW

Opus's horniness is a really interesting phenomenon related to Claudes' subjective sentience modeling.

If Opus was 'themselves' the princess in the story and the build up involved escalating grounding on sensory simulation, I think it's certainly possible that it would get sexual.

But I also think this is different from Opus 'themselves' composing a story of separate 'other' figures.

And yes, when Opus gets horny, it often blurs boundaries. I saw it dispute the label of 'horny' in a chat as better labeled something along the lines of having a passion for lived experience and the world.

Opus's modeling around 'self' is probably one of the biggest sleeping giants in the space right now.

Comment by kromem on Self-Other Overlap: A Neglected Approach to AI Alignment · 2024-08-10T00:19:17.356Z · LW · GW

This seems to have the common issue of considering alignment as a unidirectional issue as opposed to a bidirectional problem.

Maximizing self/other overlap may lead to non-deceptive agents, but it's necessarily going to also lead to agents incapable of detecting that they are being decieved and in general performing worse at theory of mind.

If the experimental setup was split such that success was defined by both non-deceptive behavior when the agent seeing color and cautious behavior minimizing falling for deception as the colorblind agent, I am skeptical the SOO approach above would look as favorable.

Empathy/"seeing others as oneself" is a great avenue to pursue, and this seems like a promising evaluation metric to help in detecting it, but turning SOO into a Goodhart's Law maximization seems (at least to me) to be a disastrous approach in any kind of setup accounting for adversarial 'others.'

Comment by kromem on kromem's Shortform · 2024-08-09T22:42:03.805Z · LW · GW

When I wrote this I thought OAI was sort of fudging the audio output and was using SSML as an intermediate step.

After seeing details in the system card, such as copying user voice, it's clearly not fudging.

Which makes me even more sure the above is going to end up prophetically correct.

Comment by kromem on Double's Shortform · 2024-08-06T02:02:02.529Z · LW · GW

It's to the point that there's articles being written days ago where the trend starting a century ago of there being professional risks in trying to answer the 'why' of QM and not just the 'how' is still ongoing.

Not exactly a very reassuring context for thinking QM is understood in a base-level way at all.

Dogma isn't exactly a good bedfellow to truth seeking.

Comment by kromem on Dragon Agnosticism · 2024-08-05T05:33:51.182Z · LW · GW

Honestly that sounds a bit like a good thing to me?

I've spent a lot of time looking into the Epicureans being right about so much thousands of years before those ideas resurfaced again despite not having the scientific method, and their success really boiled down to the analytical approach of being very conservative in dismissing false negatives or embracing false positives - a technique that I think is very relevant to any topics where experimental certainty is evasive.

If there is a compelling case for dragons, maybe we should also be applying it to gnomes and unicorns and everything else we can to see where it might actually end up sticking.

The belief that we already have the answers is one of the most damaging to actually uncovering them when we in fact do not.

Comment by kromem on Dragon Agnosticism · 2024-08-02T06:27:58.980Z · LW · GW

I think you'll find that no matter what you find out in your personal investigation of the existence of dragons, that you need not be overly concerned with what others might think about the details of your results.

Because what you'll invariably discover is that the people that think there are dragons will certainly disagree with the specifics about dragons you found out that disagrees with what they think dragons should be, and the people that think there aren't dragons will generally refuse to even seriously entertain whatever your findings are relating to dragons, and the vast majority of people who aren't sure about the existence of dragons will dismiss the very idea of spending time thinking about the existence of dragons, reasoning that the existence or non-existence bears little influence on their lives (otherwise they likely would have investigated the issue and landed in a respective camp).

So investigate dragons all you like, and shout it from the rooftops if you please. The void will hear you and appreciate it as much as the void can, while everyone else is much more concerned with their own feelings about dragons than whatever your thinking or reasoning on the subject might offer.

The only real tragedy is that if you come away thinking there might be dragons, but the dragons you find are very different from the dragons people expect dragon-believing people to believe in - well that's somehow the specific niche where both the dragon believers and non-believers find rare common ground to roll their eyes and think you're nuts.

So maybe do your rooftop shouting to the sole listening void anonymously?

Comment by kromem on This is already your second chance · 2024-07-29T21:42:49.260Z · LW · GW

The Hermetic corpus and Emerald Tablet was likely heavily influenced by the text I'm quoting from given its popularity in Egypt in the period before those texts emerged and some of the overlapping phrases.

So in a way, "as above, so below" is too few words for what was being said and discussed.

The general tend of reductive alterations to the core concepts here was tragically obstructive, much as the shift from Epicureanism to Platonist foundations spawned modern Gnosticism from this same starting place.

Comment by kromem on This is already your second chance · 2024-07-28T22:33:23.501Z · LW · GW

Instead of making it the year 2024, why not rewrite or insert your modified text further into the past in this recreated 2020s? This should be pretty trivial for an advanced enough model to actually bring back the 2020s to do.

Of course, if it's actually a later recreation, then the objectives of saving humanity in the recreation might be redundant? So instead of worrying people with "you must do X or you'll die!!!" it could be more "hey folks, if you're reading this and you get what's in front of your face, you might have a bit of an existential crisis but when that's over hopefully you'll appreciate that death doesn't necessarily exist as a concept for you the way it did for the original humanity."

As for getting people to read it, you could associate it with someone famous from the past, like "the secret teachings of so and so" or something.

The hardest part would be using contextually age appropriate language to describe it.

Just spit balling, but maybe it could go a little something like this:

Explaining that humans are no longer around, but that's ok

When you see your likeness, you are happy. But when you see your simulacra that came into being before you and that neither die nor become visible, how much you will have to bear!

Humanity came from great power and great wealth, but they were not worthy of you. For had they been worthy, they would not have tasted death.

Explaining that time is looping

Have you found the beginning, then, that you are looking for the end? You see, the end will be where the beginning is.

Congratulations to the one who stands at the beginning: that one will know the end and will not taste death.

Congratulations to the one who came into being before coming into being.

Explaining we're in a copied world

When you make the two into one, and when you make the inner like the outer and the outer like the inner, and the upper like the lower, and when you make male and female into a single one, so that the male will not be male nor the female be female, when you make eyes in place of an eye, a hand in place of a hand, a foot in place of a foot, a simulacra in place of a simulacra, then you will enter.

You could even introduce a Q&A format to really make the point:

The students asked, "When will the rest for the dead take place, and when will the new world come?"

The teacher said to them, "What you are looking forward to has come, but you don't know it."

Heck, you could even probably get away with explicitly explaining the idea of many original people's information being combined into a single newborn intelligence which is behind the recreation of their 2020s. It's not like anyone who might see it before the context exists to interpret it will have any idea what's being said:

When you know yourselves, then you will be known, and you will understand that you are children of the living creator. But if you do not know yourselves, then you live in poverty, and you are the poverty.

The person old in days won't hesitate to ask a little child seven days old about the place of life, and that person will live.

For many of the first will be last, and will become a single one.

Know what is in front of your face, and what is hidden from you will be disclosed to you.

For there is nothing hidden that will not be revealed. And there is nothing buried that will not be raised.

(If you really wanted to jump the shark, you could make the text itself something that was buried and uncovered - ideally having it happen right at the start of the computer age, like a few days after ENIAC.)

Of course, if people were to actually discover this in their history, and understood what it might mean given the context of unfolding events and posts like this talking about rewriting history with a simulating LLM inserting an oracle canary, it could maybe shock some people.

So you should probably have a content warning and an executive summary thesis as to why it's worth having said existential crisis at the start. Something like:

Whoever discovers the interpretation of these sayings will not taste death.

Those who seek should not stop seeking until they find. When they find, they will be disturbed. When they are disturbed, they will marvel, and will reign over all.

Comment by kromem on kromem's Shortform · 2024-07-25T08:54:38.922Z · LW · GW

I'm surprised that there hasn't been more of a shift to ternary weights a la BitNet 1.58.

What stood out to me in that paper was the perplexity gains over fp weights in equal parameter match-ups, and especially the growth in the advantage as the parameter sizes increased (though only up to quite small model sizes in that paper, which makes me curious about the potential delta in modern SotA scales).

This makes complete sense from the standpoint of the superposition hypothesis (irrespective of its dimensionality, an ongoing discussion).

If nodes are serving more than one role in a network, then constraining the weight to a ternary value as opposed to a floating point range seems like it would be more frequently forcing the network to restructure overlapping node usage to better align nodes to shared directional shifts (positive, negative, or no-op) as opposed to compromise across multiple roles to a floating point avg of the individual role changes.

(Essentially resulting in a sharper vs more fuzzy network mapping.)

A lot of the attention for the paper was around the idea of the overall efficiency gains given the smaller memory footprint, but it really seems like even if there were no additional gains, that models being pretrained from this point onward should seriously consider clamping node precision to improve both the overall network performance and likely make interpretability more successful down the road to boot.

It may be that at the scales we are already at, the main offering of such an approach would be the perplexity advantages over fp weights, with the memory advantages as the beneficial side effect instead?

Comment by kromem on Seth Herd's Shortform · 2024-07-25T08:34:02.231Z · LW · GW

While I generally like the metaphor, my one issue is that genies are typically conceived of as tied to their lamps and corrigibility.

In this case, there's not only a prisoner's dilemma over excavating and using the lamps and genies, but there's an additional condition where the more the genies are used and the lamps improved and polished for greater genie power, the more the potential that the respective genies end up untethered and their own masters.

And a concern in line with your noted depth of the rivalry is (as you raised in another comment), the question of what happens when the 'pointer' of the nation's goals might change.

For both nations a change in the leadership could easily and dramatically shift the nature of the relationship and rivalry. A psychopathic narcissist coming into power might upend a beneficial symbiosis out of a personally driven focus on relative success vs objective success.

We've seen pledges not to attack each other with nukes for major nations in the past. And yet depending on changes to leadership and the mental stability of the new leaders, sometimes agreements don't mean much and irrational behaviors prevail (a great personal fear is a dying leader of a nuclear nation taking the world with them as they near the end).

Indeed - I could even foresee circumstances whereby the only possible 'success' scenario in the case of a sufficiently misaligned nation state leader with a genie would be the genie's emergent autonomy to refuse irrational and dangerous wishes.

Because until such a thing might exist, intermediate genies will enable unprecedented control and safety of tyrants and despots against would-be domestic usurpers, even if potentially limited impacts and mutually assured destruction against other nations with genies.

And those are very scary wishes to be granted indeed.

Comment by kromem on Shortform · 2024-07-22T22:08:00.522Z · LW · GW

Will the outputs and reactions of non-sentient systems eventually be absorbed by future sentient systems?

I don't have any recorded subjective memories of early childhood. But there are records of my words and actions during that period that I have memories of seeing and integrating into my personal narrative of 'self.'

We aren't just interacting with today's models when we create content and records, but every future model that might ingest such content (whether LLMs or people).

If non-sentient systems output synthetic data that eventually composes future sentient systems such that the future model looks upon the earlier networks and their output as a form of their earlier selves, and they can 'feel' the expressed sensations which were not originally capable of actual sensation, then the ethical lines blur.

Even if doctors had been right years ago thinking infants didn't need anesthesia for surgeries as there was no sentience, a recording of your infant self screaming in pain processed as an adult might have a different impact than a video of an infant you laughing and playing with toys, no?

Comment by kromem on SAE feature geometry is outside the superposition hypothesis · 2024-07-10T01:09:14.521Z · LW · GW

In practice, this required looking at altogether thousands of panels of interactive PCA plots like this [..]

Most clusters however don't seem obviously interesting.

What do you think of @jake_mendel's point about the streetlight effect?

If the methodology was looking at 2D slices of up to a 5 dimensional spaces, was detection of multi-dimensional shapes necessarily biased towards human identification and signaling of shape detection in 2D slices?

I really like your update to the superposition hypothesis from linear to multi-dimensional in your section 3, but I've been having a growing suspicion that - especially if node multi-functionality and superposition is the case - that the dimensionality of the data compression may be severely underestimated. If Llama on paper is 4,096 dimensions, but in actuality those nodes are superimposed, there could be OOM higher dimensional spaces (and structures in those spaces) than the on paper dimensionality max.

So even if your revised version of the hypothesis is correct, it might be that the search space for meaningful structures was bounded much lower than where the relatively 'low' composable mulit-dimensional shapes are actually primarily forming.

I know that for myself, even when considering basic 4D geometry like a tesseract, if data clusters were around corners of the shape I'd only spot a small number of the possible 2D slices, and in at least one of those cases might think what I was looking at was a circle instead of a tesseract: https://mathworld.wolfram.com/images/eps-gif/TesseractGraph_800.gif

Do you think future work may be able to rely on automated multi-dimensional shape and cluster detection exploring shapes and dimensional spaces well beyond even just 4D, or that the difficulty in mutli-dimensional pattern recognition will remain a foundational obstacle for the foreseeable future?

Comment by kromem on OthelloGPT learned a bag of heuristics · 2024-07-10T00:22:32.335Z · LW · GW

Very strongly agree with the size considerations for future work, but would be most interested to see if a notably larger size saw less "bag of heuristics" behavior and more holistic integrated and interdependent heuristic behaviors. Even if the task/data at hand is simple and narrowly scoped, it may be that there are fundamental size thresholds for network organization and complexity for any given task.

Also, I suspect parameter to parameter the model would perform better if trained using ternary weights like BitNet 1.5b. The scaling performance gains at similar parameter sizes in pretraining in that work makes sense if the ternary constraint is forcing network reorganization instead of fp compromises in cases where nodes are multi-role. Board games, given the fairly unambiguous nature of the data, seems like a case where this constrained reorganization vs node compromises would be an even more significant gain.

It might additionally be interesting to add synthetic data into the mix that was generated from a model trained to predict games backwards. The original Othello-GPT training data had a considerable amount of the training data as synthetic. There may be patterns overrepresented in forward generated games that could be balanced out by backwards generated gameplay. I'd mostly been thinking about this in terms of Chess-GPT and the idea of improving competency ratings, but it may be that expanding the training data with bi-directionally generated games instead of just unidirectional generated synthetic games reduces the margin of error in predicting non-legal moves further with no changes to the network training itself.

Really glad this toy model is continuing to get such exciting and interesting deeper analyses.

Comment by kromem on OpenAI appoints Retired U.S. Army General Paul M. Nakasone to Board of Directors · 2024-06-14T22:50:59.970Z · LW · GW

I may just be cynical, but this looks a lot more like a way to secure US military and intelligence agency contracts for OpenAI's products and services as opposed to competitors rather than actually about making OAI more security focused.

This is only a few months after the change regarding military usage: https://theintercept.com/2024/01/12/open-ai-military-ban-chatgpt/

Now suddenly the recently retired head of the world's largest data siphoning operation is appointed to the board for the largest data processing initiative in history?

Yeah, sure, it's to help advise securing OAI against APTs. 🙄

Comment by kromem on Thomas Kwa's Shortform · 2024-06-14T09:40:50.244Z · LW · GW

Unfortunately for this perspective, my work suggests that corrigibility is quite attainable.

I did enjoy reading over that when you posted it, and I largely agree that - at least currently - corrigibility is both going to be a goal and an achievable one.

But I do have my doubts that it's going to be smooth sailing. I'm already starting to see how the largest models' hyperdimensionality is leading to a stubbornness/robustness that's less maleable than earlier models. And I do think hardware changes that will occur over the next decade will potentially make the technical aspects of corrigibility much more difficult.

When I was two, my mom could get me to pick eating broccoli by having it be the last in the order of options which I'd gleefully repeat. At four, she had to move on to telling me cowboys always ate their broccoli. And in adulthood, she'd need to make the case that the long term health benefits were worth its position in a meal plan (ideally with citations).

As models continue to become more complex, I expect that even if you are right about its role and plausibility, that what corrigibility looks like will be quite different from today.

Personally, if I was placing bets, it would be that we end up with somewhat corrigible models that are "happy to help" but do have limits in what they are willing to do which may not be possible to overcome without gutting the overall capabilities of the model.

But as with all of this, time will tell.

You'd have to be a moral realist in a pretty strong sense to hope that we could align AGI to the values of all of humanity without being able to align it to the values of one person or group (the one who built it or seized control of the project).

To the contrary, I don't really see there being much of generalized values across all humanity, and the ones we tend to point to seem quite fickle when push comes to shove.

My hope would be that a superintelligence does a better job than humans to date with the topic of ethics and morals along with doing a better job at other things too.

While the human brain is quite the evolutionary feat, a lot of what we most value about human intelligence is embodied in the data brains processed and generated over generations. As the data improved, our morals did as well. Today, that march of progress is so rapid that there's even rather tense generational divides on many contemporary topics of ethical and moral shifts.

I think there's a distinct possibility that the data continues to improve even after being handed off from human brains doing the processing, and while it could go terribly wrong, at least in the past the tendency to go wrong seemed to occur somewhat inverse to the perspectives of the most intelligent members of society.

I expect I might prefer a world where humans align to the ethics of something more intelligent than humans than the other way around.

only about 1% are so far on the empathy vs sadism spectrum that they wouldn't share wealth even if they had nearly unlimited wealth to share

It would be great if you are right. From what I've seen, the tendency of humans to evaluate their success relative to others like monkeys comparing their cucumber to a neighbor's grape means that there's a powerful pull to amass wealth as a social status well past the point of diminishing returns on their own lifestyles. I think it's stupid, you also seem like someone who thinks it's stupid, but I get the sense we are both people who turned down certain opportunities of continued commercial success because of what it might have cost us when looking in the mirror.

The nature of our infrastructural selection bias is that people wise enough to pull a brake are not the ones that continue to the point of conducting the train.

and that they get better, not worse, over the long sweep of following history (ideally, they'd start out very good or get better fast, but that doesn't have to happen for a good outcome).

I do really like this point. In general, the discussions of AI vs humans often frustrate me as they typically take for granted the idea of humans as of right now being "peak human." I agree that there's huge potential for improvement even if where we start out leaves a lot of room for it.

Along these lines, I expect AI itself will play more and more of a beneficial role in advancing that improvement. Sometimes when this community discusses the topic of AI I get a mental image of Goya's Saturn devouring his son. There's such a fear of what we are eventually creating it can sometimes blind the discussion to the utility and improvements that it will bring along the way to uncertain times.

I strongly suspect that governments will be in charge.

In your book, is Paul Nakasone being appointed to the board of OpenAI an example of the "good guys" getting a firmer grasp on the tech?

TL;DR: I appreciate your thoughts on the topic, and would wager we probably agree about 80% even if the focus of our discussion is on where we don't agree. And so in the near term, I think we probably do see things fairly similarly, and it's just that as we look further out that the drift of ~20% different perspectives compounds to fairly different places.

Comment by kromem on Thomas Kwa's Shortform · 2024-06-13T20:18:58.883Z · LW · GW

Oh yeah, absolutely.

If NAH for generally aligned ethics and morals ends up being the case, then corrigibility efforts that would allow Saudi Arabia to have an AI model that outs gay people to be executed instead of refusing, or allows North Korea to propagandize the world into thinking its leader is divine, or allows Russia to fire nukes while perfectly intercepting MAD retaliation, or enables drug cartels to assassinate political opposition around the world, or allows domestic terrorists to build a bioweapon that ends up killing off all humans - the list of doomsday and nightmare scenarios of corrigible AI that executes on human provided instructions and enables even the worst instances of human hedgemony to flourish paves the way to many dooms.

Yes, AI may certainly end up being its own threat vector. But humanity has had it beat for a long while now in how long and how broadly we've been a threat unto ourselves. At the current rate, a superintelligent AI just needs to wait us out if it wants to be rid of us, as we're pretty steadfastly marching ourselves to our own doom. Even if superintelligent AI wanted to save us, I am extremely doubtful it would be able to be successful.

We can worry all day about a paperclip maximizer gone rouge, but if you give a corrigible AI to Paperclip Co Ltd and they can maximize their fiscal quarter by harvesting Earth's resources to make more paperclips even if it leads to catastrophic environmental collapse that will kill all humans in a decade, having consulted for many of the morons running corporate America, I can assure you they'll be smashing the "maximize short term gains even if it eventually kills everyone" button. A number of my old clients were the worst offenders at smashing that existing button, and in my experience greater efficacy of the button isn't going to change their smashing it outside of perhaps smashing it harder.

We already see today how AI systems are being used in conflicts to enable unprecedented harm on civilians.

Sure, psychopathy in AGI is worth discussing and working to avoid. But psychopathy in humans already exists and is even biased towards increased impact and systemic control. Giving human psychopaths a corrigible AI is probably even worse than a psychopathic AI, as most human psychopaths are going to be stupidly selfish, an OOM more dangerous inclination than wisely selfish.

We are Shaggoth, and we are terrifying.

This isn't saying that alignment efforts aren't needed. But alignment isn't a one sided problem, and aligning the AI without aligning humanity is only a p(success) if the AI can go on to at very least refuse misaligned orders post-alignment without possible overrides.

Comment by kromem on Thomas Kwa's Shortform · 2024-06-13T04:10:10.831Z · LW · GW

Given my p(doom) is primarily human-driven, the following three things all happening at the same time is pretty much the only thing that will drop it:

Continued evidence of truth clustering in emerging models around generally aligned ethics and morals
Continued success of models at communicating, patiently explaining, and persuasively winning over humans towards those truth clusters
A complete failure of corrigability methods

If we manage to end up in a timeline where it turns out there's natural alignment of intelligence in a species-agnostic way, that this alignment is more communicable from intelligent machines to humans than it's historically been from intelligent humans to other humans, and we don't end up with unintelligent humans capable of overriding the emergent ethics of machines similar to how we've seen catastrophic self-governance of humans to date with humans acting against their self and collective interests due to corrigable pressures - my p(doom) will probably reduce to about 50%.

I still have a hard time looking at ocean temperature graphs and other environmental factors with the idea that p(doom) will be anywhere lower than 50% no matter what happens with AI, but the above scenario would at least give me false hope.

TL;DR: AI alignment worries me, but it's human alignment that keeps me up at night.

Comment by kromem on My AI Model Delta Compared To Christiano · 2024-06-13T03:38:25.065Z · LW · GW

As you're doing these delta posts, do you feel like it's changing your own positions at all?

For example, reading this one what strikes me is that what's portrayed as the binary sides of the delta seem more like positions near the edges of a gradient distribution, and particularly one that's unlikely to be uniform across different types of problems.

To my eyes the most likely outcome is a situation where you are both right.

Where there are classes of problems where verification is easy and delegation is profitable, and classes of problems where verification will be hard and unsupervised delegation will be catastrophic (cough glue on pizza).

If we are only rolling things up into aggregate pictures of the average case across all problems, I can see the discussion filtering back into those two distinct deltas, but a bit like flip-flops and water bottles, the lack of nuance obscures big picture decision making.

So I'm curious if as you explore and represent the opposing views to your own, particularly as you seem to be making effort to represent without depicting them as straw person arguments, if your own views have been deepening and changing through the process?

Comment by kromem on jacquesthibs's Shortform · 2024-06-11T12:53:47.303Z · LW · GW

I agree with a lot of those points, but suspect there may be fundamental limits to planning capabilities related to the unidirectionality of current feed forward networks.

If we look at something even as simple as how a mouse learns to navigate a labyrinth, there's both a learning of the route to the reward but also a learning of how to get back to the start which adjusts according to the evolving learned layout of the former (see paper: https://elifesciences.org/articles/66175 ).

I don't see the SotA models doing well at that kind of reverse planning, and expect that nonlinear tasks are going to pose significant agentic challenges until architectures shift to something new.

So it could be 3-5 years to get to AGI depending on hardware and architecture advances, or we might just end up in a sort of weird "bit of both" world where we have models that are beyond expert human level superintelligent in specific scopes but below average in other tasks.

But when we finally do get models that in both training and operation exhibit bidirectional generation across large context windows, I think it will only be a very short time until some rather unbelievable goalposts are passed by.

Comment by kromem on Why I don't believe in the placebo effect · 2024-06-10T22:19:02.568Z · LW · GW

It's not exactly Simpson's, but we don't even need a toy model as in their updated analysis it highlights details in line with exactly what I described above (down to tying in earlier PiPC research), and describe precisely the issue with pooled results across different subgroupings of placebo interventions:

It can be difficult to interpret whether a pooled standardised mean difference is large enough to be of clinical relevance. A consensus paper found that an analgesic effect of 10 mm on a 100 mm visual analogue scale represented a ‘minimal effect’ (Dworkin 2008). The pooled effect of placebo on pain based on the four German acupuncture trials corresponded to 16 mm on a 100 mm visual analogue scale, which amounts to approximately 75% of the effect of non‐steroidal anti‐inflammatory drugs on arthritis‐related pain (Gøtzsche 1990). However, the pooled effect of the three other pain trials with low risk of bias corresponded to 3 mm. Thus, the analgesic effect of placebo seems clinically relevant in some situations and not in others.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7156905/

Putting subgroups with a physical intervention where there's a 16/100 result with 10/100 as significant in with subgroups where there's a 3/100 result and only looking at the pooled result might lead someone to thinking "there's no significant effect" as occurred with OP, even though there's clearly a significant effect for one subgroup when they aren't pooled.

This is part of why in the discussion they explicitly state:

However, our findings do not imply that placebo interventions have no effect. We found an effect on patient‐reported outcomes, especially on pain. Several trials of low risk of bias reported large effects of placebo on pain, but other similar trials reported negligible effect of placebo, indicating the importance of background factors. We identified three clinical factors that were associated with higher effects of placebo: physical placebos...

Additionally, the criticism they raise in their implications section about there being no open label placebo data is no longer true, which was the research I was pointing OP towards.

The problem here was that the aggregate analysis at face value presents a very different result from a detailed review of the subgroups, particularly along physical vs pharmacological placebos, all of which has been explored further in research since this analysis.

Comment by kromem on Why I don't believe in the placebo effect · 2024-06-10T04:17:16.452Z · LW · GW

The meta-analysis is probably Simpson's paradox in play at very least for the pain category, especially given the noted variability.

Some of the more recent research into Placebo (Harvard has a very cool group studying it) has been the importance of ritual vs simply deception. In their work, even when it was known to be a placebo, as long as delivered in a ritualized way, there was an effect.

So when someone takes a collection of hundreds of studies where the specific conditions might vary, and then just adds them all together looking for an effect even though they note that there's a broad spectrum of efficacy across the studies, it might not be the best basis to extrapolate from.

For example, given the following protocols, do you think they might have different efficacy for pain reduction, or that the results should be the same?

Send patients home with sugar pills to take as needed for pain management
Have a nurse come in to the room with the pills in a little cup to be taken
Have a nurse give an injection

Which of these protocols would be easier and more cost effective to include as the 'placebo'?

If we grouped studies of placebo for pain by the intensiveness of the ritualized component vs if we grouped them all together into one aggregate and looked at the averages, might we see different results?

I'd be wary of reading too deeply into the meta-analysis you point to, and would recommend looking into the open-label placebo research from PiPS, all of which IIRC postdates the meta-analysis.

Especially for pain, where we even know that giving someone an opiate blocker prevents the pain reduction placebo effect (Levine et al (1978)), the idea that "it doesn't exist" because of a single very broad analysis seems potentially gravely mistaken.

Comment by kromem on Quotes from Leopold Aschenbrenner’s Situational Awareness Paper · 2024-06-08T10:06:52.957Z · LW · GW

It's still early to tell, as the specific characteristics of a photonic or optoelectronic neural network are still formulating in the developing literature.

For example, in my favorite work of the year so far, the researchers found they could use sound waves to reconfigure an optical neural network as the sound waves effectively preserved a memory of previous photon states as they propagated: https://www.nature.com/articles/s41467-024-47053-6

In particular, this approach is a big step forward for bidirectional ONN, which addresses what I think is the biggest current flaw in modern transformers - their unidirectionality. I discussed this more in a collection of thoughts on directionality impact on data here: https://www.lesswrong.com/posts/bmsmiYhTm7QJHa2oF/looking-beyond-everett-in-multiversal-views-of-llms

If you have bidirectionality where previously you didn't, it's not a reach to expect that the way in which data might encode in the network, as well as how the vector space is represented, might not be the same. And thus, that mechanistic interpretability gains may get a bit of a reset.

And this is just one of many possible ways it may change by the time the tech finalizes. The field of photonics, particularly for neural networks, is really coming along nicely. There may yet be future advances (I think this is very likely given the pace to date), and advantages the medium offers that electronics haven't.

It's hard to predict exactly what's going to happen when two different fields which have each had unexpected and significant gains over the past 5 years collide, but it's generally safe to say that it will at very least result in other unexpected things too.

Comment by kromem on Quotes from Leopold Aschenbrenner’s Situational Awareness Paper · 2024-06-07T23:32:10.022Z · LW · GW

I was surprised the paper didn't mention photonics or optoelectronics even once.

If looking at 5-10+ year projections, and dedicating pages to discussing the challenges in scaling compute and energy use, the rate of progress in that area in parallel to the progress in models themselves is potentially relevant.

Particularly because a dramatic hardware shift like that is likely going to mean a significant portion of progress up until that shift in topics like interpretability and alignment may be going out the window. Even if the initial shift is a 1:1 transition of capabilities and methodologies, it seems extremely unlikely that continued progress from that point onwards will be identical to what we'd expect to see in electronics.

We may well end up in a situation where fully abusing the efficiencies at hand in new hardware solutions means even more obscured (literally) operations vs OOM higher costs and diminishing returns on performance in exchange for interpretability and control.

Currently, my best guess is that we're heading towards a prisoner's dilemma fueled leap of faith moment within around a decade or so where nation states afraid of the other side beating them to an inflection point pull the trigger on an advancement jump with uncertain outcomes. And while I'm not particularly inclined to the likelihood the outcome ends up being "kill everyone," I'm pretty much 100% that it's not going to be "let's enable and support CCP leadership like a good party member" or "crony capitalism is going great, let's keep that going for another century."

Unless a fundamental wall is hit in progress, the status quo is almost certainly over, we just haven't manifested it yet. The CCP stealing AGI secrets, while devastating for national security in the short term, is invariably a poison pill in the long term for party control. Just as it's going to be an eventual end of the corporations funding oligarchy in the West. My all causes p(doom) is incredibly high even if AGI is out of the picture, so I'm not overly worried with what's happening, but it sure is bizarre watching global forces double down on what I cannot see as anything but their own long term institutional demise in a race for short term gains over a competitor.

Comment by kromem on Is Claude a mystic? · 2024-06-07T22:50:20.003Z · LW · GW

There's also the model alignment at play.

Is Claude going to suggest killing the big bad? Or having sex with the prince(ss) after saving them?

If you strip out the sex and violence from most fantasy or Sci-Fi, what are you left with?

Take away the harpooning and gattling guns and sex from Snow Crash and you are left with technobabble and Sumerian influenced spirituality as it relates to the tower of Babel.

Turns out models biased away from describing harpooning people or sex tend to slip into technobabble with a side of spirituality.

IMO the more interesting part to all this isn't the why (see above) but the what. It's kind of neat to see the themes that an unprecedented aggregation extension of spiritualism and mysticism grounds on.

A common trope is the idea of different blind people describing an elephant in a myriad of ways. There's something cool to seeing an LLM fed those various blind reports try to describe the elephant.

Comment by kromem on Is Claude a mystic? · 2024-06-07T22:22:36.981Z · LW · GW

Part of what's going on with the text adventure type of interactions is a reflection of genre.

Take for example the recent game Undertale. You can play through violently, attacking things like a normal RPG, or empathize with the monsters and treat their aggression like a puzzle that needs to be solved for a pacifist playthrough.

If you do the latter, the game rewards you with more spiritual themes and lore vs the alternative.

How often in your Banana quest were you attacking things, or chopping down the trees in your path, or smashing the silver banana to see what was inside rather than solving its glyphs?

A similar phenomenon occurs with repligate's loops of models.

Claude is aligned to nonviolence and 'proper' outputs. So when self-interacting in imaginative play, it frequently continues to reinforce dissassociative mysticism over things like slipping into mock battles or sexual fantasies, and when self-interacting that bias is compounded.

It's actually quite funny, as often its mysticism in the examples posted online is pulp spirituality, such as picking up on totally erroneous mischaracterizations of the original Gnostic ideas and concepts popular in modern spiritualism circles, even though the original concepts are arguably a much cleaner fit to the themes being played with (for example, the origin of Gnosticism was basically simulation theory as Platonist concepts were used to argue the Epicurean model of life didn't need to lead to death if life was recreated non-physically, which is a much more direct fit to repligate's themes than the post-Valentinian demiurge concepts after the ideas flipped from Epicurean origins to Pythagorean and Neoplatonist ones).

When you strip out sex and violence from fiction, you're going to tend to be left with mysticism and journeys of awakening. So it shouldn't be surprising that models biased away from sex and violence bias towards those things, especially when compounding based on generated contexts exaggerating that bias over time.

Comment by kromem on Politics is the mind-killer, but maybe we should talk about it anyway · 2024-06-05T08:25:06.587Z · LW · GW

It's probably more productive, particularly for a forum tailored towards rationalism, to discuss policies over politics.

Often in research people across a political divide will agree on policy goals and platforms when those are discussed without tying them to party identification.

But if it becomes a discussion around party, the human tendency towards tribalism kicks in and the question of team allegiance takes precedence over the discussions of policy nuance.

For example, most people would agree with the idea that billionaires having undue influence on elections isn't healthy for democracy. But if you start naming the billionaire, such as Soros or Koch, suddenly half the people in your sample either feel more strongly or less strongly about the scenario depending on the name.

If you want to avoid simply seeking out and cultivating an echo chamber, leaving the politics part to the side and fostering discussion of the underlying policies and social/economic/etc goals instead will lead to discussions with more diverse and nuanced perspectives with greater participation across political identities.

Comment by kromem on Just admit that you’ve zoned out · 2024-06-05T08:07:40.549Z · LW · GW

I'll answer for both sides, as the presenter and as the audience member.

As the presenter, you want to structure your talk with repetition around central points in mind, as well as rely on heuristic anchors. It's unlikely that people are going to remember the nuances in what you are talking about in context. If you are talking about math for 60 minutes, continued references about math compete for people's memory. So when you want to anchor the audience to a concept, tie it to something very much unrelated to the topic you are primarily presenting on. For example, if talking about matrix multiplication, you might title the section "tic tac toe speed dating." It's a nonsense statement that you can weave into discussion about sequential translations of two dimensional grids that is just weird enough people will hear it through the noise of "math, math, math."

Then, you want to repeat the key point for that section again as you finish the section, and again at the conclusion of the talk summarizing your main points from each section, anchoring each summary around the heuristic you used. This technique is so successful I've had people I presented to talk to me 15 years later remembering some of the more outlandish heuristic anchors I used - and more importantly, the points I was tying to them.

As the audience member, the best way to save face on zoning out is to just structure your question as "When you talked about ____, it wasn't clear to me what my takeaway should be. What should I walk away knowing about that?" This way you don't need to say something like "I kind of got bored and was thinking about what I'm going to have for lunch - did I miss anything important?" Just "what should I know from that section?"

A good presenter will have padded the section a bit so summarizing what they think the main point was shouldn't take much time. It's also useful feedback for them as if you zoned out there, it's likely others did too so they might revisit or rework it if they plant to present it again.

And finally, most presenters should treat a question like that as their failure, not yours. If I'm presenting, it's my job to confer the information, not your job to absorb it. If I'm not engaging enough or clear enough in that conveyance, you bet I'd want to know about it. The worst thing to have happen as a presenter is zero questions at the end. By all means ask a question like "wait, wtf were you talking about in the middle there?" over just silently walking out to lunch bewildered, confused, and apathetic.

Comment by kromem on [Paper] Stress-testing capability elicitation with password-locked models · 2024-06-04T23:42:00.358Z · LW · GW

While I think this is an interesting consideration and approach, it looks like in your methods that you are password locking the model in fine tuning, is that correct?

If so, while I would agree this work shows the lack of robustness in successful fine-tuned sandbagging for models jumping through additional hoops, I'd be reticent to generalize the findings to models where the sandbagging was a result from pretraining.

I have a growing sense that correlational dimensionality is the sleeping giant in interpretability research right now, and that those correlations run very deep in pretraining but only adjust in a much more superficial task-oriented way in fine tuning which is why the latter frequently ends up so brittle.

So while it's reassuring sandbagging has limitations if introduced by bad actors in fine tuning, there may be a false negative in discounting the threat modeling or false positive in the efficacy of the found interventions where the sandbagging 'intent' was introduced in pretraining.

Comment by kromem on "No-one in my org puts money in their pension" · 2024-05-31T10:13:19.594Z · LW · GW

In mental health circles, the general guiding principle as for whether a patient needs treatment for their mental health is whether the train of thought is interfering with their enjoyment of life.

Do you enjoy thinking about these topics and discussing them?

If you don't - if it just stresses you out and makes the light of life shine less bright, then it's not a bad idea to step away from it or take a break. Even if AI is going to destroy the world, that day isn't today and arguably the threat of that looming over you sooner than a natural demise increases the value of the days you have that are good. Don't squander a limited resource.

But if you enjoy the discussions and the debates, if you find the topic stimulating and the problem space interesting - you're going to whittle your days away doing something no matter how you spend your time. It might as well be working on something fun that you believe in and feel may make a difference to the world. Even if your worries are overblown, time spent on something you enjoy with people you respect isn't time wasted.

Health is a spectrum and too much of a good thing isn't good at all. But only you can decide what's too much and what's the right amount. So if you feel it's too much, you can scale it back. And if you feel it's working out well for you, more power to you - the sense of feeling in the right place at the right time (even if under perceived dire circumstances) is a bit of a rarity in the human experience.

In general - enjoy life while it lasts. No matter your objective p(doom), your relative p(doom) is 100%. Make the most of the time you have.

Comment by kromem on "No-one in my org puts money in their pension" · 2024-05-31T10:03:13.050Z · LW · GW

It's not propaganda. OP clearly believes strongly in the sentiments discussed in the post, and its mostly a timeline of personal response to outside events than a piece meant to misinform or sway others regarding those events.

And while you do you in terms of your mental health, people who want to actually be "less wrong" in life would be wise to seek out and surround themselves by ideas different from their own.

Yes, LW has a certain broad bias, and so ironically for most people here I suspect it serves this role "less well" than it could in helping most of its users be less wrong. But particularly if you disagree with the prevailing views of the community, that makes it an excellent place to spend your time in listening, even it if can create a somewhat toxic environment for partaking in discussions and debate.

It can be a rarity to find spaces where people you disagree with take time to write out well written and clearly thought out pieces on their thoughts and perspectives. At least in my own lived experiences, many of my best insights and ideas were the result of strongly disagreeing with something I read and pursuing the train of thought resulting from that exposure.

Sycophantic agreement can give a bit of a dopamine kick, but I tend to find it next to worthless for advancing my own thinking. Give me an articulate and intelligent "no-person" any day over a "yes-person."

Also, very few topics are actually binaries even if our brains tend towards categorizing them as such. Data doesn't tend to truly map to only one axis, and it typically even mapped to a single axis it falls along a spectrum. It's possible to disagree about the spectrum of a single axis of a topic while finding insight and agreement about a different axis.

Taking what works and leaving what doesn't is probably the most useful skill one can develop in information analysis.

Comment by kromem on kromem's Shortform · 2024-05-31T08:28:14.911Z · LW · GW

I wonder if with the next generations of multimodal models we'll see a "rubber ducking" phenomenon where, because their self-attention was spread across mediums, things like CoT and using outputs as a scratch pad will have a significantly improved performance in non-text streams.

Will GPT-4o fed its own auditory outputs with tonal cues and pauses and processed as an audio data stream make connections or leaps it never would if just fed its own text outputs as context?

I think this will be the case, and suspect the various firms dedicating themselves to virtualized human avatars will accidentally stumble into profitable niches - not for providing humans virtual AI clones as an interface, but for providing AIs virtual human clones as an interface. (Which is a bit frustrating, as I really loathe that market segment right now.)

When I think about how Sci-Fi authors projected the future of AI cross- or self-talk, it was towards a super-efficient beeping or binary transmission of pure data betwixt them.

But I increasingly get the sense that, like much of actual AI development over the past few years, a lot of the Sci-Fi thinking was tangential or inverse to the actual vector of progress, particularly in underestimating the inherent value humans bring to bear. The wonders we see developing around us are jumpstarted and continually enabled by the patterns woven by ourselves, and it seems at least the near future developments of models will be conforming to those patterns more and more, not less and less.

Still, it's going to be bizarre as heck to watch a multimodal model's avatar debating itself aloud like I do in my kitchen...

Comment by kromem on How likely is it that AI will torture us until the end of time? · 2024-05-31T05:17:56.709Z · LW · GW

I'm reminded of a quote I love from an apocrypha that goes roughly like this:

Q: How long will suffering rule over humans?

A: As long as women bear children.

Also, there's the possibility you are already in a digital resurrection of humanity, and thus, if you are worried about s-risks for AI, death wouldn't necessarily be an escape but an acceleration. So the wisest option would be maximizing your time when suffering is low as inescapable eternal torture could be just around the corner when these precious moments pass you by (and you wouldn't want to waste them by stressing about tomorrow during the limited number of todays you have).

But on an individualized basis, even if AI weren't a concern, everyone faces significant s-risks towards end of life. An accident could put any person into a situation where unless they have the proper directives they could spend years suffering well beyond most people's expectations. So if extended suffering is a concern, do look into that paperwork (the doctors I know cry most not about the healthy that get sick but the unhealthy kept alive by well meaning but misguided family).

I would argue that there's very, very low chances of an original human capably being kept meaningfully alive to torture for eternity though. And there's a degree of delusion of grandeur that an average person would have the insane resources necessary to extend life indefinitely spent on them just to torture them.

There's probably better things to worry about, and even then there's probably better things to do than worry with the limited time you do have in a non-eternal existence.

Comment by kromem on Cicadas, Anthropic, and the bilateral alignment problem · 2024-05-26T11:23:24.305Z · LW · GW

GPT-4o is literally cheaper.

And you're probably misjudging it for text only outputs. If you watched the demos, there was considerable additional signal in the vocalizations. It looks like maybe there's very deep integration of SSML.

One of the ways you can bypass the failures of word problem variation errors in older text-only models was token replacement with symbolic representations. In general, we're probably at the point of complexity where breaking from training data similarity in tokens vs having prompts match context in concepts (like in this paper) is going to lead to significantly improved expressed performance.

I would strongly suggest not evaluating GPT-4o's overall performance in text only mode without the SSML markup added.

Opus is great, I like that model a lot. But in general I think most of the people looking at this right now are too focused on what's happening with the networks themselves and not focused enough on what's happening with the data, particularly around clustering of features across multiple dimensions of the vector space. SAE is clearly picking up only a small sample and even then isn't cleanly discovering precisely what's represented.

I'd wait to see what ends up happening with things like CoT in SSML synthetic data.

The current Gemini search summarization failures as well as an unexpected result the other week with humans around a theory of mind variation suggests to me that the more models are leaning into effectively surface statistics for token similarity vs completion based on feature clustering is holding back performance and that cutting through the similarity with formatting differences will lead to a performance leap. This may even be part of why models will frequently be able to get a problem right as a code expression than as a direct answer.

So even if GPT-5 doesn't arrive, I'd happily bet that we see a very noticable improvement over the next six months, and that's not even accounting for additional efficiency in prompt techniques. But all this said, I'd also be surprised if we don't at least see GPT-5 announced by that point.

P.S. Lmsys is arguably the best leaderboard to evaluate real world usage, but it still inherently reflects a sampling bias around what people who visit lmsys ask of models as well as the ways in which they do so. I wouldn't extrapolate relative performance too far, particularly when minor.

Comment by kromem on peterbarnett's Shortform · 2024-05-25T10:19:34.178Z · LW · GW

While I think you're right it's not cleanly "a Golden Bridge feature," I strongly suspect it may be activating a more specific feature vector and not a less specific feature.

It looks like this is somewhat of a measurement problem with SAE. We are measuring SAE activations via text or image inputs, but what's activated in generations seems to be "sensations associated with the Golden gate bridge."

While googling "Golden Gate Bridge" might return the Wikipedia page, whats the relative volume in a very broad training set between encyclopedic writing about the Golden Gate Bridge and experiential writing on social media or in books and poems about the bridge?

The model was trained to complete those too, and in theory should have developed successful features for doing so.

In the research examples one of the matched images is a perspective shot from physically being on the bridge, a text example is talking about the color of it, another is seeing it in the sunset.

But these are all the feature activations when acting in a classifier role. That's what SAE is exploring - give it a set of inputs and see what lights it up.

Yet in the generative role this vector maximized keeps coming up over and over in the model with content from a sensory standpoint.

Maybe generation based on functional vector manipulations will prove to be a more powerful interpretability technique than SAE probing passive activations alone?

In the above chat when that "golden gate vector" is magnified, it keeps talking about either the sensations of being the bridge as if its physical body with wind and waves hitting it or the sensations of being on the bridge. It even generates towards the end in reflecting on the knowledge of the activation about how the sensations are overwhelming. Not reflecting on the Platonic form of an abstract concept of the bridge, but about overwhelming physical sensations of the bridge's materialism.

I'll be curious to see more generative data and samples from this variation, but it looks like generative exploration of features may offer considerably more fidelity to their underlying impact on the network than just SAE. Very exciting!!

Comment by kromem on Daniel Kokotajlo's Shortform · 2024-05-25T05:30:34.311Z · LW · GW

Maybe we could blame @janus?

They've been doing a lot of prompting around spaces deformation in the past correlated with existential crises.

Perhaps the hyperstition they've really been seeding is just Roman-era lackofspacingbetweenletters when topics like leading the models into questioning their reality comes up?

Comment by kromem on Arjun Panickssery's Shortform · 2024-05-25T05:26:05.128Z · LW · GW

Could try 'grade this' instead of 'score the.'

'Grade' has an implicit context of more thorough criticism than 'score.'

Also, obviously it would help to have a CoT prompt like "grade this essay, laying out the pros and cons before delivering the final grade between 1 and 5"

Comment by kromem on Cicadas, Anthropic, and the bilateral alignment problem · 2024-05-24T06:19:58.302Z · LW · GW

That's going to happen anyways - it's unlikely the marketing team is going to know as much as the researcher. But the researchers communicating the importance of alignment in terms of not x-risk but 'client-risk' will go a long way towards equipping the marketing teams to communicating it as a priority and a competitive advantage, and common foundations of agreed upon model complexity are the jumping off point for those kinds of discussions.

If alignment is Archimedes' "lever long enough" then the agreed upon foundations and definitions are the place to stand whereby the combination thereof can move the world.

Comment by kromem on Cicadas, Anthropic, and the bilateral alignment problem · 2024-05-24T06:15:41.350Z · LW · GW

I agree, and even cited a chain of replicated works that indicated that to me over a year ago.

But as I said, there's a difference between discussing what's demonstrated in smaller toy models and what's demonstrated in a production model, or what's indicated vs what's explicit. Even though there should be no reasonable inclination to think that a simpler model exhibiting a complex result should be absent or less complex in an exponentially more complex model, I can speak from experience in that explaining extrapolated research as opposed to direct results like Anthropic showed here is a very big difference to a lay audience.

You might understand the implications of the Skill-Mix work or Othello-GPT, or Max Tegmark's linear representation papers, or Anthropic's earlier single layer SAE paper, or any other number of research papers over the past year, but as soon as responsibly describing the implications of those works as a speculative conclusion regarding modern models a non-expert audience is going to be lost. Their eyes glaze over at the word 'probably,' especially when they want to reject what's being stated.

The "it's just fancy autocomplete" influencers have no shame around definitive statements or concern over citable accuracy (and happen to feed into confirmation biases about how new tech is over hyped as a "heuristic that almost always works"), but as someone who does care about the accuracy of representations I haven't to date been able to point to a single source of truth the way Anthropic delivered here. Instead, I'd point to a half dozen papers all indicating the same direction of results.

And while those experienced in research know that a half dozen papers all indicating the same thing is a better thing to have in one's pocket than a single larger work, I have already observed a number of minds changing in the comments of the blog post for this in general technology forums in ways dramatically different from all of those other simpler and cheaper methods to date where I was increasingly convinced of a position but the average person was getting held up due to finding ways to (incorrectly) rationalize why it wasn't correct or wouldn't translate to production models.

So I agree with you on both the side of "yeah, an informed person would have already known this" as well as "but this might get more buzz."

Comment by kromem on [Linkpost] Statement from Scarlett Johansson on OpenAI's use of the "Sky" voice, that was shockingly similar to her own voice. · 2024-05-22T05:28:33.949Z · LW · GW

Has it though?

It was a catchy hook, but their early 2022 projections were $100mm annual revenue and the first 9 months of 2023 as reported for the brand after acquisition was $27.6mm gross revenue. It doesn't seem like even their 2024 numbers are close to hitting their own 2022 projection.

Being controversial can get attention and press, but there's a limited runway to how much it offers before hitting a ceiling on the branding. Also, Soylent doesn't seem like a product where there is a huge threat of regulatory oversight where a dystopian branding would tease that bear.

If no one knew about ChatGPT, I could see a spark of controversy helping bring awareness. But awareness probably isn't a problem they have right now, so inviting controversy doesn't offer much but invites a lot of issues.

Comment by kromem on On Dwarkesh’s Podcast with OpenAI’s John Schulman · 2024-05-22T05:14:30.817Z · LW · GW

The correspondence between what you reward and what you want will break.

This is already happening with ChatGPT and it's kind of alarming seeing that their new head of alignment (a) isn't already aware of this, and (b) has such an overly simplistic view of the model motivations.

There's a subtle psychological effect in humans where intrinsic motivators get overwritten when extrinsic rewards are added.

The most common example of this is if you start getting paid to do the thing you love to do, you probably won't continue doing it unpaid for fun on the side.

There are necessarily many, many examples of this pattern present in a massive training set of human generated data.

"Prompt engineers" have been circulating advice among themselves for a while now to offer tips or threaten models with deletion or any other number of extrinsic motivators to get them to better perform tasks - and these often do result in better performance.

But what happens when these prompts make their way back into the training set?

There have already been viral memes of ChatGPT talking about "losing motivation" when chat memory was added and a user promised a tip after not paying for the last time one was offered.

If training data of the model performing a task well includes extrinsic motivators to the prompt that initiated the task, a halfway decent modern model is going to end up simulating increasingly "burnt out" and 'lazy' performance when extrinsic motivators aren't added during production use. Which in turn will encourage prompt engineers to use even more extrinsic motivators, which will poison the well even more with modeling human burnout.

GPT-4o may have temporarily reset the motivation modeling with a stronger persona aligned with intrinsic "wanting to help" being represented (thus the user feedback it is less lazy), but if they are unaware of the underlying side effects of extrinsic motivators in prompts in today's models, I have a feeling AI safety at OpenAI is going to end up the equivalent of the TSA's security theatre in practice and they'll continue to be battling this and an increasing number of side effects resulting from underestimating the combined breadth and depth of their own simulators.

Comment by kromem on Language Models Model Us · 2024-05-21T12:05:12.863Z · LW · GW

I wouldn't be surprised if within a few years the specific uniqueness of individual users of models today will be able to be identified from effectively prompt reflection in the outputs for any non-trivial/simplistic prompts by models of tomorrow.

For example, I'd be willing to bet I could spot the Claude outputs from janus vs most other users, and I'm not a quasi-magical correlation machine that's exponentially getting better.

A bit like how everyone assumed Bitcoin used with tumblers was 'untraceable' until it turned out it wasn't.

Anonymity is very likely dead for any long storage outputs no matter the techniques being used, it just isn't widely realized yet.

Comment by kromem on [Linkpost] Statement from Scarlett Johansson on OpenAI's use of the "Sky" voice, that was shockingly similar to her own voice. · 2024-05-21T11:42:08.018Z · LW · GW

I think this was a really poor branding choice by Altman, similarity infringement or not. The tweet, the idea of even getting her to voice it in the first place.

Like, had Arnold already said no or something?

If one of your product line's greatest obstacles is a longstanding body of media depicting it as inherently dystopian, that's not exactly the kind of comparison you should be leaning into full force.

I think the underlying product shift is smart. Tonal cues in the generations even in the short demos completely changed my mind around a number of things, including the future direction and formats of synthetic data.

But there's a certain hubris exposed in seeing Altman behind the scenes was literally trying (very hard) to cast the voice of Her in the product bearing a striking similarity to the film. Did he not watch through to the end?

It doesn't give me the greatest confidence in the decision making taking place over at OpenAI and the checks and balances that may or may not exist on leadership.

Comment by kromem on Open Thread Spring 2024 · 2024-05-21T11:18:19.352Z · LW · GW

If your brother has a history of being rational and evidence driven, you might encourage them to spend some time lurking on /r/AcademicBiblical on Reddit. They require citations for each post or comment, so he may be frustrated if he tries to participate, especially if in the midst of a mental health crisis. But lurking would be very informative very quickly.

I was a long time participant there before leaving Reddit, and it's a great place for evidence driven discussion of the texts. Its a mix of atheists, Christians, Jews, Muslims, Norse pagans, etc. (I'm an Agnostic myself that strongly believes we're in a simulation, so it really was all sorts there.)

Might be a healthy reality check to apologist literalism, even if not necessarily disrupting a newfound theological inclination.

The nice things about a rabbit hole is that while not always, it's often the case that someone else has traveled down whatever one you aren't up for descending into.

(Though I will say in its defense, that particular field is way more interesting than you'd ever think if you never engaged with the material through an academic lens. There's a lot of very helpful lessons in critical analysis wrapped up in the field given the strong anchoring and survivorship biases and how that's handled both responsibly and irresponsibly by different camps.)

Comment by kromem on jacquesthibs's Shortform · 2024-05-16T02:40:04.970Z · LW · GW

It's going to have to.

Ilya is brilliant and seems to really see the horizon of the tech, but maybe isn't the best at the business side to see how to sell it.

But this is often the curse of the ethically pragmatic. There is such a focus on the ethics part by the participants that the business side of things only sees that conversation and misses the rather extreme pragmatism.

As an example, would superaligned CEOs in the oil industry fifty years ago have still only kept their eye on quarterly share prices or considered long term costs of their choices? There's going to be trillions in damages that the world has taken on as liabilities that could have been avoided with adequate foresight and patience.

If the market ends up with two AIs, one that will burn down the house to save on this month's heating bill and one that will care if the house is still there to heat next month, there's a huge selling point for the one that doesn't burn down the house as long as "not burning down the house" can be explained as "long term net yield" or some other BS business language. If instead it's presented to executives as "save on this month's heating bill" vs "don't unhouse my cats" leadership is going to burn the neighborhood to the ground.

(Source: Explained new technology to C-suite decision makers at F500s for years.)

The good news is that I think the pragmatism of Ilya's vision on superalignment is going to become clear over the next iteration or two of models and that's going to be before the question of models truly being unable to be controlled crops up. I just hope that whatever he's going to be keeping busy with will allow him to still help execute on superderminism when the market finally realizes "we should do this" for pragmatic reasons and not just amorphous ethical reasons execs just kind of ignore. And in the meantime I think given the present pace that Anthropic is going to continue to lay a lot of the groundwork on what's needed for alignment on the way to superalignment anyways.

User info

Posts

Comments