Posts

Does the Universal Geometry of Embeddings paper have big implications for interpretability? 2025-05-26T18:20:48.111Z
Evan R. Murphy's Shortform 2025-02-28T00:56:55.873Z
Steven Pinker on ChatGPT and AGI (Feb 2023) 2023-03-05T21:34:14.846Z
Steering Behaviour: Testing for (Non-)Myopia in Language Models 2022-12-05T20:28:33.025Z
Paper: Large Language Models Can Self-improve [Linkpost] 2022-10-02T01:29:00.181Z
Google AI integrates PaLM with robotics: SayCan update [Linkpost] 2022-08-24T20:54:34.438Z
Surprised by ELK report's counterexample to Debate, IDA 2022-08-04T02:12:15.139Z
New US Senate Bill on X-Risk Mitigation [Linkpost] 2022-07-04T01:25:57.108Z
Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios 2022-05-12T20:01:56.400Z
Introduction to the sequence: Interpretability Research for the Most Important Century 2022-05-12T19:59:52.911Z
What is a training "step" vs. "episode" in machine learning? 2022-04-28T21:53:24.785Z
Action: Help expand funding for AI Safety by coordinating on NSF response 2022-01-19T22:47:11.888Z
Promising posts on AF that have fallen through the cracks 2022-01-04T15:39:07.039Z

Comments

Comment by Evan R. Murphy on Evan R. Murphy's Shortform · 2025-06-11T21:06:25.793Z · LW · GW

Starting to be some discussion on LW now, e.g.

 https://www.lesswrong.com/posts/5uw26uDdFbFQgKzih/beware-general-claims-about-generalizable-reasoning

https://www.lesswrong.com/posts/tnc7YZdfGXbhoxkwj/give-me-a-reason-ing-model

Comment by Evan R. Murphy on Evan R. Murphy's Shortform · 2025-06-11T19:10:37.866Z · LW · GW

I should have mentioned the above thoughts are a low-confidence take. I was mostly just trying to get the ball rolling on discussion because I couldn't find any discussion of this paper on LessWrong yet, which really surprised me because I saw the paper had been shared thousands of times on LinkedIn already.

Comment by Evan R. Murphy on Evan R. Murphy's Shortform · 2025-06-10T20:35:36.020Z · LW · GW

Thoughts on "The Ilusion of Thinking" paper that came out of Apple recently?

https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf

Seems to me like at least a point in favor of "stochastic parrots" over "builds a quality world model" for the language reasoning models.

Also wondering if their findings could be used to the advantage of safety/security somehow. E.g. if these models are more dependent on imitating examples than we realized, then it might also be more effective than we previously thought to purge training data of the types of knowledge and reasoning that we don't want them to have (e.g. knowledge of dangerous weapons development, scheming, etc.).

Comment by Evan R. Murphy on Interpretability Will Not Reliably Find Deceptive AI · 2025-05-28T04:49:19.045Z · LW · GW

I agree it's a good post, and it does take guts to tell people when you think that a research direction that you've been championing hard actually isn't the Holy Grail. This is a bit of a nitpick but not insubstantial:

Neel is talking about interpretability in general, not just mech-interp. He says his predictions already account for other, non-mech-interp approaches to interpretability that some researchers find promising, such as representation engineering (RepE), which Dan Hendrycks among others has been advocating for recently.

Comment by Evan R. Murphy on E.G. Blee-Goldman's Shortform · 2025-05-26T18:21:58.515Z · LW · GW

Let me know if anyone has thoughts on this question I just posted as well: Does the Universal Geometry of Embeddings paper have big implications for interpretability?

Comment by Evan R. Murphy on Interpretability Will Not Reliably Find Deceptive AI · 2025-05-20T20:53:08.077Z · LW · GW

Does representation engineering (RepE) seem like a game-changer for interpretability? I don't see it mentioned in your post, so I'm trying to figure out if it is baked into your predictions or not.

It seemed like Apollo was able to spin up a pretty reliable strategic deception detector (95-99% accurate) using linear probes even though the techniques are new, and generally it sounds like RepE is getting traction on some things that have been a slog for mech interp. Does it look plausible that RepE could get us to high-reliability interpretability on workable timelines, or are we likely to hit similar walls with that approach?

Thanks for your post Neel (and Gemini 2.5) - really important perspective on all this.

Comment by Evan R. Murphy on Evan R. Murphy's Shortform · 2025-02-28T18:38:43.237Z · LW · GW

"AI governance looking bleak" seems like an overstatement. Certain types or aims of AI governance are looking bleak right now, especially getting strong safety-oriented international agreements that include the US and China, or meaningful AI regulation at the national level in the US. But there may be other sorts of AI governance projects (e.g. improving the policies of frontier labs, preparing for warning shots, etc.) that could still be quite worthwhile.

Comment by Evan R. Murphy on How to Make Superbabies · 2025-02-28T02:39:00.270Z · LW · GW

Is there a summary of this post?

Comment by Evan R. Murphy on Evan R. Murphy's Shortform · 2025-02-28T00:56:55.868Z · LW · GW

2023: AI governance starting to look promising because governments are waking up about AI risks. Technical AI safety getting challenging if you're not in a frontier lab, because it's hard to access relevant models to run experiments.

2025: AI governance looking bleak after the AI Action Summit. Technical AI safety looking more accessible because open-weight models are proliferating.

Comment by Evan R. Murphy on Detecting Strategic Deception Using Linear Probes · 2025-02-26T20:31:43.272Z · LW · GW

It might.

My understanding (which could be off base) from reading the paper is that the method's accuracy in detecting various forms of deception was basically 96-99%. But they acknowledge that the sophisticated deception they're ultimately worried about will be harder to detect.

Still, 96-99% seems like a great start. And this was on detecting strategic deception, not just factual falsehoods. And they didn't even utilize the CoT outputs of the models.

(I think the "strategic deception" framing is also probably more general and not as dependent on unnecessary assumptions about how models work, compared to the "mesaoptimizer" framing.)

Comment by Evan R. Murphy on Detecting Strategic Deception Using Linear Probes · 2025-02-26T19:54:27.324Z · LW · GW

I also suspect training a specific dataset for "was your last response indicative of your maximum performance on the task?" would significantly boost the accuracy in the sandbagging case for this type of probe. I see this becoming more important as we move to more realistic sandbagging settings without explicit instructions to perform poorly since "deceiving someone" and "not trying very hard to explore this area of research" seem like non-trivially different concepts.

Good point!

Comment by Evan R. Murphy on Training on Documents About Reward Hacking Induces Reward Hacking · 2025-01-22T19:45:55.540Z · LW · GW

Y'all are on fire recently with this and the alignment faking paper.

Comment by Evan R. Murphy on An Introduction to Representation Engineering - an activation-based paradigm for controlling LLMs · 2025-01-18T05:58:50.040Z · LW · GW

Thanks for the useful write-up on RepE.

RE might find application in Eliciting Latent Knowledge, like identifying what a model internally believes to be true.

Application to ELK is exciting. I was surprised that you used the word "might" because it looked like Zhou et al. (2023) have already built a lie and hallucination detector using RepE. What do you see as left to be done in this area to elicit model beliefs with RepE?

Taking a closer look, I did find this in the paper's 4.3.2 section, acknowledging some limitations:

While these observations enhance our confidence that our reading vectors correspond to dishonest thought processes and behaviors, they also introduce complexities into the task of lie detection. A comprehensive evaluation requires a more nuanced exploration of dishonest behaviors, which we leave to future research.

I suppose there may also be a substantial gap between detecting dishonest statements and eliciting true beliefs in the model, but I'm conjecturing. What are your thoughts?
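
As a rough picture of what "reading vectors" involve, here is one simple variant of the direction-extraction step behind RepE-style honesty reading. (The actual paper uses a more involved, PCA-based pipeline; the data and shapes below are illustrative stand-ins.)

```python
# Sketch of a difference-of-means "reading vector" for honesty: collect activations
# under contrastive honest vs. dishonest instructions, take the mean difference as a
# direction, and score new activations by projecting onto it. Illustrative data only.
import numpy as np

rng = np.random.default_rng(0)
d_model = 4096
honest_acts = rng.normal(size=(500, d_model))     # activations under honest framings
dishonest_acts = rng.normal(size=(500, d_model))  # activations under dishonest framings

reading_vector = honest_acts.mean(axis=0) - dishonest_acts.mean(axis=0)
reading_vector /= np.linalg.norm(reading_vector)

def honesty_score(activation: np.ndarray) -> float:
    """Project an activation onto the honesty direction; higher = more 'honest'."""
    return float(activation @ reading_vector)

print(honesty_score(rng.normal(size=d_model)))
```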

Comment by Evan R. Murphy on Alignment Faking in Large Language Models · 2025-01-18T05:01:03.495Z · LW · GW

Agree, I'm surprised that a model which can reason about its own training process wouldn't also reason that the "secret scratchpad" might actually be surveilled and so avoid recording any controversial thoughts there. But it's lucky for us that some of these models have been willing to write interesting things on the scratchpad at least at current capability levels and below, because Anthropic has sure produced some interesting results from it (IIRC they used the scratchpad technique in at least one other paper).

Comment by Evan R. Murphy on Alignment Faking in Large Language Models · 2025-01-18T03:29:18.433Z · LW · GW
  • Develop an architecture which has very little opaque reasoning (e.g. not much more than we see in current LLMs) but is sufficiently competitive using legible reasoning (e.g. CoT)


Don't you think CoT seems quite flawed right now? From https://arxiv.org/abs/2305.04388: "Building more transparent and explainable systems will require either improving CoT faithfulness through targeted efforts or abandoning CoT in favor of alternative methods."

Thanks for your great paper on alignment faking, by the way.

Comment by Evan R. Murphy on Applying refusal-vector ablation to a Llama 3 70B agent · 2024-10-21T17:13:28.761Z · LW · GW

Exposing the weaknesses of fine-tuned models like the Llama 3.1 Instruct models against refusal vector ablation is important, because the industry currently seems to over-rely on these safety techniques.

It's worth noting that refusal vector ablation isn't even necessary for this sort of malicious use with Llama 3.1, though, because Meta also released the base pretrained models without instruction finetuning (unless I'm misunderstanding something?).

Saw that you have an actual paper on this out now. Didn't see it linked in the post so here's a clickable for anyone else looking: https://arxiv.org/abs/2410.10871 .
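
For anyone unfamiliar with the technique under discussion, here is a schematic of what refusal-vector ablation amounts to, assuming you already have the relevant layer activations in hand (illustrative shapes and random data, not the authors' code):

```python
# Schematic of refusal-vector ablation: estimate a "refusal direction" from the
# difference in mean activations on refused vs. benign prompts, then project that
# direction out of the residual stream at inference time. Illustrative data only.
import numpy as np

rng = np.random.default_rng(0)
d_model = 8192
refused_acts = rng.normal(size=(256, d_model))   # activations on prompts the model refuses
benign_acts = rng.normal(size=(256, d_model))    # activations on harmless prompts

refusal_dir = refused_acts.mean(axis=0) - benign_acts.mean(axis=0)
refusal_dir /= np.linalg.norm(refusal_dir)

def ablate_refusal(residual: np.ndarray) -> np.ndarray:
    """Remove the refusal component from a residual-stream activation."""
    return residual - (residual @ refusal_dir) * refusal_dir

patched = ablate_refusal(rng.normal(size=d_model))
print(abs(patched @ refusal_dir))  # ~0: the refusal component is gone
```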

Comment by Evan R. Murphy on Creating unrestricted AI Agents with Command R+ · 2024-10-21T16:38:34.295Z · LW · GW

Thanks for working on this. In case anyone else is looking for a paper on this, I found https://arxiv.org/abs/2410.10871 from the OP which looks like a similar but more up-to-date investigation on Llama 3.1 70B.

Comment by Evan R. Murphy on Newsom Vetoes SB 1047 · 2024-10-03T18:02:21.876Z · LW · GW

I only see bad options, a choice between an EU-style regime and doing essentially nothing.

What issues do you have with the EU approach? (I assume you mean the EU AI Act.)

Thoughtful/informative post overall, thanks.

Comment by Evan R. Murphy on Simple probes can catch sleeper agents · 2024-06-14T23:45:20.467Z · LW · GW

Wow this seems like a really important breakthrough.

Are defection probes also a solution to the undetectable backdoor problem from Goldwasser et al. 2022?

Comment by Evan R. Murphy on Bing Chat is blatantly, aggressively misaligned · 2024-04-25T20:23:10.486Z · LW · GW

Thanks, I think you're referring to:

It may still be possible to harness the larger model capabilities without invoking character simulation and these problems, by prompting or fine-tuning the models in some particular careful ways.

There were some ideas proposed in the paper "Conditioning Predictive Models: Risks and Strategies" by Hubinger et al. (2023). But since it was published over a year ago, I'm not sure if anyone has gotten far on investigating those strategies to see which ones could actually work. (I'm not seeing anything like that in the paper's citations.)

Comment by Evan R. Murphy on On green · 2024-04-03T16:07:38.907Z · LW · GW

Really fascinating post, thanks.

On green as according to black, I think there's an additional facet perhaps even more important than just the acknowledgment that sometimes we are too weak to succeed and so should conserve energy. Black, being strongly self-interested, will tend to cast aside virtues like generosity, honesty and non-harm except as means in the social games it plays to achieve other ends for itself. But self-interest tends to include a desire to reduce self-suffering. Green + white* (as I'm realizing this may be more a color combo than purely green) are more inclined to discover, e.g. through meditation/mindfulness, that aggression, deceit and other non-virtues actually produce self-suffering in the mind as a byproduct. So black is capable of embracing virtue as part of a more complete pursuit of self-interest.**

It may be that one of the most impactful things that green + white can do is get black to realize this fact, since black will tend to be powerful and successful in the world at promoting whatever it understands to be its self interest.

I haven't read your post on attunement yet, maybe you touch on this or related ideas there.

--

*You could argue this also includes blue and so should be green + white + blue, since it largely deals with knowledge of self.

**I believe this fact of non-virtue inflicting self-suffering is true for most human minds. However, there may be cases where a person has some sort of psychological disorder that makes them effectively lack a conscience where it wouldn't hold.

Comment by Evan R. Murphy on On Devin · 2024-03-22T05:52:06.961Z · LW · GW

But in this case Patrick Collison is a credible source and he says otherwise.

Patrick Collison: These aren’t just cherrypicked demos. Devin is, in my experience, very impressive in practice

Patrick is an investor in Cognition. So while he may still be credible in this case, he also has a conflict of interest.

Comment by Evan R. Murphy on Sam Altman's ouster at OpenAI was precipitated by letter to board about AI breakthrough - Reuters · 2023-11-30T00:18:28.465Z · LW · GW

Reading that page, The Verge's claim seems to all hinge on this part:

OpenAI spokesperson Lindsey Held Bolton refuted that notion in a statement shared with The Verge: “Mira told employees what the media reports were about but she did not comment on the accuracy of the information."

They are saying that Bolton "refuted" the notion about such a letter, but the quote from her that follows doesn't actually sound like a refutation. Hence the Verge piece seems confusing/misleading, and I haven't yet seen any credible denial from the board about receiving such a letter.

Comment by Evan R. Murphy on Possible OpenAI's Q* breakthrough and DeepMind's AlphaGo-type systems plus LLMs · 2023-11-23T05:54:30.252Z · LW · GW

Yes though I think he said this at APEC right before he was fired (not after).

Comment by Evan R. Murphy on UFO Betting: Put Up or Shut Up · 2023-07-27T01:38:21.284Z · LW · GW

Carl, have you written somewhere about why you are confident that all UFOs so far are prosaic in nature? Would be interested to read/listen to your thoughts on this. (Alternatively, a link to some other source that you find gives a particularly compelling explanation is also good.)

Comment by Evan R. Murphy on My understanding of Anthropic strategy · 2023-07-18T01:38:08.150Z · LW · GW

Great update from Anthropic on giving majority control of the board to a financially disinterested trust: https://twitter.com/dylanmatt/status/1680924158572793856

Comment by Evan R. Murphy on Instrumental Convergence? [Draft] · 2023-06-15T16:38:17.750Z · LW · GW

Interesting... still taking that in.

Related question: Doesn't goal preservation typically imply self preservation? If I want to preserve my goal, and then I perish, I've failed because now my goal has been reassigned from X to nil.

Comment by Evan R. Murphy on Instrumental Convergence? [Draft] · 2023-06-15T05:57:22.383Z · LW · GW

Love to see an orthodoxy challenged!

Suppose Sia's only goal is to commit suicide, and she's given the opportunity to kill herself straightaway. Then, it certainly won't be rational for her to pursue self-preservation.

It seems you found one terminal goal which doesn't give rise to the instrumental subgoal of self-preservation. Are there others, or does basically every terminal goal benefit from instrumental self-preservation except for suicide?

(I skipped around a bit and didn't read your full post, so maybe you explain this already and I missed it.)

Comment by Evan R. Murphy on Intelligence Officials Say U.S. Has Retrieved Craft of Non-Human Origin · 2023-06-14T19:40:42.134Z · LW · GW

But if there really is a large number of intelligence officials earnestly coming forward with this

Yea, according to Michael Shellenberger's reporting on this, multiple "high-ranking intelligence officials, former intelligence officials, or individuals who we could verify were involved in U.S. government UAP efforts for three or more decades each" have come forward to vouch for Grusch's core claims.

Perhaps this is genuine whistleblowing, but not on what they make it sound like? Suppose there's something being covered up that Grusch et al. want to expose, but describing what it is plainly is inconvenient for one reason or another. So they coordinate around the wacky UFO story, with the goal being to point people in the rough direction of what they want looked at.

Interesting theory. Definitely a possibility.

Comment by Evan R. Murphy on Michael Shellenberger: US Has 12 Or More Alien Spacecraft, Say Military And Intelligence Contractors · 2023-06-11T03:52:16.482Z · LW · GW

What matters is the hundreds of pages and photos and hours of testimony given under oath to the Intelligence Community Inspector General and Congress.

Did Grusch already testify to Congress? I thought that was still being planned.

Comment by Evan R. Murphy on Dealing with UFO claims · 2023-06-11T01:22:50.310Z · LW · GW

Re: the tweet thread you linked to. One of the tweets is:

  1. Given that the DoD was effectively infiltrated for years by people "contracting" for the government while researching dino-beavers, there are now a ton of "insiders" who can "confirm" they heard the same outlandish rumors, leading to stuff like this: [references Michael Schellenberger]

Maybe, but this doesn't add up to me, because Shellenberger said his sources had had careers of multiple decades in the government agencies. It didn't sound like they just started their careers as contractors in 2008-2012.

Link to post with Shellenberger article details: https://www.lesswrong.com/posts/bhH2BqF3fLTCwgjSs/michael-shellenberger-us-has-12-or-more-alien-spacecraft-say

Comment by Evan R. Murphy on Intelligence Officials Say U.S. Has Retrieved Craft of Non-Human Origin · 2023-06-11T01:00:17.174Z · LW · GW

I guess the fact that this journalist says multiple other intelligence officials are anonymously vouching for Grusch's claims makes it interesting again: https://www.lesswrong.com/posts/bhH2BqF3fLTCwgjSs/michael-shellenberger-us-has-12-or-more-alien-spacecraft-say#comments

Comment by Evan R. Murphy on Intelligence Officials Say U.S. Has Retrieved Craft of Non-Human Origin · 2023-06-11T00:40:10.575Z · LW · GW

Wow that's awfully indirect. I'm surprised his speaking out is much of a story given this.

Comment by Evan R. Murphy on Intelligence Officials Say U.S. Has Retrieved Craft of Non-Human Origin · 2023-06-11T00:16:12.357Z · LW · GW

I don't know much about astronomy. But is it possible a more advanced alien civ has colonized much of the galaxy, but we haven't seen them because they anticipated the tech we would be using to make astronomical observations and know how to cloak from it?

Comment by Evan R. Murphy on Intelligence Officials Say U.S. Has Retrieved Craft of Non-Human Origin · 2023-06-09T15:17:00.541Z · LW · GW

The Guardian has been covering this story: https://www.theguardian.com/world/2023/jun/06/whistleblower-ufo-alien-tech-spacecraft

Comment by Evan R. Murphy on [deleted post] 2023-06-05T20:53:04.007Z

I wasn't saying that there are only a few research directions that don't require frontier models, period; just that there are only a few that both don't require frontier models and still seem relevant/promising, at least assuming short timelines to AGI.

I am skeptical that agent foundations is still very promising or relevant in the present situation. I wouldn't want to shut down someone's research in this area if they were particularly passionate about it or considered themselves on the cusp of an important breakthrough. But I'm not sure it's wise to be spending scarce incubator resources to funnel new researchers into agent foundations research at this stage.

Good points about mechanistic anomaly detection and activation additions though! (And mechanistic interpretability, but I mentioned that in my previous comment.) I need to read up more on activation additions.

Comment by Evan R. Murphy on [deleted post] 2023-06-05T20:47:33.752Z
Comment by Evan R. Murphy on Is Deontological AI Safe? [Feedback Draft] · 2023-06-01T19:28:51.894Z · LW · GW

Thanks for reviewing it! Yea of course you can use it however you like!

Comment by Evan R. Murphy on The Office of Science and Technology Policy put out a request for information on A.I. · 2023-05-30T18:06:06.526Z · LW · GW

Great idea. We need to make sure there are some submissions raising existential risks.

Deadline for the RFI: July 7, 2023 at 5:00pm ET

Comment by Evan R. Murphy on Is Deontological AI Safe? [Feedback Draft] · 2023-05-30T00:43:33.389Z · LW · GW

Would you agree with this summary of your post? I was interested in your post, but I didn't see a summary and didn't have time to read the whole thing just now. So I generated this using a summarizer script I've been working on for articles that are longer than the context windows of gpt-3.5-turbo and gpt-4.
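
(In case anyone is curious how such a script can work: the basic recipe is to split the article into chunks that fit the context window, summarize each chunk, and then summarize the concatenated chunk summaries. A minimal sketch, with `chat` as a hypothetical helper standing in for whatever LLM API you use:)

```python
# Minimal sketch of summarizing an article longer than the model's context window:
# chunk the text, summarize each chunk, then summarize the summaries.
# `chat` is a hypothetical helper standing in for an actual LLM API call.
def chat(prompt: str) -> str:
    raise NotImplementedError("wrap your preferred LLM API here")

def chunk_text(text: str, max_chars: int = 8000) -> list[str]:
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def summarize_long_article(article: str) -> str:
    chunk_summaries = [
        chat(f"Summarize this section of an article:\n\n{chunk}")
        for chunk in chunk_text(article)
    ]
    combined = "\n\n".join(chunk_summaries)
    return chat(f"Combine these section summaries into one coherent article summary:\n\n{combined}")
```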

It's a pretty interesting thesis you have if this is right, but I wanted to check if you spotted any glaring errors:

In this article, the author examines the challenges of aligning artificial intelligence (AI) with deontological morality as a means to ensure AI safety. Deontological morality, a popular ethical theory among professional ethicists and the general public, focuses on adhering to rules and duties rather than achieving good outcomes. Despite its strong harm-avoidance principles, the author argues that deontological AI may pose unique safety risks and is not a guaranteed path to safe AI.

The author explores three prominent forms of deontology: moderate views based on harm-benefit asymmetry principles, contractualist views based on consent requirements, and non-aggregative views based on separateness-of-persons considerations. The first two forms can lead to anti-natalism and similar conclusions, potentially endangering humanity if an AI system is aligned with such theories. Non-aggregative deontology, on the other hand, lacks meaningful safety features.

Deontological morality, particularly harm-benefit asymmetry principles, may make human extinction morally appealing, posing an existential threat if a powerful AI is aligned with these principles. The author discusses various ways deontological AI could be dangerous, including anti-natalist arguments, which claim procreation is morally unacceptable, and the paralysis argument, which suggests that it is morally obligatory to do as little as possible due to standard deontological asymmetries.

The author concludes that deontological morality is not a reliable path to AI safety and that avoiding existential catastrophes from AI is more challenging than anticipated. It remains unclear which approach to moral alignment would succeed if deontology fails to ensure safety. The article highlights the potential dangers of AI systems aligned with deontological ethics, especially in scenarios involving existential risks, such as an AI system aligned with anti-natalism that may view sterilizing all humans as permissible to prevent potential harm to new lives.

Incorporating safety-focused principles as strict, lexically first-ranked duties may help mitigate these risks, but balancing the conflicting demands of deontological ethics and safety remains a challenge. The article emphasizes that finding a reasonable way to incorporate absolute prohibitions into a broader decision theory is a complex problem requiring further research. Alternative ethical theories, such as libertarian deontology, may offer better safety assurances than traditional deontological ethics, but there is no simple route to AI safety within the realm of deontological ethics.

Comment by Evan R. Murphy on [deleted post] 2023-05-30T00:16:28.543Z

A couple of quick thoughts:

  • Very glad to see someone trying to provide more infrastructure and support for independent technical alignment researchers. Wishing you great success and looking forward to hearing how your project develops.
  • A lot of promising alignment research directions now seem to require access to cutting-edge models. A couple of ways you might deal with this could be:
    • Partner with AI labs to help get your researchers access to their models
    • Or focus on some of the few research directions such as mechanistic interpretability that still seem to be making useful progress on smaller, more accessible models
Comment by Evan R. Murphy on More information about the dangerous capability evaluations we did with GPT-4 and Claude. · 2023-05-26T23:51:14.169Z · LW · GW

We're working on a more thorough technical report.

Is the new Model evaluation for extreme risks paper the technical report you were referring to?

Comment by Evan R. Murphy on "notkilleveryoneism" sounds dumb · 2023-05-01T18:41:00.613Z · LW · GW

A few other possible terms to add to the brainstorm:

  • AI massive catastrophic risks
  • AI global catastrophic risks
  • AI catastrophic misalignment risks
  • AI catastrophic accident risks (paired with "AI catastrophic misuse risks")
  • AI weapons of mass destruction (WMDs) - Pro: a well-known term, Con: strongly connotes misuse so may be useful for that category but probably confusing to try and use for misalignment risks
Comment by Evan R. Murphy on [deleted post] 2023-04-26T20:40:48.982Z

As an aside, if you are located in Australia or New Zealand and would be interested in coordinating with me, please contact me through LessWrong on this account.

One potential source of leads for this might be the FLI Pause Giant AI Experiments open letter. I did a Ctrl+F search there for "Australia", which had 50+ results, and "New Zealand", which had 10+. So you might find some good people to connect with on there.

Comment by Evan R. Murphy on [deleted post] 2023-04-26T20:36:12.164Z

Upvoted. I think it's definitely worth pursuing well-thought-out advocacy in countries besides the US and China. Especially since this can be done in parallel with efforts in those countries.

A lot of people are working on the draft EU AI Act in Europe.

In Canada, parliament is considering Bill C-27 which may have a significant AI component. I do some work with an org called AIGS that is trying to help make that go well.

I'm glad to hear that some projects are underway in Australia and New Zealand and that you are pursuing this there!

Comment by Evan R. Murphy on WHO Biological Risk warning · 2023-04-25T19:32:09.858Z · LW · GW

Seems important. I'm guessing people are downvoting this because they consider it a possible infohazard.

Comment by Evan R. Murphy on Scaffolded LLMs as natural language computers · 2023-04-24T21:27:34.707Z · LW · GW

Post summary

I was interested in your post and noticed it didn't have a summary, so I generated one using a summarizer script I've been working on and iteratively improving:

Scaffolded large language models (LLMs) have emerged as a new type of general-purpose natural language computer. With the advent of GPT-4, these systems have become viable at scale, wrapping a programmatic scaffold around an LLM core to achieve complex tasks. Scaffolded LLMs resemble the von Neumann architecture, operating on natural language text rather than bits.

The LLM serves as the CPU, while the prompt and context function as RAM. The memory in digital computers is analogous to the vector database memory of scaffolded LLMs. The scaffolding code surrounding the LLM core implements protocols for chaining individual LLM calls, acting as the "programs" that run on the natural language computer.

Performance metrics for natural language computers include context length (RAM) and Natural Language OPerations (NLOPs) per second. Exponential improvements in these metrics are expected to continue for the next few years, driven by the increasing scale and cost of LLMs and their training runs.

Programming languages for natural language computers are in their early stages, with primitives like Chain of Thought, Selection-Inference, and Reflection serving as assembly languages. As LLMs improve and become more reliable, better abstractions and programming languages will emerge.

The execution model of natural language computers is an expanding Directed Acyclic Graph (DAG) of parallel NLOPs, resembling a dataflow architecture. Memory hierarchy in scaffolded LLMs currently has two levels, but as designs mature, additional levels may be developed.

Unlike digital computers, scaffolded LLMs face challenges in reliability, underspecifiability, and non-determinism. Improving the reliability of individual NLOPs is crucial for building powerful abstractions and abstract languages. Error correction mechanisms may be necessary to create coherent and consistent sequences of NLOPs.

Despite these challenges, the flexibility of LLMs offers great opportunities. The set of op-codes is not fixed but ever-growing, allowing for the creation of entire languages based on prompt templating schemes. As natural language programs become more sophisticated, they will likely delegate specific ops to the smallest and cheapest language models capable of reliably performing them.

If you have feedback on the quality of this summary, you can easily indicate that using LessWrong's agree/disagree voting.

Comment by Evan R. Murphy on Capabilities and alignment of LLM cognitive architectures · 2023-04-24T17:05:11.273Z · LW · GW

Post summary (experimental)

Here's an alternative summary of your post, complementing your TL;DR and Overview. This is generated by my summarizer script utilizing gpt-3.5-turbo and gpt-4. (Feedback welcome!)

The article explores the potential of language model cognitive architectures (LMCAs) to enhance large language models (LLMs) and accelerate progress towards artificial general intelligence (AGI). LMCAs integrate and expand upon approaches from AutoGPT, HuggingGPT, Reflexion, and BabyAGI, adding goal-directed agency, executive function, episodic memory, and sensory processing to LLMs. The author contends that these cognitive capacities will enable LMCAs to perform extensive, iterative, goal-directed "thinking" that incorporates topic-relevant web searches, thus increasing their effective intelligence.

While the acceleration of AGI timelines may be a downside, the author suggests that the natural language alignment (NLA) approach of LMCAs, which reason about and balance ethical goals similarly to humans, offers significant benefits compared to existing AGI and alignment approaches. The author also highlights the strong economic incentives for LMCAs, as computational costs are low for cutting-edge innovation, and individuals, small and large businesses are likely to contribute to progress. However, the author acknowledges potential difficulties and deviations in the development of LMCAs.

The article emphasizes the benefits of incorporating episodic memory into language models, particularly for decision-making and problem-solving. Episodic memory enables the recall of past experiences and strategies, while executive function focuses attention on relevant aspects of the current problem. The interaction between these cognitive processes can enhance social cognition, self-awareness, creativity, and innovation. The article also addresses the limitations of current episodic memory implementations in language models, which are limited to text files. However, it suggests that vector space search for episodic memory is possible, and language can encode multimodal information. The potential for language models to call external software tools, providing access to nonhuman cognitive abilities, is also discussed.

The article concludes by examining the implications of the NLA approach for alignment, corrigibility, and interpretability. Although not a complete solution for alignment, it is compatible with a hodgepodge alignment strategy and could offer a solid foundation for self-stabilizing alignment. The author also discusses the potential societal alignment problem arising from the development of LLMs with access to powerful open-source agents. While acknowledging LLMs' potential benefits, the author argues for planning against Moloch (a metaphorical entity representing forces opposing collective good) and accounting for malicious and careless actors. Top-level alignment goals should emphasize corrigibility, interpretability, harm reduction, and human empowerment/flourishing. The author also raises concerns about the unknown mechanisms of LLMs and the possibility of their output becoming deceptively different from the internal processing that generates it. The term x-risk AI (XRAI) is proposed to denote AI with a high likelihood of ending humanity. The author also discusses the principles of executive function and their relevance to LLMs, the importance of dopamine response in value estimation, and the challenges of ensuring corrigibility and interpretability in LMCA goals. In conclusion, the author suggests that while LLM development presents a wild ride, there is a fighting chance to address the potential societal alignment problem.

I may follow up with an object-level comment on your post, as I'm finding it super interesting but still digesting the content. (I am actually reading it and not just consuming this programmatic summary :)

Comment by Evan R. Murphy on Towards a solution to the alignment problem via objective detection and evaluation · 2023-04-24T15:39:43.556Z · LW · GW

Less compressed summary

Here's a longer summary of your article generated by the latest version of my summarizer script:

In this article, Paul Colognese explores whether detecting and evaluating the objectives of advanced AI systems during training and deployment is sufficient to solve the alignment problem. The idealized approach presented in the article involves detecting all objectives/intentions of any system produced during the training process, evaluating whether the outcomes produced by a system pursuing a set of objectives will be good/bad/irreversibly bad, and shutting down a system if its objectives lead to irreversibly bad outcomes.

The alignment problem for optimizing systems is defined as needing a method of training/building optimizing systems such that they never successfully pursue an irreversibly bad objective during training or deployment and pursue good objectives while rarely pursuing bad objectives. The article claims that if an overseer can accurately detect and evaluate all of the objectives of optimizing systems produced during the training process and during deployment, the overseer can prevent bad outcomes caused by optimizing systems pursuing bad objectives.

Robustly detecting an optimizing system’s objectives requires strong interpretability tools. The article discusses the problem of evaluating objectives and some of the difficulties involved. The role of interpretability is crucial in this approach, as it allows the overseer to make observations that can truly distinguish between good systems and bad-but-good-looking systems.

Detecting all objectives in an optimizing system is a challenging task, and even if the overseer could detect all of the objectives, it might be difficult to accurately predict whether a powerful optimizing system pursuing those objectives would result in good outcomes or not. The article suggests that with enough understanding of the optimizing system’s internals, it might be possible to directly translate from the internal representation of the objective to a description of the relevant parts of the corresponding outcome.

The article concludes by acknowledging that the proposed solution seems difficult to implement in practice, but pursuing this direction could lead to useful insights. Further conceptual and empirical investigation is suggested to better understand the feasibility of this approach in solving the alignment problem.

Comment by Evan R. Murphy on No Summer Harvest: Why AI Development Won't Pause · 2023-04-21T22:39:53.427Z · LW · GW

My claim is that AI safety isn't part of the Chinese gestalt.

Stuart Russell claims that Xi Jinping has referred to the existential threat of AI to humanity [1].

[1] 5:52 of Russell's interview on Smerconish: https://www.cnn.com/videos/tech/2023/04/01/smr-experts-demand-pause-on-ai.cnn