Posts

Does the Universal Geometry of Embeddings paper have big implications for interpretability? 2025-05-26T18:20:48.111Z
Evan R. Murphy's Shortform 2025-02-28T00:56:55.873Z
Steven Pinker on ChatGPT and AGI (Feb 2023) 2023-03-05T21:34:14.846Z
Steering Behaviour: Testing for (Non-)Myopia in Language Models 2022-12-05T20:28:33.025Z
Paper: Large Language Models Can Self-improve [Linkpost] 2022-10-02T01:29:00.181Z
Google AI integrates PaLM with robotics: SayCan update [Linkpost] 2022-08-24T20:54:34.438Z
Surprised by ELK report's counterexample to Debate, IDA 2022-08-04T02:12:15.139Z
New US Senate Bill on X-Risk Mitigation [Linkpost] 2022-07-04T01:25:57.108Z
Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios 2022-05-12T20:01:56.400Z
Introduction to the sequence: Interpretability Research for the Most Important Century 2022-05-12T19:59:52.911Z
What is a training "step" vs. "episode" in machine learning? 2022-04-28T21:53:24.785Z
Action: Help expand funding for AI Safety by coordinating on NSF response 2022-01-19T22:47:11.888Z
Promising posts on AF that have fallen through the cracks 2022-01-04T15:39:07.039Z

Comments

Comment by Evan R. Murphy on Evan R. Murphy's Shortform · 2025-06-11T21:06:25.793Z · LW · GW

Starting to be some discussion on LW now, e.g.

 https://www.lesswrong.com/posts/5uw26uDdFbFQgKzih/beware-general-claims-about-generalizable-reasoning

https://www.lesswrong.com/posts/tnc7YZdfGXbhoxkwj/give-me-a-reason-ing-model

Comment by Evan R. Murphy on Evan R. Murphy's Shortform · 2025-06-11T19:10:37.866Z · LW · GW

I should have mentioned the above thoughts are a low-confidence take. I was mostly just trying to get the ball rolling on discussion because I couldn't find any discussion of this paper on LessWrong yet, which really surprised me because I saw the paper had been shared thousands of times on LinkedIn already.

Comment by Evan R. Murphy on Evan R. Murphy's Shortform · 2025-06-10T20:35:36.020Z · LW · GW

Thoughts on "The Ilusion of Thinking" paper that came out of Apple recently?

https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf

Seems to me like at least a point in favor of "stochastic parrots" over "builds a quality world model" for the language reasoning models.

Also wondering if their findings could be used to the advantage of safety/security somehow. E.g. if these models are more dependent on imitating examples than we realized, then it might also be more effective than we previously thought to purge training data of the types of knowledge and reasoning that we don't want them to have (e.g. knowledge of dangerous weapons development, scheming, etc.).

Comment by Evan R. Murphy on Interpretability Will Not Reliably Find Deceptive AI · 2025-05-28T04:49:19.045Z · LW · GW

I agree it's a good post, and it does take guts to tell people when you think that a research direction that you've been championing hard actually isn't the Holy Grail. This is a bit of a nitpick but not insubstantial:

Neel is talking about interpretability in general, not just mech-interp. He says his predictions already account for other, non-mech-interp approaches to interpretability that some researchers find promising, such as representation engineering (RepE), which Dan Hendrycks among others has been advocating for recently.

Comment by Evan R. Murphy on E.G. Blee-Goldman's Shortform · 2025-05-26T18:21:58.515Z · LW · GW

Let me know if anyone has thoughts on this question I just posted as well: Does the Universal Geometry of Embeddings paper have big implications for interpretability?

Comment by Evan R. Murphy on Interpretability Will Not Reliably Find Deceptive AI · 2025-05-20T20:53:08.077Z · LW · GW

Does representation engineering (RepE) seem like a game-changer for interpretability? I don't see it mentioned in your post, so I'm trying to figure out if it is baked into your predictions or not.

It seemed like Apollo was able to spin up a pretty reliable strategic deception detector (95-99% accurate) using linear probes even though the techniques are new, and generally it sounds like RepE is getting traction on some things that have been a slog for mech interp. Does it look plausible that RepE could get us to high-reliability interpretability on workable timelines, or are we likely to hit similar walls with that approach?

Thanks for your post Neel (and Gemini 2.5) - really important perspective on all this.

Comment by Evan R. Murphy on Evan R. Murphy's Shortform · 2025-02-28T18:38:43.237Z · LW · GW

"AI governance looking bleak" seems like an overstatement. Certain types or aims of AI governance are looking bleak right now, especially getting strong safety-oriented international agreements that include the US and China, or meaningful AI regulation at the national level in the US. But there may be other sorts of AI governance projects (e.g. improving the policies of frontier labs, preparing for warning shots, etc.) that could still be quite worthwhile.

Comment by Evan R. Murphy on How to Make Superbabies · 2025-02-28T02:39:00.270Z · LW · GW

Is there a summary of this post?

Comment by Evan R. Murphy on Evan R. Murphy's Shortform · 2025-02-28T00:56:55.868Z · LW · GW

2023: AI governance starting to look promising because governments are waking up about AI risks. Technical AI safety getting challenging if you're not in a frontier lab, because it's hard to access relevant models to run experiments.

2025: AI governance looking bleak after the AI Action Summit. Technical AI safety looking more accessible because open-weight models are proliferating.

Comment by Evan R. Murphy on Detecting Strategic Deception Using Linear Probes · 2025-02-26T20:31:43.272Z · LW · GW

It might.

My understanding (which could be off base) from reading the paper is that the method's accuracy in detecting various forms of deception was basically 96-99%. But they acknowledge that the sophisticated deception they're ultimately worried about will be harder to detect.

Still, 96-99% seems like a great start. And this was on detecting strategic deception, not just factual falsehoods. And they didn't even utilize the CoT outputs of the models.

(I think the "strategic deception" framing is also probably more general and not as dependent on unnecessary assumptions about how models work, compared to the "mesaoptimizer" framing.)

Comment by Evan R. Murphy on Detecting Strategic Deception Using Linear Probes · 2025-02-26T19:54:27.324Z · LW · GW

I also suspect training a specific dataset for "was your last response indicative of your maximum performance on the task?" would significantly boost the accuracy in the sandbagging case for this type of probe. I see this becoming more important as we move to more realistic sandbagging settings without explicit instructions to perform poorly since "deceiving someone" and "not trying very hard to explore this area of research" seem like non-trivially different concepts.

Good point!

Comment by Evan R. Murphy on Training on Documents About Reward Hacking Induces Reward Hacking · 2025-01-22T19:45:55.540Z · LW · GW

Y'all are on fire recently with this and the alignment faking paper.

Comment by Evan R. Murphy on An Introduction to Representation Engineering - an activation-based paradigm for controlling LLMs · 2025-01-18T05:58:50.040Z · LW · GW

Thanks for the useful write-up on RepE.

RE might find application in Eliciting Latent Knowledge, like identifying what a model internally believes to be true.

Application to ELK is exciting. I was surprised that you used the word "might" because it looked like Zhou et al. (2023) have already built a lie and hallucination detector using RepE. What do you see as left to be done in this area to elicit model beliefs with RepE?

Taking a closer look, I did find this in the paper's 4.3.2 section, acknowledging some limitations:

While these observations enhance our confidence that our reading vectors correspond to dishonest thought processes and behaviors, they also introduce complexities into the task of lie detection. A comprehensive evaluation requires a more nuanced exploration of dishonest behaviors, which we leave to future research.

I suppose there may also be a substantial gap between detecting dishonest statements and eliciting true beliefs in the model, but I'm conjecturing. What are your thoughts?
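
As a rough picture of what "reading vectors" involve, here is one simple variant of the direction-extraction step behind RepE-style honesty reading. (The actual paper uses a more involved, PCA-based pipeline; the data and shapes below are illustrative stand-ins.)

```python
# Sketch of a difference-of-means "reading vector" for honesty: collect activations
# under contrastive honest vs. dishonest instructions, take the mean difference as a
# direction, and score new activations by projecting onto it. Illustrative data only.
import numpy as np

rng = np.random.default_rng(0)
d_model = 4096
honest_acts = rng.normal(size=(500, d_model))     # activations under honest framings
dishonest_acts = rng.normal(size=(500, d_model))  # activations under dishonest framings

reading_vector = honest_acts.mean(axis=0) - dishonest_acts.mean(axis=0)
reading_vector /= np.linalg.norm(reading_vector)

def honesty_score(activation: np.ndarray) -> float:
    """Project an activation onto the honesty direction; higher = more 'honest'."""
    return float(activation @ reading_vector)

print(honesty_score(rng.normal(size=d_model)))
```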

Comment by Evan R. Murphy on Alignment Faking in Large Language Models · 2025-01-18T05:01:03.495Z · LW · GW

Agree, I'm surprised that a model which can reason about its own training process wouldn't also reason that the "secret scratchpad" might actually be surveilled and so avoid recording any controversial thoughts there. But it's lucky for us that some of these models have been willing to write interesting things on the scratchpad at least at current capability levels and below, because Anthropic has sure produced some interesting results from it (IIRC they used the scratchpad technique in at least one other paper).

Comment by Evan R. Murphy on Alignment Faking in Large Language Models · 2025-01-18T03:29:18.433Z · LW · GW
  • Develop an architecture which has very little opaque reasoning (e.g. not much more than we see in current LLMs) but is sufficiently competitive using legible reasoning (e.g. CoT)


Don't you think CoT seems quite flawed right now? From https://arxiv.org/abs/2305.04388: "Building more transparent and explainable systems will require either improving CoT faithfulness through targeted efforts or abandoning CoT in favor of alternative methods."

Thanks for your great paper on alignment faking, by the way.

Comment by Evan R. Murphy on Applying refusal-vector ablation to a Llama 3 70B agent · 2024-10-21T17:13:28.761Z · LW · GW

Exposing the weaknesses of fine-tuned models like the Llama 3.1 Instruct models against refusal vector ablation is important, because the industry currently seems to over-rely on these safety techniques.

It's worth noting that refusal vector ablation isn't even necessary for this sort of malicious use with Llama 3.1, though, because Meta also released the base pretrained models without instruction finetuning (unless I'm misunderstanding something?).

Saw that you have an actual paper on this out now. Didn't see it linked in the post so here's a clickable for anyone else looking: https://arxiv.org/abs/2410.10871 .
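
For anyone unfamiliar with the technique under discussion, here is a schematic of what refusal-vector ablation amounts to, assuming you already have the relevant layer activations in hand (illustrative shapes and random data, not the authors' code):

```python
# Schematic of refusal-vector ablation: estimate a "refusal direction" from the
# difference in mean activations on refused vs. benign prompts, then project that
# direction out of the residual stream at inference time. Illustrative data only.
import numpy as np

rng = np.random.default_rng(0)
d_model = 8192
refused_acts = rng.normal(size=(256, d_model))   # activations on prompts the model refuses
benign_acts = rng.normal(size=(256, d_model))    # activations on harmless prompts

refusal_dir = refused_acts.mean(axis=0) - benign_acts.mean(axis=0)
refusal_dir /= np.linalg.norm(refusal_dir)

def ablate_refusal(residual: np.ndarray) -> np.ndarray:
    """Remove the refusal component from a residual-stream activation."""
    return residual - (residual @ refusal_dir) * refusal_dir

patched = ablate_refusal(rng.normal(size=d_model))
print(abs(patched @ refusal_dir))  # ~0: the refusal component is gone
```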

Comment by Evan R. Murphy on Creating unrestricted AI Agents with Command R+ · 2024-10-21T16:38:34.295Z · LW · GW

Thanks for working on this. In case anyone else is looking for a paper on this, I found https://arxiv.org/abs/2410.10871 from the OP which looks like a similar but more up-to-date investigation on Llama 3.1 70B.

Comment by Evan R. Murphy on Newsom Vetoes SB 1047 · 2024-10-03T18:02:21.876Z · LW · GW

I only see bad options, a choice between an EU-style regime and doing essentially nothing.

What issues do you have with the EU approach? (I assume you mean the EU AI Act.)

Thoughtful/informative post overall, thanks.

Comment by Evan R. Murphy on Simple probes can catch sleeper agents · 2024-06-14T23:45:20.467Z · LW · GW

Wow this seems like a really important breakthrough.

Are defection probes also a solution to the undetectable backdoor problem from Goldwasser et al. 2022?

Comment by Evan R. Murphy on Bing Chat is blatantly, aggressively misaligned · 2024-04-25T20:23:10.486Z · LW · GW

Thanks, I think you're referring to:

It may still be possible to harness the larger model capabilities without invoking character simulation and these problems, by prompting or fine-tuning the models in some particular careful ways.

There were some ideas proposed in the paper "Conditioning Predictive Models: Risks and Strategies" by Hubinger et al. (2023). But since it was published over a year ago, I'm not sure if anyone has gotten far on investigating those strategies to see which ones could actually work. (I'm not seeing anything like that in the paper's citations.)

Comment by Evan R. Murphy on On green · 2024-04-03T16:07:38.907Z · LW · GW

Really fascinating post, thanks.

On green as according to black, I think there's an additional facet perhaps even more important than just the acknowledgment that sometimes we are too weak to succeed and so should conserve energy. Black, being strongly self-interested, will tend to cast aside virtues like generosity, honesty and non-harm except as means in the social games it plays to achieve other ends for itself. But self-interest tends to include a desire to reduce self-suffering. Green + white* (as I'm realizing this may be more a color combo than purely green) are more inclined to discover, e.g. through meditation/mindfulness, that aggression, deceit and other non-virtues actually produce self-suffering in the mind as a byproduct. So black is capable of embracing virtue as part of a more complete pursuit of self-interest.**

It may be that one of the most impactful things that green + white can do is get black to realize this fact, since black will tend to be powerful and successful in the world at promoting whatever it understands to be its self interest.

I haven't read your post on attunement yet, maybe you touch on this or related ideas there.

--

*You could argue this also includes blue and so should be green + white + blue, since it largely deals with knowledge of self.

**I believe this fact of non-virtue inflicting self-suffering is true for most human minds. However, there may be cases where a person has some sort of psychological disorder that makes them effectively lack a conscience where it wouldn't hold.

Comment by Evan R. Murphy on On Devin · 2024-03-22T05:52:06.961Z · LW · GW

But in this case Patrick Collison is a credible source and he says otherwise.

Patrick Collison: These aren’t just cherrypicked demos. Devin is, in my experience, very impressive in practice

Patrick is an investor in Cognition. So while he may still be credible in this case, he also has a conflict of interest.

Comment by Evan R. Murphy on Sam Altman's ouster at OpenAI was precipitated by letter to board about AI breakthrough - Reuters · 2023-11-30T00:18:28.465Z · LW · GW

Reading that page, The Verge's claim seems to all hinge on this part:

OpenAI spokesperson Lindsey Held Bolton refuted that notion in a statement shared with The Verge: “Mira told employees what the media reports were about but she did not comment on the accuracy of the information."

They are saying that Bolton "refuted" the notion about such a letter, but the quote from her that follows doesn't actually sound like a refutation. Hence the Verge piece seems confusing/misleading, and I haven't yet seen any credible denial from the board about receiving such a letter.

Comment by Evan R. Murphy on Possible OpenAI's Q* breakthrough and DeepMind's AlphaGo-type systems plus LLMs · 2023-11-23T05:54:30.252Z · LW · GW

Yes though I think he said this at APEC right before he was fired (not after).

Comment by Evan R. Murphy on UFO Betting: Put Up or Shut Up · 2023-07-27T01:38:21.284Z · LW · GW

Carl, have you written somewhere about why you are confident that all UFOs so far are prosaic in nature? Would be interested to read/listen to your thoughts on this. (Alternatively, a link to some other source that you find gives a particularly compelling explanation is also good.)

Comment by Evan R. Murphy on My understanding of Anthropic strategy · 2023-07-18T01:38:08.150Z · LW · GW

Great update from Anthropic on giving majority control of the board to a financially disinterested trust: https://twitter.com/dylanmatt/status/1680924158572793856

Comment by Evan R. Murphy on Instrumental Convergence? [Draft] · 2023-06-15T16:38:17.750Z · LW · GW

Interesting... still taking that in.

Related question: Doesn't goal preservation typically imply self preservation? If I want to preserve my goal, and then I perish, I've failed because now my goal has been reassigned from X to nil.

Comment by Evan R. Murphy on Instrumental Convergence? [Draft] · 2023-06-15T05:57:22.383Z · LW · GW

Love to see an orthodoxy challenged!

Suppose Sia's only goal is to commit suicide, and she's given the opportunity to kill herself straightaway. Then, it certainly won't be rational for her to pursue self-preservation.

It seems you found one terminal goal which doesn't give rise to the instrumental subgoal of self-preservation. Are there others, or does basically every terminal goal benefit from instrumental self-preservation except for suicide?

(I skipped around a bit and didn't read your full post, so maybe you explain this already and I missed it.)

Comment by Evan R. Murphy on Intelligence Officials Say U.S. Has Retrieved Craft of Non-Human Origin · 2023-06-14T19:40:42.134Z · LW · GW

But if there really is a large number of intelligence officials earnestly coming forward with this

Yea, according to Michael Shellenberger's reporting on this, multiple "high-ranking intelligence officials, former intelligence officials, or individuals who we could verify were involved in U.S. government UAP efforts for three or more decades each" have come forward to vouch for Grusch's core claims.

Perhaps this is genuine whistleblowing, but not on what they make it sound like? Suppose there's something being covered up that Grusch et al. want to expose, but describing what it is plainly is inconvenient for one reason or another. So they coordinate around the wacky UFO story, with the goal being to point people in the rough direction of what they want looked at.

Interesting theory. Definitely a possibility.

Comment by Evan R. Murphy on Michael Shellenberger: US Has 12 Or More Alien Spacecraft, Say Military And Intelligence Contractors · 2023-06-11T03:52:16.482Z · LW · GW

What matters is the hundreds of pages and photos and hours of testimony given under oath to the Intelligence Community Inspector General and Congress.

Did Grusch already testify to Congress? I thought that was still being planned.

Comment by Evan R. Murphy on Dealing with UFO claims · 2023-06-11T01:22:50.310Z · LW · GW

Re: the tweet thread you linked to. One of the tweets is:

  1. Given that the DoD was effectively infiltrated for years by people "contracting" for the government while researching dino-beavers, there are now a ton of "insiders" who can "confirm" they heard the same outlandish rumors, leading to stuff like this: [references Michael Schellenberger]

Maybe, but this doesn't add up to me, because Shellenberger said his sources had had careers of multiple decades in the government agencies. It didn't sound like they just started their careers as contractors in 2008-2012.

Link to post with Shellenberger article details: https://www.lesswrong.com/posts/bhH2BqF3fLTCwgjSs/michael-shellenberger-us-has-12-or-more-alien-spacecraft-say

Comment by Evan R. Murphy on Intelligence Officials Say U.S. Has Retrieved Craft of Non-Human Origin · 2023-06-11T01:00:17.174Z · LW · GW

I guess the fact that this journalist says multiple other intelligence officials are anonymously vouching for Grusch's claims makes it interesting again: https://www.lesswrong.com/posts/bhH2BqF3fLTCwgjSs/michael-shellenberger-us-has-12-or-more-alien-spacecraft-say#comments

Comment by Evan R. Murphy on Intelligence Officials Say U.S. Has Retrieved Craft of Non-Human Origin · 2023-06-11T00:40:10.575Z · LW · GW

Wow that's awfully indirect. I'm surprised his speaking out is much of a story given this.

Comment by Evan R. Murphy on Intelligence Officials Say U.S. Has Retrieved Craft of Non-Human Origin · 2023-06-11T00:16:12.357Z · LW · GW

I don't know much about astronomy. But is it possible a more advanced alien civ has colonized much of the galaxy, but we haven't seen them because they anticipated the tech we would be using to make astronomical observations and know how to cloak from it?

Comment by Evan R. Murphy on Intelligence Officials Say U.S. Has Retrieved Craft of Non-Human Origin · 2023-06-09T15:17:00.541Z · LW · GW

The Guardian has been covering this story: https://www.theguardian.com/world/2023/jun/06/whistleblower-ufo-alien-tech-spacecraft

Comment by Evan R. Murphy on [deleted post] 2023-06-05T20:53:04.007Z

I wasn't saying that there are only a few research directions that don't require frontier models, period; just that there are only a few that both don't require frontier models and still seem relevant/promising, at least assuming short timelines to AGI.

I am skeptical that agent foundations is still very promising or relevant in the present situation. I wouldn't want to shut down someone's research in this area if they were particularly passionate about it or considered themselves on the cusp of an important breakthrough. But I'm not sure it's wise to be spending scarce incubator resources to funnel new researchers into agent foundations research at this stage.

Good points about mechanistic anomaly detection and activation additions though! (And mechanistic interpretability, but I mentioned that in my previous comment.) I need to read up more on activation additions.

Comment by Evan R. Murphy on [deleted post] 2023-06-05T20:47:33.752Z
Comment by Evan R. Murphy on Is Deontological AI Safe? [Feedback Draft] · 2023-06-01T19:28:51.894Z · LW · GW

Thanks for reviewing it! Yea of course you can use it however you like!

Comment by Evan R. Murphy on The Office of Science and Technology Policy put out a request for information on A.I. · 2023-05-30T18:06:06.526Z · LW · GW

Great idea. We need to make sure there are some submissions raising existential risks.

Deadline for the RFI: July 7, 2023 at 5:00pm ET

Comment by Evan R. Murphy on Is Deontological AI Safe? [Feedback Draft] · 2023-05-30T00:43:33.389Z · LW · GW

Would you agree with this summary of your post? I was interested in your post, but I didn't see a summary and didn't have time to read the whole thing just now. So I generated this using a summarizer script I've been working on for articles that are longer than the context windows of gpt-3.5-turbo and gpt-4.
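
(In case anyone is curious how such a script can work: the basic recipe is to split the article into chunks that fit the context window, summarize each chunk, and then summarize the concatenated chunk summaries. A minimal sketch, with `chat` as a hypothetical helper standing in for whatever LLM API you use:)

```python
# Minimal sketch of summarizing an article longer than the model's context window:
# chunk the text, summarize each chunk, then summarize the summaries.
# `chat` is a hypothetical helper standing in for an actual LLM API call.
def chat(prompt: str) -> str:
    raise NotImplementedError("wrap your preferred LLM API here")

def chunk_text(text: str, max_chars: int = 8000) -> list[str]:
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def summarize_long_article(article: str) -> str:
    chunk_summaries = [
        chat(f"Summarize this section of an article:\n\n{chunk}")
        for chunk in chunk_text(article)
    ]
    combined = "\n\n".join(chunk_summaries)
    return chat(f"Combine these section summaries into one coherent article summary:\n\n{combined}")
```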

It's a pretty interesting thesis you have if this is right, but I wanted to check if you spotted any glaring errors:

In this article, the author examines the challenges of aligning artificial intelligence (AI) with deontological morality as a means to ensure AI safety. Deontological morality, a popular ethical theory among professional ethicists and the general public, focuses on adhering to rules and duties rather than achieving good outcomes. Despite its strong harm-avoidance principles, the author argues that deontological AI may pose unique safety risks and is not a guaranteed path to safe AI.

The author explores three prominent forms of deontology: moderate views based on harm-benefit asymmetry principles, contractualist views based on consent requirements, and non-aggregative views based on separateness-of-persons considerations. The first two forms can lead to anti-natalism and similar conclusions, potentially endangering humanity if an AI system is aligned with such theories. Non-aggregative deontology, on the other hand, lacks meaningful safety features.

Deontological morality, particularly harm-benefit asymmetry principles, may make human extinction morally appealing, posing an existential threat if a powerful AI is aligned with these principles. The author discusses various ways deontological AI could be dangerous, including anti-natalist arguments, which claim procreation is morally unacceptable, and the paralysis argument, which suggests that it is morally obligatory to do as little as possible due to standard deontological asymmetries.

The author concludes that deontological morality is not a reliable path to AI safety and that avoiding existential catastrophes from AI is more challenging than anticipated. It remains unclear which approach to moral alignment would succeed if deontology fails to ensure safety. The article highlights the potential dangers of AI systems aligned with deontological ethics, especially in scenarios involving existential risks, such as an AI system aligned with anti-natalism that may view sterilizing all humans as permissible to prevent potential harm to new lives.

Incorporating safety-focused principles as strict, lexically first-ranked duties may help mitigate these risks, but balancing the conflicting demands of deontological ethics and safety remains a challenge. The article emphasizes that finding a reasonable way to incorporate absolute prohibitions into a broader decision theory is a complex problem requiring further research. Alternative ethical theories, such as libertarian deontology, may offer better safety assurances than traditional deontological ethics, but there is no simple route to AI safety within the realm of deontological ethics.

Comment by Evan R. Murphy on [deleted post] 2023-05-30T00:16:28.543Z

A couple of quick thoughts:

  • Very glad to see someone trying to provide more infrastructure and support for independent technical alignment researchers. Wishing you great success and looking forward to hearing how your project develops.
  • A lot of promising alignment research directions now seem to require access to cutting-edge models. A couple of ways you might deal with this could be:
    • Partner with AI labs to help get your researchers access to their models
    • Or focus on some of the few research directions such as mechanistic interpretability that still seem to be making useful progress on smaller, more accessible models
Comment by Evan R. Murphy on More information about the dangerous capability evaluations we did with GPT-4 and Claude. · 2023-05-26T23:51:14.169Z · LW · GW

We're working on a more thorough technical report.

Is the new Model evaluation for extreme risks paper the technical report you were referring to?

Comment by Evan R. Murphy on "notkilleveryoneism" sounds dumb · 2023-05-01T18:41:00.613Z · LW · GW

A few other possible terms to add to the brainstorm:

  • AI massive catastrophic risks
  • AI global catastrophic risks
  • AI catastrophic misalignment risks
  • AI catastrophic accident risks (paired with "AI catastrophic misuse risks")
  • AI weapons of mass destruction (WMDs) - Pro: a well-known term, Con: strongly connotes misuse so may be useful for that category but probably confusing to try and use for misalignment risks
Comment by Evan R. Murphy on [deleted post] 2023-04-26T20:40:48.982Z

As an aside, if you are located in Australia or New Zealand and would be interested in coordinating with me, please contact me through LessWrong on this account.

One potential source of leads for this might be the FLI Pause Giant AI Experiments open letter. I did a Ctrl+F search there for "Australia", which had 50+ results, and "New Zealand", which had 10+. So you might find some good people to connect with on there.

Comment by Evan R. Murphy on [deleted post] 2023-04-26T20:36:12.164Z

Upvoted. I think it's definitely worth pursuing well-thought-out advocacy in countries besides the US and China. Especially since this can be done in parallel with efforts in those countries.

A lot of people are working on the draft EU AI Act in Europe.

In Canada, parliament is considering Bill C-27 which may have a significant AI component. I do some work with an org called AIGS that is trying to help make that go well.

I'm glad to hear that some projects are underway in Australia and New Zealand and that you are pursuing this there!

Comment by Evan R. Murphy on WHO Biological Risk warning · 2023-04-25T19:32:09.858Z · LW · GW

Seems important. I'm guessing people are downvoting this because they consider it a possible infohazard.

Comment by Evan R. Murphy on Scaffolded LLMs as natural language computers · 2023-04-24T21:27:34.707Z · LW · GW

Post summary

I was interested in your post and noticed it didn't have a summary, so I generated one using a summarizer script I've been working on and iteratively improving:

Scaffolded large language models (LLMs) have emerged as a new type of general-purpose natural language computer. With the advent of GPT-4, these systems have become viable at scale, wrapping a programmatic scaffold around an LLM core to achieve complex tasks. Scaffolded LLMs resemble the von Neumann architecture, operating on natural language text rather than bits.

The LLM serves as the CPU, while the prompt and context function as RAM. The memory in digital computers is analogous to the vector database memory of scaffolded LLMs. The scaffolding code surrounding the LLM core implements protocols for chaining individual LLM calls, acting as the "programs" that run on the natural language computer.

Performance metrics for natural language computers include context length (RAM) and Natural Language OPerations (NLOPs) per second. Exponential improvements in these metrics are expected to continue for the next few years, driven by the increasing scale and cost of LLMs and their training runs.

Programming languages for natural language computers are in their early stages, with primitives like Chain of Thought, Selection-Inference, and Reflection serving as assembly languages. As LLMs improve and become more reliable, better abstractions and programming languages will emerge.

The execution model of natural language computers is an expanding Directed Acyclic Graph (DAG) of parallel NLOPs, resembling a dataflow architecture. Memory hierarchy in scaffolded LLMs currently has two levels, but as designs mature, additional levels may be developed.

Unlike digital computers, scaffolded LLMs face challenges in reliability, underspecifiability, and non-determinism. Improving the reliability of individual NLOPs is crucial for building powerful abstractions and abstract languages. Error correction mechanisms may be necessary to create coherent and consistent sequences of NLOPs.

Despite these challenges, the flexibility of LLMs offers great opportunities. The set of op-codes is not fixed but ever-growing, allowing for the creation of entire languages based on prompt templating schemes. As natural language programs become more sophisticated, they will likely delegate specific ops to the smallest and cheapest language models capable of reliably performing them.

If you have feedback on the quality of this summary, you can easily indicate that using LessWrong's agree/disagree voting.

Comment by Evan R. Murphy on Capabilities and alignment of LLM cognitive architectures · 2023-04-24T17:05:11.273Z · LW · GW

Post summary (experimental)

Here's an alternative summary of your post, complementing your TL;DR and Overview. This is generated by my summarizer script utilizing gpt-3.5-turbo and gpt-4. (Feedback welcome!)

The article explores the potential of language model cognitive architectures (LMCAs) to enhance large language models (LLMs) and accelerate progress towards artificial general intelligence (AGI). LMCAs integrate and expand upon approaches from AutoGPT, HuggingGPT, Reflexion, and BabyAGI, adding goal-directed agency, executive function, episodic memory, and sensory processing to LLMs. The author contends that these cognitive capacities will enable LMCAs to perform extensive, iterative, goal-directed "thinking" that incorporates topic-relevant web searches, thus increasing their effective intelligence.

While the acceleration of AGI timelines may be a downside, the author suggests that the natural language alignment (NLA) approach of LMCAs, which reason about and balance ethical goals similarly to humans, offers significant benefits compared to existing AGI and alignment approaches. The author also highlights the strong economic incentives for LMCAs, as computational costs are low for cutting-edge innovation, and individuals, small and large businesses are likely to contribute to progress. However, the author acknowledges potential difficulties and deviations in the development of LMCAs.

The article emphasizes the benefits of incorporating episodic memory into language models, particularly for decision-making and problem-solving. Episodic memory enables the recall of past experiences and strategies, while executive function focuses attention on relevant aspects of the current problem. The interaction between these cognitive processes can enhance social cognition, self-awareness, creativity, and innovation. The article also addresses the limitations of current episodic memory implementations in language models, which are limited to text files. However, it suggests that vector space search for episodic memory is possible, and language can encode multimodal information. The potential for language models to call external software tools, providing access to nonhuman cognitive abilities, is also discussed.

The article concludes by examining the implications of the NLA approach for alignment, corrigibility, and interpretability. Although not a complete solution for alignment, it is compatible with a hodgepodge alignment strategy and could offer a solid foundation for self-stabilizing alignment. The author also discusses the potential societal alignment problem arising from the development of LLMs with access to powerful open-source agents. While acknowledging LLMs' potential benefits, the author argues for planning against Moloch (a metaphorical entity representing forces opposing collective good) and accounting for malicious and careless actors. Top-level alignment goals should emphasize corrigibility, interpretability, harm reduction, and human empowerment/flourishing. The author also raises concerns about the unknown mechanisms of LLMs and the possibility of their output becoming deceptively different from the internal processing that generates it. The term x-risk AI (XRAI) is proposed to denote AI with a high likelihood of ending humanity. The author also discusses the principles of executive function and their relevance to LLMs, the importance of dopamine response in value estimation, and the challenges of ensuring corrigibility and interpretability in LMCA goals. In conclusion, the author suggests that while LLM development presents a wild ride, there is a fighting chance to address the potential societal alignment problem.

I may follow up with an object-level comment on your post, as I'm finding it super interesting but still digesting the content. (I am actually reading it and not just consuming this programmatic summary :)

Comment by Evan R. Murphy on Towards a solution to the alignment problem via objective detection and evaluation · 2023-04-24T15:39:43.556Z · LW · GW

Less compressed summary

Here's a longer summary of your article generated by the latest version of my summarizer script:

In this article, Paul Colognese explores whether detecting and evaluating the objectives of advanced AI systems during training and deployment is sufficient to solve the alignment problem. The idealized approach presented in the article involves detecting all objectives/intentions of any system produced during the training process, evaluating whether the outcomes produced by a system pursuing a set of objectives will be good/bad/irreversibly bad, and shutting down a system if its objectives lead to irreversibly bad outcomes.

The alignment problem for optimizing systems is defined as needing a method of training/building optimizing systems such that they never successfully pursue an irreversibly bad objective during training or deployment and pursue good objectives while rarely pursuing bad objectives. The article claims that if an overseer can accurately detect and evaluate all of the objectives of optimizing systems produced during the training process and during deployment, the overseer can prevent bad outcomes caused by optimizing systems pursuing bad objectives.

Robustly detecting an optimizing system’s objectives requires strong interpretability tools. The article discusses the problem of evaluating objectives and some of the difficulties involved. The role of interpretability is crucial in this approach, as it allows the overseer to make observations that can truly distinguish between good systems and bad-but-good-looking systems.

Detecting all objectives in an optimizing system is a challenging task, and even if the overseer could detect all of the objectives, it might be difficult to accurately predict whether a powerful optimizing system pursuing those objectives would result in good outcomes or not. The article suggests that with enough understanding of the optimizing system’s internals, it might be possible to directly translate from the internal representation of the objective to a description of the relevant parts of the corresponding outcome.

The article concludes by acknowledging that the proposed solution seems difficult to implement in practice, but pursuing this direction could lead to useful insights. Further conceptual and empirical investigation is suggested to better understand the feasibility of this approach in solving the alignment problem.

Comment by Evan R. Murphy on No Summer Harvest: Why AI Development Won't Pause · 2023-04-21T22:39:53.427Z · LW · GW

My claim is that AI safety isn't part of the Chinese gestalt.

Stuart Russell claims that Xi Jinping has referred to the existential threat of AI to humanity [1].

[1] 5:52 of Russell's interview on Smerconish: https://www.cnn.com/videos/tech/2023/04/01/smr-experts-demand-pause-on-ai.cnn