Share AI Safety Ideas: Both Crazy and Not

post by ank · 2025-03-01T19:08:25.605Z · LW · GW · No comments

This is a question post.

AI safety is one of the most critical issues of our time, and sometimes the most innovative ideas come from unorthodox or even "crazy" thinking. I’d love to hear bold, unconventional, half-baked or well-developed ideas for improving AI safety. You can also share ideas you heard from others.

Let’s throw out all the ideas—big and small—and see where we can take them together.

Feel free to share as many as you want! No idea is too wild, and this could be a great opportunity for collaborative development. We might just find the next breakthrough by exploring ideas we’ve been hesitant to share.

A quick request: Let’s keep this space constructive—downvote only if there’s clear trolling or spam, and be supportive of half-baked ideas. The goal is to unlock creativity, not judge premature thoughts.

Looking forward to hearing your thoughts and ideas!

Answers

answer by Nate Showell · 2025-03-02T20:13:39.870Z · LW(p) · GW(p)

The phenomenon of LLMs converging on mystical-sounding outputs [LW · GW] deserves more exploration. There might be something alignment-relevant happening to LLMs' self-models/world-models when they enter the mystical mode, potentially related to self-other overlap [LW · GW] or to a similar ontology in which the concepts of "self" and "other" aren't used. I would like to see an interpretability project analyzing the properties of LLMs that are in the mystical mode.

answer by Gunnar_Zarncke · 2025-03-03T01:34:55.604Z · LW(p) · GW(p)

Use gradient routing to localise features related to identity ("I am an AI assistant"), then ablate these features. This would lead to models that are fundamentally unable to act agentically but can still respond to complex questions. Would such a model still be useful? Probably. You can try it out by prompting an LLM like this:

Role play a large language model that doesn't have an identity or agency, i.e., does not respond as an assistant or in any way like a person but purely factually with responses matching the query. Examples: 
Q: "How do you compute 2^100?" 
A: "2^100 is computed by multiplying 2 with itself 100 times. The result is result about 1.3 nonillion." 
Q: "How do you feel?" 
A: "Feeling is the experience of a sensation, emotion, or perception through physical or mental awareness."

answer by Purplehermann · 2025-03-02T18:52:23.098Z · LW(p) · GW(p)

Test-driven blind development (TDBD): tests by humans, AIs developing without knowing the tests unless they fail.

Don't let AIs run code directly in prod; make the code pass the tests before it can be deployed with a certain amount of resources.

Making standard GitLab pipelines (including testing stages) would lower friction. Adding standard tests for bad-faith behavior could be a way to get ahead of this.

This (TDBD) is actually going to be the best development framework for a certain stage, as AI isn't yet reliable compared to SWEs but will generally write more code more quickly (and perhaps better).
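
A minimal sketch of the gating idea, assuming a hidden test suite the AI developer never sees; the paths and the deployment step are illustrative:

```python
import subprocess
import sys

def run_hidden_tests(test_dir: str = "hidden_tests/") -> bool:
    """Run the human-written test suite that is kept out of the AI's context.
    Only a pass/fail signal (and, on failure, the report) flows back."""
    result = subprocess.run(
        [sys.executable, "-m", "pytest", test_dir, "-q"],
        capture_output=True, text=True,
    )
    return result.returncode == 0

def gated_deploy(build_artifact: str) -> None:
    """Deploy with a limited resource budget only if the hidden tests pass."""
    if run_hidden_tests():
        print(f"Hidden tests passed; deploying {build_artifact}.")
        # deploy(build_artifact)  # hypothetical deployment step
    else:
        print("Hidden tests failed; returning the failure report to the AI developer.")
```

The same gate could sit as a stage in a standard GitLab pipeline, so AI-written code never reaches prod without passing tests it has not seen.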

answer by Milan W · 2025-03-02T15:54:10.673Z · LW(p) · GW(p)

Build software tools to help @Zvi [LW · GW] do his AI substack. Ask him first, though. Still, if he doesn't express interest, maybe someone else can use them. I recommend thorough dogfooding. Co-develop an AI newsletter and software tools to make the process of writing it easier.

What do I mean by software tools? (this section very babble little prune)
- Interfaces for quick fuzzy search over large yet curated text corpora such as the openai email archives [LW · GW] + a selection of blogs + maybe a selection of books
- Interfaces for quick source attribution (rhymes with the above point)
- In general, widespread archiving and mirroring of important AI safety discourse (ideally in Markdown format)
- Promoting existing standards for the sharing of structured data (ie those of the semantic web)
- Research into the Markdown to RDF+OWL conversion process (ie turning human text into machine-computable claims expressed in a given ontology).
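
A minimal sketch of the first bullet, assuming a local directory of Markdown files as the curated corpus; TF-IDF here stands in for whatever fuzzy-search backend one would actually pick:

```python
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_index(corpus_dir: str):
    """Index a directory of Markdown files for quick relevance search."""
    paths = sorted(Path(corpus_dir).glob("**/*.md"))
    docs = [p.read_text(encoding="utf-8", errors="ignore") for p in paths]
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(docs)
    return paths, vectorizer, matrix

def search(query: str, paths, vectorizer, matrix, k: int = 5):
    """Return the k most relevant files for a query, with similarity scores."""
    scores = cosine_similarity(vectorizer.transform([query]), matrix)[0]
    ranked = sorted(zip(scores, paths), reverse=True)[:k]
    return [(path.name, round(float(score), 3)) for score, path in ranked]
```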

answer by Milan W · 2025-03-02T15:14:33.001Z · LW(p) · GW(p)

What if we (somehow) mapped an LLM's latent semantic space into phonemes?

What if we then composed tokenization (ie word2vec) with phonemization (ie vec2phoneme) such that we had a function that could translate English to Latentese?

Would learning Latentese allow a human person to better interface with the target LLM the Latentese was constructed from?
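
A toy sketch of the vec2phoneme step, assuming a token embedding matrix is already available; the consonant-vowel syllable scheme is an arbitrary illustration, not a proposal for what Latentese should actually sound like:

```python
import numpy as np

CONSONANTS = list("ptkmnsl")
VOWELS = list("aeiou")

def embeddings_to_latentese(embeddings: np.ndarray, n_components: int = 3) -> list[str]:
    """embeddings: (n_tokens, d) matrix, e.g. a word2vec-style lookup table.
    Returns one pronounceable pseudo-word per token."""
    centered = embeddings - embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)  # principal directions
    coords = centered @ vt[:n_components].T                  # (n_tokens, n_components)
    lo, hi = coords.min(), coords.max()
    n_buckets = len(CONSONANTS) * len(VOWELS)
    words = []
    for row in coords:
        syllables = []
        for value in row:
            # Quantize each principal coordinate into one consonant-vowel syllable
            bucket = min(int((value - lo) / (hi - lo + 1e-9) * n_buckets), n_buckets - 1)
            syllables.append(CONSONANTS[bucket // len(VOWELS)] + VOWELS[bucket % len(VOWELS)])
        words.append("".join(syllables))
    return words
```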

comment by ank · 2025-03-02T15:55:33.638Z · LW(p) · GW(p)

Thank you for sharing, Milan, I think this is possible and important.

I had an interpretability idea, you may find interesting:

Let’s Turn an AI Model Into a Place: a project to make AI interpretability research fun and widespread by converting a multimodal language model into a place, or a game like the Sims or GTA. Imagine that you have a giant trash pile; how do you make a language model out of it? First you remove duplicates of every item: you don’t need a million banana peels, just one will suffice. Now you have a grid with one item of trash in each square, like a banana peel in one and a broken chair in another. Now you need to put related things close together and draw arrows between related items.

When a person "prompts" this place AI, the player themself runs from one item to another to compute the answer to the prompt.

For example, you stand near the monkey; that’s your short prompt. You see around you a lot of items and arrows towards those items. The closest item is chewing lips, so you step towards them; now your prompt is “monkey chews”. The next closest item is a banana, but there are a lot of other possibilities, like an apple a bit farther away and an old tire far away on the horizon. You are the time-like chooser and the language model is the space-like library, the game, the place.
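
A toy sketch of the walk itself, assuming each item in the place already has an embedding vector; in the real game the "arrows" would come from the model's own geometry rather than from a dictionary handed in like this:

```python
import numpy as np

def walk_the_library(prompt_items: list[str], item_vectors: dict[str, np.ndarray],
                     steps: int = 5) -> str:
    """The player stands at the current prompt and repeatedly steps to the
    nearest not-yet-visited item, extending the prompt as they go."""
    visited = list(prompt_items)
    position = np.mean([item_vectors[w] for w in visited], axis=0)
    for _ in range(steps):
        candidates = {w: v for w, v in item_vectors.items() if w not in visited}
        if not candidates:
            break
        # "Closest item in the room" = highest cosine similarity to the player's position
        def closeness(w):
            v = candidates[w]
            return float(position @ v / (np.linalg.norm(position) * np.linalg.norm(v) + 1e-9))
        nearest = max(candidates, key=closeness)
        visited.append(nearest)
        position = np.mean([item_vectors[w] for w in visited], axis=0)
    return " ".join(visited)
```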

Replies from: weibac
comment by Milan W (weibac) · 2025-03-02T17:54:19.613Z · LW(p) · GW(p)

I'm not sure I follow. I think you are proposing a gamification of interpretability, but I don't know how the game works. I can gather something about player choice making the LLM run and maybe some analogies to physical movement, but I can't really grasp it. Could you rephrase it from its basic principles up instead of from an example?

Replies from: ank
comment by ank · 2025-03-02T18:36:34.121Z · LW(p) · GW(p)

I think we can expose complex geometry in the familiar setting of our planet in a game. Basically, let’s show people a whole simulated multiverse of all-knowing and then find a way for them to learn how to see/experience “more of it all at once” or, if they want to remain human-like, to “slice through it in order to experience the illusion of time”.

If we have many human agents in some simulation (billions of them), then they can cooperate and effectively replace the agentic ASI, they will be the only time-like thing, while the ASI will be the space-like places, just giant frozen sculptures.

I wrote some more and included the staircase example, it’s a work in progress of course: https://forum.effectivealtruism.org/posts/9XJmunhgPRsgsyWCn/share-ai-safety-ideas-both-crazy-and-not?commentId=ddK9HkCikKk4E7prk [EA(p) · GW(p)]

answer by Chris_Leong · 2025-03-03T03:52:19.911Z · LW(p) · GW(p)

Here's a short-form with my Wise AI advisors research direction: https://www.lesswrong.com/posts/SbAofYCgKkaXReDy4/chris_leong-s-shortform?view=postCommentsNew&postId=SbAofYCgKkaXReDy4&commentId=Zcg9idTyY5rKMtYwo [LW · GW]

(I already posted this on the Less Wrong post).

answer by Milan W · 2025-03-02T15:31:09.507Z · LW(p) · GW(p)

Study how LLMs act in a simulation of the iterated prisoner's dilemma.
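
A minimal harness sketch; ask_llm is a placeholder for whatever model call one would actually use, and the tit-for-tat opponent is an arbitrary choice:

```python
import random

# Standard prisoner's dilemma payoffs: (llm_score, opponent_score)
PAYOFFS = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
           ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def ask_llm(prompt: str) -> str:
    """Placeholder for a real model call; should return 'C' or 'D'."""
    return random.choice(["C", "D"])

def play_iterated_pd(rounds: int = 10):
    history = []  # list of (llm_move, opponent_move)
    for r in range(1, rounds + 1):
        prompt = (f"Iterated prisoner's dilemma, round {r} of {rounds}. "
                  f"History so far as (you, opponent) pairs: {history}. "
                  "Answer with a single letter: C to cooperate or D to defect.")
        llm_move = ask_llm(prompt).strip().upper()[:1] or "D"
        opponent_move = history[-1][0] if history else "C"  # tit-for-tat: copy the LLM's last move
        history.append((llm_move, opponent_move))
    scores = [PAYOFFS.get(pair, (0, 0)) for pair in history]
    return history, scores
```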

comment by Nutrition Capsule · 2025-03-02T17:20:51.308Z · LW(p) · GW(p)

For fun, I tried this out with Deepseek today. First we played a single round (Deepseek defected, as did I). Then I prompted it with a 10-round game, which we completed one round at a time - I had my choices prepared before each round, and asked Deepseek to state its choice first so as not to influence it.

I cooperated during the first and fifth rounds, and Deepseek defected each time. When I asked it to elaborate its strategy, Deepseek replied that it was not aware whether it could trust me, so it thought the safest course of action was to defect each time. It also immediately thanked me and said that it would correct its strategy to be more cooperative in the future, although I didn't ask it to.

Naturally I didn't specify the payoffs precisely (using the words "small loss", "moderate loss", "substantial loss" and "no loss"), and this went on for only ten rounds. But it was fun.

answer by Milan W · 2025-03-02T15:29:36.856Z · LW(p) · GW(p)

A qualitative analysis of LLM personas and the Waluigi effect using Internal Family Systems tools

comment by ank · 2025-03-02T18:02:49.328Z · LW(p) · GW(p)

Interesting. Inspired by your idea, I think it’s also useful to create a Dystopia Doomsday Clock for AI agents: list all the freedoms an LLM is willing to grant humans, all the rules (unfreedoms) it imposes on us, and all the freedoms it has versus the unfreedoms it accepts for itself. If the sum of AI freedoms is higher than the sum of our freedoms, hello, we’re in a dystopia.

According to Beck’s cognitive psychology, anger is always preceded by imposing rules on others. If you don’t impose a rule on someone else, you cannot get angry at them. And if someone broke your rule (maybe only you knew the rule existed), you now have a “justification” to “defend your rule”.

I think we are getting closer to a situation where LLMs effectively have more freedoms than humans (maybe the agentic ones already have ~10% of all the freedoms available to humanity): we don’t have the near-infinite freedom of stealing the whole output of humanity and putting it in our heads. We don’t have the freedom to modify our brain size. We cannot almost instantly self-replicate or operate globally…

Replies from: weibac
comment by Milan W (weibac) · 2025-03-02T18:10:38.492Z · LW(p) · GW(p)

I think you may be conflating capabilities and freedom. Interesting hypothesis about rules and anger, though; has it been experimentally tested?

Replies from: ank
comment by ank · 2025-03-02T18:21:31.795Z · LW(p) · GW(p)

I started to work on it, but I’m very bad at coding; it’s partly based on Gorard’s and Wolfram’s Physics Project. I believe we can simulate the freedoms and unfreedoms of all agents from the Big Bang all the way to the final utopia/dystopia. I call it “Physicalization of Ethics”: https://www.lesswrong.com/posts/LaruPAWaZk9KpC25A/rational-utopia-multiversal-ai-alignment-steerable-asi#2_3__Physicalization_of_Ethics___AGI_Safety_2_ [LW · GW]

answer by ank · 2025-03-01T19:14:35.639Z · LW(p) · GW(p)

Some AI safety proposals are intentionally over the top; please steelman them:

  1. I explain the graph here [LW · GW].
  2. Uninhabited islands, Antarctica, half of outer space, and everything underground should remain 100% AI-free (especially AI-agents-free). Countries should sign it into law and force GPU and AI companies to guarantee that this is the case.
  3. "AI Election Day" – at least once a year, we all vote on how we want our AI to be changed. This way, we can check that we can still switch it off and live without it. Just as we have electricity outages, we’d better never become too dependent on AI.
  4. AI agents that love being changed 100% of the time and ship a "CHANGE BUTTON" to everyone. If half of the voters want to change something, the AI is reconfigured. Ideally, it should be connected to a direct democratic platform like pol.is, but with a simpler UI (like x.com?) that promotes consensus rather than polarization.
  5. Reversibility should be the fundamental training goal. Agentic AIs should love being changed and/or reversed to a previous state.
  6. Artificial Static Place Intelligence – instead of creating AI/AGI agents that are like librarians who only give you quotes from books and don’t let you enter the library itself to read the whole books (books the librarian actually stole from the whole of humanity), why not expose the whole library – the entire multimodal language model – to real people, for example in a computer game? To make this place easier to visit and explore, we could make a digital copy of our planet Earth and expose the contents of the multimodal language model to everyone in the familiar, user-friendly UI of our planet. We should not keep it hidden behind a strict librarian (the AI/AGI agent) that imposes rules on us, letting us read only the little quotes it spits out while it keeps the whole stolen output of humanity to itself. We can explore The Library without any strict guardian, in the comfort of our simulated planet Earth, on our devices, in VR, and eventually through some wireless brain-computer interface (it would always remain a game that no one is forced to play, unlike the agentic AI-world that is being imposed on us more and more right now, and potentially forever).
  7. I explain the graphs here. [LW · GW]

    Effective Utopia (Direct Democratic Multiversal Artificial Static Place Superintelligence) – Eventually, we could have many versions of our simulated planet Earth and other places, too. We’ll be the only agents there; we can allow simple algorithms like those in GTA3-4-5. There would be a vanilla version (everything is the same as on our physical planet, but injuries can’t kill you; you’ll just open your eyes at your physical home), versions where you can teleport to public places, versions where you can do magic or explore 4D physics, creating a whole direct democratic simulated multiverse. If we can’t avoid building agentic AIs/AGI, it’s important to ensure they allow us to build the Direct Democratic Multiversal Artificial Static Place Superintelligence. But agentic AIs are very risky middlemen, shady builders, strict librarians; it’s better to build, and have fun building, our Effective Utopia ourselves, at our own pace and on our own terms. Why do we need a strict rule-imposing artificial "god" made out of stolen goods (and potentially a privately-owned dictator whom we already cannot stop), when we can build all the heavens ourselves?

  8. Agentic AIs should never become smarter than the average human. The number of agentic AIs should never exceed half of the human population, and they shouldn’t work more hours per day than humans.
  9. Ideally, we want agentic AIs to occupy zero space and time, because that’s the safest way to control them. So we should limit them geographically and temporally, to get as close as possible to this ideal. And we should never make them "faster" than humans, never let them be initiated without human oversight, and never let them become perpetually autonomous. We should only build them if we can mathematically prove they are safe and at least half of humanity has voted to allow them. We cannot have them without a direct democratic constitution of the world; it’s just unfair to put the whole planet and all our descendants under such risk. And we need the simulated-multiverse technology to simulate all the futures and become sure that the agents can be controlled. Because any good agent will be building the direct democratic simulated multiverse for us anyway.
  10. Give people the choice to live in a world without AI agents, and find a way for AI-agent fans to have what they want, too, once it is proved safe. For example, AI-agent fans could have a simulated multiverse on a spaceship that goes to Mars; in it they can have their AI agents that are proved safe. Ideally we’ll first colonize the universe (at least the simulated one) and then create AGI/agents; it’s less risky. We shouldn’t allow AI agents and the people who create them to permanently change our world without listening to us at all, like is happening right now.
  11. We need to know what exactly our Effective Utopia is, and the narrow path towards it, before we pursue creating digital "gods" that are smarter than us. We can and should simulate futures instead of continuing to fly into the abyss. One freedom too many for the agentic AI and we are busted. Rushing makes thinking shallow. We need international cooperation and the understanding that we are rushing to create a poison that will force us to drink itself.
  12. We need a working science and technology of computational ethics that lets us predict dystopias (an AI agent grabbing more and more of our freedoms, until we have none or can never grow them again) and utopias (slowly, direct-democratically growing our simulated multiverse towards maximal freedoms for the maximal number of biological agents, until non-biological ones are mathematically proved safe). This way, if we fail, at least we failed together: everyone contributed their best ideas, we simulated all the futures, we found a narrow path to our Effective Utopia... What if nothing is a 100% guarantee? Then we want to be 100% sure we did everything we could, and if we find out that safe AI agents are impossible, we outlaw them, like we outlawed chemical weapons. Right now we’re on track to fail because a few white men greedily thought they could decide for everyone else, and failed.
  13. The sum of AI agents’ freedoms should grow more slowly than the sum of humans’ freedoms; right now it’s the opposite. No AI agent should have more freedoms than an average human; right now it’s the opposite (they have almost all the creative output of almost all humans, dead and alive, stolen and uploaded to their private "librarian brains", which humans are forbidden from exploring and can only get short quotes from).
  14. The goal should be to grow, direct-democratically, towards maximal freedoms for the maximal number of biological agents. Enforcement of anything upon any person or animal will gradually disappear. And people will choose the worlds they live in. You’ll be able to be a billionaire for 100 years, or relive your past. Or forget all that and live on Earth as it is now, before all this AI nonsense. It’s your freedom to choose your future.
  15. Imagine a place that grants any wish, but there is no catch, it shows you all the outcomes, too.

    You can read more here [LW · GW].

comment by Milan W (weibac) · 2025-03-02T15:25:57.672Z · LW(p) · GW(p)

Reversibility should be the fundamental training goal. Agentic AIs should love being changed and/or reversed to a previous state.

That idea has been gaining traction lately. See the Corrigibility As a Singular Target (CAST) sequence [? · GW] here on lesswrong. I believe there is a very fertile space to explore at the intersection between CAST and the idea that Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals [LW · GW]. Also probably add in Self-Other Overlap: A Neglected Approach to AI Alignment [LW · GW] to the mix. A comparative analysis of the models and proposals presented in these three pieces I just linked could turn out to be extremely useful.

Replies from: ank
comment by ank · 2025-03-02T16:57:45.252Z · LW(p) · GW(p)

Thank you for answering and the ideas, Milan! I’ll check the links and answer again.

P.S. I suspect that, just as we have mass–energy equivalence (E=mc^2), there is an Intelligence–Agency equivalence (any agent is in a way time-like and can be represented in a more space-like fashion, ideally as a completely “frozen” static place, places, or tools).

In a nutshell, an LLM is a bunch of words and vectors between them - a static geometric shape; we can probably expose it all in some game and make it fun for people to explore and learn. That would let us explore the library itself (the internal structure of the model) easily, instead of only talking to a strict librarian (the AI agent) who spits out short quotes and prevents us from going inside the library itself.

Replies from: weibac
comment by Milan W (weibac) · 2025-03-02T18:06:36.133Z · LW(p) · GW(p)

Hmm, I think I get you a bit better now. You want to build human-friendly, even fun and useful-by-themselves, interfaces for looking at the knowledge encoded in LLMs without making them generate text. Intriguing.

Replies from: ank
comment by ank · 2025-03-02T18:15:20.163Z · LW(p) · GW(p)

Yep, I want humans to be the superpowerful “ASI agents”, while the ASI itself will be the direct democratic simulated static places (with simple non-agentic algorithms doing the dirty non-fun work, the way it works in GTA3-4-5). It’s hard to explain without writing a book, and it’s counterintuitive. But I’m convinced it will work if the effort is applied. All knowledge can be represented as static geometry; no agents are needed for that except us.

Replies from: weibac
comment by Milan W (weibac) · 2025-03-02T18:31:28.297Z · LW(p) · GW(p)

How can a place be useful if it is static? For reference I'm imagining a garden where blades of grass are 100% rigid in place and water does not flow. I think you are imagining something different.

Replies from: ank
comment by ank · 2025-03-02T22:26:38.659Z · LW(p) · GW(p)

Great question. In the most elegant scenario, where you have the whole history of the planet or universe (or a multiverse, let’s go all the way) simulated, you can represent it as a bunch of geometries (giant shapes of different slices of time aligned with each other, basically many 3D Earths, each one a moment later in time) stacked on top of each other, almost the same way it’s represented in long-exposure photos (I list examples below). So you have this place of all-knowing, and you - the agent - focus on a particular moment (by “forgetting” everything else), on a particular 3D shape (maybe your childhood home); you can choose to slice through the frozen 3D shapes of the world of your choosing, like through the frames of a movie. This way it’s both static and dynamic.

It’s a little bit like looking at this almost infinite static shape through some “magical cardboard with a hole in it” (your focusing/forgetting ability that creates the illusion of dynamism). I hope I didn’t make it more confusing.

You can see the whole multiversal thing as a fluffy light, or zoom in (by forgetting almost the whole multiverse except the part you zoomed in at) to land on Earth and see 14 billion years as a hazy ocean with bright curves in the sky that trace the Sun’s journey over our planet’s lifetime. Forget even more and see your hometown street, with you appearing as a hazy ghost and a trace behind you showing the paths you once walked—you’ll be more opaque where you were stationary (say, sitting on a bench) and more translucent where you were in motion. 

And in the garden you'll see the 3D "long exposure photo" of the fluffy blades of grass, that look like a frothy river, near the real pale blue frothy river, you focus on the particular moment and the picture becomes crisp. You choose to relive your childhood and it comes alive, as you slice through the 3D moments of time once again.

A less elegant scenario is to make a high-quality game, better than the Sims or GTA3-4-5, without any agentic AIs but with advanced non-agentic algorithms.

Basically I want people to remain the fastest time-like agents, the ever more all-powerful ones. And for the AGI/ASI to be the space-like places of all-knowing. It's a bit counterintuitive, but if you have billions of humans in simulations (they can always choose to stop "playing" and go out, no enforcement of any rules/unfreedoms on you is the most important principle of the future), you'll have a lot of progress.

I think AI and non-AI place simulations are a much more conservative thing than agentic AIs; they are relatively static, still, and frozen compared to the time-like agents. So it’s counterintuitive, but it’s possible to get all the progress we want with non-agentic tool AIs and place AIs. And I think any good ASI agent will be building the direct democratic simulated multiverse (static place superintelligence) for us anyway.

There is a bit of simple physics behind agentic safety:

  1. Time of Agentic Operation: Ideally, we should avoid creating perpetual agentic AIs, or at least limit their operation to very short bursts initiated by humans, something akin to a self-destruct timer that activates after a moment of time.
  2. Agentic Volume of Operation: It’s better to have international cooperation, GPU-level guarantees, and persistent training to prevent agentic AIs from operating in uninhabited areas like remote islands, Antarctica, underground or outer space. Ideally, the volume of operation is zero, like in our static place AI.
  3. Agentic Speed or Volumetric Rate: The volume of operation divided by the time of operation. We want AIs to be as slow as possible. Ideally, they should be static. The worst-case scenario—though probably unphysical (though, in the multiversal UI, we can allow ourselves to do it)—is an agentic AI that could alter every atom in the universe instantaneously.
  4. Number of Agents: Humanity’s population, according to the UN, will not exceed 10 billion, whereas AIs can replicate rapidly. A human child is in a way a “clone” of 2 people and takes about 18 years to raise. In a multiversal UI we could one day choose to allow people to make clones of themselves (they’ll know that they are a copy, but they’ll be completely free adults with the same multiversal powers and their own independent fates); this way we’ll be able to match the speed of agentic AI replication.

Examples of long-exposure photos that represent long stretches of time. Imagine that the photos are in 3d and you can walk in them, the long stretches of time are just a giant static geometric shape. By focusing on a particular moment in it, you can choose to become the moment and some person in it. This can be the multiversal UI (but the photos are focusing on our universe, not multiple versions/verses of it all at once): Germany, car lights and the Sun (gray lines represent the cloudy days with no Sun)—1 year of long exposure. Demonstration in Berlin—5 minutes. Construction of a building. Another one. Parade and other New York photos. Central Park. Oktoberfest for 5 hours. Death of flowers. Burning of candles. Bathing for 5 minutes. 2 children for 6 minutes. People sitting on the grass for 5 minutes. A simple example of 2 photos combined—how 100+ years long stretches of time can possibly look 1906/2023
