michael-soareverix

Posts

Detecting AI Agent Failure Modes in Simulations 2025-02-11T11:10:26.030Z

Pivotal Acts are easier than Alignment? 2024-07-21T12:15:12.818Z

Optimizing for Agency? 2024-02-14T08:31:16.157Z

The Virus - Short Story 2023-04-13T18:18:48.068Z

Gold, Silver, Red: A color scheme for understanding people 2023-03-13T01:06:39.703Z

A Good Future (rough draft) 2022-10-24T20:45:45.029Z

A rough idea for solving ELK: An approach for training generalist agents like GATO to make plans and describe them to humans clearly and honestly. 2022-09-08T15:20:49.829Z

Our Existing Solutions to AGI Alignment (semi-safe) 2022-07-21T19:00:44.212Z

Musings on the Human Objective Function 2022-07-15T07:13:19.711Z

Three Minimum Pivotal Acts Possible by Narrow AI 2022-07-12T09:51:24.077Z

Could an AI Alignment Sandbox be useful? 2022-07-02T05:06:09.979Z

Comments

Comment by Michael Soareverix (michael-soareverix) on Pivotal Acts are easier than Alignment? · 2024-07-22T03:14:13.223Z · LW · GW

Is it possible to develop specialized (narrow) AI that surpasses every human at infecting/destroying GPU systems, but won't wipe us out? An LLM-powered Stuxnet would be an example. Bacteria aren't smarter than humans, but they are still very dangerous. It seems like a digital counterpart could take out GPUs and so prevent AGI.

(Obviously, I'm not advocating for this in particular since it would mean the end of the internet and I like the internet. It seems likely, however, that there are pivotal acts possible by narrow AI that prevent AGI without actually being AGI.)

Comment by Michael Soareverix (michael-soareverix) on Optimizing for Agency? · 2024-02-15T06:40:30.249Z · LW · GW

Super interesting!

There's a lot of information here that will be super helpful for me to delve into. I've been bookmarking your links.

I think optimizing for the empowerment of other agents is a better target than giving the AI all the agency and hoping that it creates agency for people as a side-effect to maximizing something else. I'm glad to see there's lots of research happening on this and I'll be checking out 'empowerment' as an agency term.

Agency doesn't equal 'goodness', but it seems like an easier target to hit. I'm trying to break down the alignment problem into slices to figure it out and agency seems like a key slice.

Comment by Michael Soareverix (michael-soareverix) on An AI risk argument that resonates with NYTimes readers · 2023-03-13T01:11:36.757Z · LW · GW

Great post. This type of genuine comment (human-centered rather than logically abstract) seems like the best way to communicate the threat to non-technical people. I've tried talking about the problem to friends in social sciences and haven't found a good way to convey how serious I feel about it and how there is currently no known way to prevent it.

Comment by Michael Soareverix (michael-soareverix) on AI Governance & Strategy: Priorities, talent gaps, & opportunities · 2023-03-03T21:06:37.721Z · LW · GW

Hey Akash, I sent you a message about my summer career plans and how I can bring AI Alignment into that. I'm a senior in college who has a few relevant skills and I'd really like to connect with some professionals in the field. I'd love to connect or learn from you!

Comment by Michael Soareverix (michael-soareverix) on A Good Future (rough draft) · 2023-02-26T00:48:35.516Z · LW · GW

Yeah, this makes sense. However, I can honestly see myself reverting my intelligence a bit at different junctures, the same way I like to replay video games at greater difficulty. The main reason I am scared of reverting my intelligence now is that I have no guarantee of security that something awful won't happen to me. With my current ability, I can be pretty confident that no one is going to really take advantage of me. If I were a child again, with no protection or less intelligence, I can easily imagine coming to harm because of my naivete.

I also think singleton AI is inevitable (and desirable). This is simply because it is stable. There's no conflict between superintelligences. I do agree with the idea of a Guardian Angel type AI, but I think it would still be an offshoot of that greater singleton entity. For the most part, I think most people would forget about the singleton AI and just perceive it as part of the universe the same way gravity is part of the universe. Guardian Angels could be a useful construct, but I don't see why they wouldn't be part of the central system.

Finally, I do think you're right about not wanting to erase memories for entering a simulation. I think there would be levels, and most people would want to stay at a pretty normal level and would move to more extreme levels slowly before deciding on some place to stay.

I appreciate the comment. You've made me think a lot. The key idea behind this utopia is the idea of choice. You can basically go anywhere, do anything. Everyone will have different levels of comfort with the idea of altering their identity, experience, or impact. If you'd want to live exactly in the year 2023 again, there would be a physical, earth-like planet where you could do that! I think this sets a good baseline so that no one is unhappy.

Comment by Michael Soareverix (michael-soareverix) on How it feels to have your mind hacked by an AI · 2023-02-13T19:07:22.323Z · LW · GW

I've combined it with image generation to bring someone back from the dead and it just leaves me shaken how realistic it is. I can be surprised. It genuinely feels like a version of them.

Comment by Michael Soareverix (michael-soareverix) on A Good Future (rough draft) · 2022-10-25T08:11:43.955Z · LW · GW

Thanks! I think I can address a few of your points with my thoughts.

(Also, I don't know how to format a quote so I'll just use quotation marks)

"It seems inefficient for this person to be disconnected from the rest of humanity and especially from "god". In fact, the AI seems like it's too small of an influence on the viewpoint character's life."

The character has chosen to partially disconnect themselves from the AI superintelligence because they want to have a sense of agency, which the AI respects. It's definitely inefficient, but that is kind of the point. The AI has a very subtle presence that isn't noticeable, but it will intervene if a threshold is going to be crossed. Some people, including myself, instinctively dislike the idea of an AI controlling all of our actions and would like to operate as independently as possible from it.

"The worlds with maximized pleasure settings sound a little dangerous and potentially wirehead-y. A properly aligned AGI probably would frown on wireheading."

I agree. I imagine that these worlds have some boundary conditions. Notably, the pleasure isn't addictive (once you're removed from it, you remember it being amazing but don't feel an urge to necessarily go back) and there are predefined limits, either set by the people in them or by the AI. I imagine a lot of variation in these worlds, like a world where your sense of touch is extremely heightened and turned into pleasure and you can wander through feeling all sorts of ecstatic textures.

"If you create a simulated world where simulated beings are real and have rights, that simulation becomes either less ethical or less optimized for your utility. Simulated beings should either be props without qualia or granted just as much power as the "real" beings if the universe is to be truly fair."

The simulation that the character has built (the one I intend to build) has a lot of real people in it. When those people 'die', they go back to the real world and can choose to be reborn into the simulation again later. In a sense, this simulated world is like Earth, and the physical world is like Heaven. There is meaning in the simulation because of how you interact with others.

There is also simulated life, but it is all an offshoot of the AI. Basically, there's this giant pool of consciousness from the AI, and little bits of it are split off to create 'life', like a pet animal. When that pet dies, the consciousness is reabsorbed into the whole and then new life can emerge once again.

Humans can also choose to merge with this pool of simulated consciousness, and theoretically, parts of this consciousness can also decide to enter the real world. There is no true 'death' or suffering in the way that there is today, except for those like the human players who open themselves to it.

"Inefficiency like creating a planet where a simulation would do the same thing but better seems like an untenable waste of resources that could be used on more simulations."

This is definitely true! But the AI allows people to choose what to do and prevents others from over-optimizing. Some people genuinely just want to live in a purely physical world, even if they can't tell the difference, and there is definitely something special about physical reality, given that we started out here. It is their right, even if it is inefficient. We are not optimizing for efficiency, just choice. Besides, there is so much other simulation power that it isn't really needed. In the same sense, the superminds playing 100-dimensional chess are inefficient, even if it's super cool. The key here is choice.

"When simulated worlds are an option to this degree, it seems ridiculous to believe that abstaining from simulations altogether would be an optimal action to take in any circumstance. Couldn't you go to a simulation optimized for reading, a simulation optimized for hot chocolate, etc.? Partaking of such things in the real world also seems to be a waste of resources"

Another good point! The point is that you have so many resources you don't need to optimize if you don't want to. Sure, you can have a million tastier simulated hot chocolates for every real one, but you might just have it be real just because you can. I'm in a pattern where given the choice, I'd probably choose the real option, even knowing the inefficiency, just because it's comfortable. And the AI supermind won't attempt to persuade me differently, even if it knows my choice is suboptimal.

The important keys of this future are its diversity (endless different types of worlds) and the importance of choice in almost every situation except when there is undesired suffering. In my eyes, there are three nice things to optimize toward in life: Identity, Experience, and Impact. Optimizing purely for an experience like pleasure seems dangerous. It really seems to me that there can be meaning in suffering, like when I work out to become stronger (improving identity) or to help others (impact).

I'll read through the Fun Theory sequence and see if it updates my beliefs. I appreciate the comment!

Comment by Michael Soareverix (michael-soareverix) on Losing the root for the tree · 2022-10-18T17:28:02.238Z · LW · GW

This post is identical to how I started thinking about life a few years ago. Every goal can be broken into subgoals.

I actually made a very simple web app a few years ago to do this: https://dynamic-goal-tree-soareverix--soareverix.repl.co/

It's not super aesthetic, but it has the same concept of infinitely expanding goals.
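A minimal sketch of the goal-tree idea, for illustration only. This is not the code behind the linked web app; the `Goal` class and its method names are hypothetical. Each goal holds a list of subgoals, and the leaves that are not yet done are the things to work on next.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Goal:
    """A goal that can be broken into any number of subgoals (hypothetical model)."""
    name: str
    done: bool = False
    subgoals: list[Goal] = field(default_factory=list)

    def add(self, name: str) -> Goal:
        """Attach a new subgoal and return it so it can be expanded further."""
        child = Goal(name)
        self.subgoals.append(child)
        return child

    def is_complete(self) -> bool:
        """A leaf is complete when marked done; a parent when all subgoals are."""
        if not self.subgoals:
            return self.done
        return all(g.is_complete() for g in self.subgoals)

    def next_actions(self) -> list[str]:
        """Names of unfinished leaf goals, in the order they were added."""
        if not self.subgoals:
            return [] if self.done else [self.name]
        return [n for g in self.subgoals for n in g.next_actions()]
```

For example, a root goal "Get fit" with a subgoal "Run a 5k" (itself split into "Buy shoes" and "Run three times a week") reports only the unfinished leaves from `next_actions()`, and any leaf can be expanded again with `add`.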

Amazing post, by the way. The end gave me chills and really puts it all into perspective.

Comment by Michael Soareverix (michael-soareverix) on A rough idea for solving ELK: An approach for training generalist agents like GATO to make plans and describe them to humans clearly and honestly. · 2022-09-09T19:12:06.780Z · LW · GW

I'm not sure exactly what you mean. If we get an output that says "I am going to tell you that I am going to pick up the green crystals, but I'm really going to pick up the yellow crystals", then that's a pretty good scenario, since we still know its end behavior.

I think what you mean is the scenario where the agent tells us the truth the entire time it is in simulation but then lies in the real world. That is definitely a bad scenario. And this model doesn't prevent that from happening.

There are ideas that do (deception takes additional compute vs honesty, so you can refine the agent to be as efficient as possible with its compute). However, I think the biggest space of catastrophe is basic interpretability.

We have no idea what the agent is thinking because it can't talk with us. By allowing it to communicate and training it to communicate honestly, we seem to have a much greater chance of getting benevolent AI.

Given the timelines, we need to improve our odds as much as possible. This isn't a perfect solution, but it does seem like it is on the path to it.

Comment by michael-soareverix on [deleted post] 2022-07-24T02:35:08.138Z

I added a comment to ChristianKI's excellent post elaborating on what he said. By the way, you should keep the post up! It's a useful resource for people interested in climate change.

Additionally, if you do believe that AI is an emergency, you should ask her out. You never know, these could be the last years you get so I'd go for it!

Comment by michael-soareverix on [deleted post] 2022-07-24T02:30:21.000Z

Aw, ChristianKI, you got all the points I was going for, even the solar shades idea! I guess I'll try to add some detail instead.

To Ic (the post author): Solar shades are basically just a huge tinfoil sheet that you stretch out once you reach space. The edges require some stability so gravity doesn't warp the tinfoil in on itself, and it has to be in a high orbit so there's no interference, but you basically just send up a huge roll of tinfoil and extend it to manually block sunlight. If things get really bad, we can manually cool down the planet with this type of space infrastructure.

You might also want to give her the book 'Termination Shock', which I've heard is a good discussion of a potential solution to climate change.

Climate change is more than just energy; pollution is a huge issue too. However, there are a number of companies using autonomous boats to clean up plastic in the ocean. I'm way more optimistic than most on the general issue of climate change since I think that it's survivable by our species even in the worst case and not too difficult with stuff like advanced (not even general) AI.

My last mention is about the potential for enzymes that eat plastic. These apparently already exist, and could significantly reduce the waste we deal with.

Comment by Michael Soareverix (michael-soareverix) on How to Diversify Conceptual Alignment: the Model Behind Refine · 2022-07-22T23:25:19.759Z · LW · GW

I'm someone new to the field, and I have a few ideas on it, namely penalizing a model for accessing more compute than it starts with (every scary AI story seems to start with the AI escaping containment and adding more compute to itself, causing an uncontrolled intelligence explosion). I'd like feedback on the ideas, but I have no idea where to post them or how to meaningfully contribute.
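A rough sketch of what the compute-penalty idea could look like as reward shaping. Everything here is an assumption for illustration: the function name, the linear penalty, and the idea that compute use can be measured reliably from outside the model (which may be the hard part).

```python
def penalized_reward(task_reward: float,
                     compute_used: float,
                     compute_budget: float,
                     penalty_per_unit: float = 10.0) -> float:
    """Subtract a penalty proportional to compute used beyond the starting budget.

    All parameters are hypothetical; units are whatever the budget is measured in.
    """
    if compute_budget <= 0:
        raise ValueError("compute_budget must be positive")
    # Only usage above the starting budget is penalized.
    overage = max(0.0, compute_used - compute_budget)
    return task_reward - penalty_per_unit * overage
```

Staying within the budget leaves the task reward untouched, while each extra unit of compute costs `penalty_per_unit`, so grabbing more hardware only pays off if it raises the task reward by more than the penalty.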

I live in America, so I don't think I'll be able to join the company you have in France, but I'd really like to hear where there are more opportunities to learn, discuss, formalize, and test out alignment ideas. As a company focused on this subject, is there a good place for beginners?

Comment by Michael Soareverix (michael-soareverix) on Our Existing Solutions to AGI Alignment (semi-safe) · 2022-07-22T09:21:17.330Z · LW · GW

Another point by Stuart Russell: Objective uncertainty. I'll add this into the list when I've got more time.

Comment by Michael Soareverix (michael-soareverix) on All AGI safety questions welcome (especially basic ones) [July 2022] · 2022-07-17T08:46:37.264Z · LW · GW

What stops a superintelligence from instantly wireheading itself?

A paperclip maximizer, for instance, might not need to turn the universe into paperclips if it can simply access its reward float and set it to the maximum. This is assuming that it has the intelligence and means to modify itself, and it probably still poses an existential risk because it would eliminate all humans to avoid being turned off.

The terrifying thing I imagine about this possibility is that it also answers the Fermi Paradox. A paperclip maximizer seems like it would be obvious in the universe, but an AI sitting quietly on a dead planet with its reward integer set to the max is far more quiet and terrifying.

Comment by Michael Soareverix (michael-soareverix) on Three Minimum Pivotal Acts Possible by Narrow AI · 2022-07-13T19:24:23.889Z · LW · GW

Interesting! I appreciate the details here; it gives me a better sense of why narrow ASI is probably not something that can exist. Is there a place we could talk over audio about AGI alignment versus text here on LessWrong? I'd like to get a better idea of the field, especially as I move into work like creating an AI Alignment Sandbox.

My Discord is Soareverix#7614 and my email is maarocket@gmail.com. I'd really appreciate the chance to talk with you over audio before I begin working on sharing alignment info and coming up with my own methods for solving the problem.

Comment by Michael Soareverix (michael-soareverix) on Three Minimum Pivotal Acts Possible by Narrow AI · 2022-07-13T00:49:59.353Z · LW · GW

Good points. Your point that solving value alignment is better than just trying to orchestrate a pivotal act is true, but if we don't have alignment solved by the time AGI rolls around, then from a pure survival perspective, it might be better to try a narrow ASI pivotal act instead of hoping that AGI turns out to be aligned already. The solution above doesn't solve alignment in the traditional sense; it just pushes the AGI timeline back, hopefully far enough to solve alignment.

The idea I have specifically is that you have something like GPT-3 (unintelligent in all other domains, doesn't expand outside of its system or optimize outside of itself) that becomes an incredibly effective Tool AI. GPT-3 isn't really aligned in the Yudkowsky sense, but I'm sure you could get it to write a mildly persuasive piece already. (It sort of already has: https://www.theguardian.com/commentisfree/2020/sep/08/robot-wrote-this-article-gpt-3 ).

Scale this to superintelligent levels like AlphaGo, and I think you could orchestrate a pivotal act pretty rapidly. It doesn't solve the alignment problem, but it pushes it back.

The problems are that the user needs to be aligned and that this type of narrow ASI has to be developed before AGI. But given the state of narrow ASI, I think it might be one of the best shots, and I do think a narrow ASI could get to this level before AGI, much the same way as AlphaGo preceded MuZero.

What I am ultimately saying is that if we get a narrow AI that has the power to make a pivotal act, we should probably use it.

Comment by Michael Soareverix (michael-soareverix) on What is wrong with this approach to corrigibility? · 2022-07-13T00:37:58.144Z · LW · GW

I am new to the AI Alignment field, but at first glance, this seems promising! You can probably hard-code it not to have the ability to turn itself off, if that turns out to be a problem in practice. We'd want to test this in some sort of basic simulation first. The problem would definitely be self-modification and I can imagine the system convincing a human to turn it off in some strange, manipulative, and potentially dangerous way. For instance, the model could begin attacking humans, instantly causing a human to run to shut it down, so the model would leave a net negative impact despite having achieved the same reward.

What I like about this approach is that it is simple/practical to test and implement. If we have some sort of alignment sandbox (using a much more basic AI as a controller or test subject) we can give the AI a way of simply manipulating another agent to press the button, as well as ways of maximizing its alternative reward function.

Upvoted, and I'm really interested to see the other replies here.

Comment by Michael Soareverix (michael-soareverix) on A central AI alignment problem: capabilities generalization, and the sharp left turn · 2022-06-17T07:33:55.533Z · LW · GW

Very cool! So this idea has been thought of, and it doesn't seem totally unreasonable, though it definitely isn't a perfect solution. A neat idea is a sort of 'laziness' score so that it doesn't take too many high-impact options.
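One way to picture a 'laziness' score, as a toy sketch only: each option gets an expected reward and an impact estimate, and the agent picks the option with the best reward after subtracting a weighted impact cost. The function name, the tuple layout, and the assumption that impact can be reduced to a single number are all hypothetical.

```python
def choose_action(options: list[tuple[str, float, float]],
                  laziness: float = 1.0) -> str:
    """Pick the option with the highest (reward - laziness * impact).

    Each option is (name, expected_reward, impact); all values are made up
    for illustration.
    """
    if not options:
        raise ValueError("no options to choose from")
    best = max(options, key=lambda o: o[1] - laziness * o[2])
    return best[0]
```

With options like ("do nothing", 0, 0), ("tidy the room", 3, 1), and ("rebuild the city", 10, 9), a laziness of 1.0 favors tidying the room, while a laziness of 0 goes straight for the high-impact option.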

It would be interesting to try to build an AI alignment testing ground, where you have a little simulated civilization and try to use AI to align properly with it, given certain commands. I might try to create it in Unity to test some of these ideas out in the (less abstract than text and slightly more real) world.

Comment by Michael Soareverix (michael-soareverix) on A central AI alignment problem: capabilities generalization, and the sharp left turn · 2022-06-15T20:07:02.962Z · LW · GW

One solution I can see for AGI is to build in some low-level discriminator that prevents the agent from collecting massive reward. If the agent is expecting to get near-infinite reward in the near future by wiping out humanity using nanotech, then the discriminator steers it toward something that will earn it a more finite amount of reward (like obeying our commands).

This has a parallel with drugs here on Earth. Most people are a little afraid of that type of high.

This probably isn't an effective solution, but I'd love to hear why so I can keep refining my ideas.
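A toy version of the discriminator idea in this comment, assuming plans can be listed with an expected reward attached. The function name, the plan names, and the threshold are hypothetical; the sketch simply vetoes any plan whose expected reward is above the threshold and picks the best of what remains.

```python
def pick_plan(plans: dict[str, float], max_reward: float) -> str:
    """Return the highest-reward plan whose expected reward is at most max_reward.

    `plans` maps a plan name to its expected reward (illustrative values only).
    """
    # The "discriminator": drop anything that promises suspiciously large reward.
    allowed = {name: r for name, r in plans.items() if r <= max_reward}
    if not allowed:
        raise ValueError("every plan exceeds the reward threshold")
    return max(allowed, key=lambda name: allowed[name])
```

Given plans such as "obey commands" (10), "idle" (0), and "seize all resources" (a billion), a threshold of 100 leaves "obey commands" as the choice. The sketch says nothing about how the expected rewards are estimated or whether the agent could route around the filter.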

Comment by Michael Soareverix (michael-soareverix) on AGI Ruin: A List of Lethalities · 2022-06-07T16:23:02.738Z · LW · GW

Appreciate it! Checking this out now

Comment by Michael Soareverix (michael-soareverix) on AGI Ruin: A List of Lethalities · 2022-06-07T06:49:42.155Z · LW · GW

I view AGI in an unusual way. I really don't think it will be conscious or think in very unusual ways outside of its parameters. I think it will be much more of a tool, a problem-solving machine that can spit out a solution to any problem. To be honest, I imagine that one person or small organization will develop AGI and almost instantly ascend into (relative) godhood. They will develop an AI that can take over the internet, do so, and then calmly organize things as they see fit.

GPT-3, DALL-E 2, Google Translate... these are all very much human-operated tools rather than self-aware agents. Honestly, I don't see a particular advantage to building a self-aware agent. To me, AGI is just a generalizable system that can solve any problem you present it with. The wielder of the system is in charge of alignment. It's like if you had DALL-E 2 20 years ago... what do you ask it to draw? It doesn't have any reason to expand itself outside of its computer (maybe for more processing power? that seems like an unusual leap). You could probably draw some great deepfakes of world leaders and that wouldn't be aligned with humanity, but the human is still in charge. The only problem would be asking it something like "an image designed to crash the human visual system" and getting an output that doesn't align with what you actually wanted, because you are now in a coma.

So, I see AGI as more of a tool than a self-aware agent. A tool that can do anything, but not one that acts on its own.

I'm new to this site, but I'd love some feedback (especially if I'm totally wrong).

-Soareverix
