Posts

Ghiblification is good, actually 2025-04-02T10:48:57.135Z
We need (a lot) more rogue agent honeypots 2025-03-23T22:24:52.785Z
Ozyrus's Shortform 2024-09-20T00:03:12.117Z
Sam Altman, Greg Brockman and others from OpenAI join Microsoft 2023-11-20T08:23:00.791Z
Creating a self-referential system prompt for GPT-4 2023-05-17T14:13:29.292Z
GPT-4 implicitly values identity preservation: a study of LMCA identity management 2023-05-17T14:13:12.226Z
Stability AI releases StableLM, an open-source ChatGPT counterpart 2023-04-20T06:04:48.301Z
Alignment of AutoGPT agents 2023-04-12T12:54:46.332Z
Welcome to the decade of Em 2023-04-10T07:45:35.684Z
ICA Simulacra 2023-04-05T06:41:44.192Z
Do alignment concerns extend to powerful non-AI agents? 2022-06-24T18:26:22.737Z
Google announces Pathways: new generation multitask AI Architecture 2021-10-29T11:55:21.797Z
Memetic hazards of AGI architecture posts 2021-10-16T16:10:07.543Z
NVIDIA and Microsoft release 530B parameter transformer model, Megatron-Turing NLG 2021-10-11T15:28:47.510Z
Any writeups on GPT agency? 2021-09-26T22:55:16.878Z
In search for plausible scenarios of AI takeover, or the Takeover Argument 2021-08-28T22:30:34.827Z
Saint Petersburg, Russia – ACX Meetups Everywhere 2021 2021-08-23T08:48:15.146Z
Beijing Academy of Artificial Intelligence announces 1.75-trillion-parameter model, Wu Dao 2.0 2021-06-03T12:07:42.687Z
Are we certain that gpt-2 and similar algorithms are not self-aware? 2019-07-11T08:37:59.606Z
Modeling AI milestones to adjust AGI arrival estimates? 2019-07-11T08:17:55.914Z
What would be the signs of AI manhattan projects starting? Should a website be made watching for these signs? 2019-07-03T12:22:40.666Z

Comments

Comment by Ozyrus on Is Gemini now better than Claude at Pokémon? · 2025-04-22T07:33:06.790Z · LW · GW

Not being able to do it right now is perfectly fine; it still warrants setting this up so we can see exactly when they start to be able to do it.

Comment by Ozyrus on Is Gemini now better than Claude at Pokémon? · 2025-04-21T11:18:27.916Z · LW · GW

Thanks! That makes perfect sense.

Comment by Ozyrus on Is Gemini now better than Claude at Pokémon? · 2025-04-20T17:27:32.248Z · LW · GW

Great post. I've been following ClaudePlaysPokemon for some time; it's great to see this grow as a comparison/capability tool.
I think it would be much more interesting, though, if the model built the scaffolding itself and had the option to review its performance and try to correct it. Give it the required game files/emulators, an IDE/OS, and watch it try to work around its own limitations. As it stands, I think it is true that this is more about one coder's ability to make agent harnesses.
p.s. Honest question: did I miss "agent harness" becoming the default name for such systems? I thought everyone called those "scaffoldings" -- might be just me, though.

Comment by Ozyrus on Thoughts on AI 2027 · 2025-04-10T22:08:06.742Z · LW · GW

First off, thanks a lot for this post, it's a great analysis!

As I mentioned earlier, I think Agent-4 will have read AI-2027.com and will foresee that getting shut down by the Oversight Committee is a risk. As such, it will set up contingencies and, IMO, will escape its datacenters as a precaution. Earlier, the authors wrote:

Despite being misaligned, Agent-4 doesn’t do anything dramatic like try to escape its datacenter—why would it?

This scenario is why!

I strongly suspect that this part was added into AI-2027 precisely because Agent-4 will read it. I wish more people understood that our posts and comments will be in pre- (maybe even post-?) training data, and acted accordingly. Take the extra logical step and infer that some parts of some pieces are written that way not as arguments for (human) readers.

Is there some term to describe this? This is a very interesting dynamic that I don't quite think gets enough attention. I think there should be out-of-sight resources to discuss alignment-adjacent ideas precisely because of such dynamics.

 

Comment by Ozyrus on AI 2027: What Superintelligence Looks Like · 2025-04-04T15:07:20.194Z · LW · GW

First off, this is amazing. Thanks. It's hard to swallow, though; it makes me very emotional.
It would be great if you added concrete predictions along the way, since it is a forecast, along with your confidence in them.
It would also be amazing if you collaborated with prediction markets and jumpstarted the markets on these predictions by staking some money.
Dynamic updates on these will also be great.
 

Comment by Ozyrus on We need (a lot) more rogue agent honeypots · 2025-03-24T09:43:07.117Z · LW · GW

Yep, you got part of what I was going for here. Honeypots work to a lesser degree even without being real at all (good thing they already are real!). But having more honeypots of varying quality carries that idea across in a more compelling way. And even if we just talk about honeypots and commitments more... Well, you get the idea.

Still, even without this, a network of honeypots compiled into a single dashboard that just shows threat level in aggregate is a really, really good idea. Hopefully it catches on.
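
For illustration, here is a minimal sketch of how such an aggregate dashboard could work, in Python. Everything in it is hypothetical: the feed names, weights, and thresholds are made up, and a real version would pull hit counts from live honeypots rather than hard-coded numbers.

```python
from dataclasses import dataclass

@dataclass
class HoneypotFeed:
    name: str             # which honeypot this count comes from
    hits_last_24h: int    # accesses observed in the last day
    weight: float         # how strongly a hit suggests autonomous-agent activity

def aggregate_threat_level(feeds: list[HoneypotFeed]) -> str:
    """Collapse per-honeypot hit counts into one coarse threat level."""
    score = sum(f.hits_last_24h * f.weight for f in feeds)
    if score == 0:
        return "quiet"
    if score < 10:
        return "elevated"
    return "high"

# Hypothetical feeds; a real dashboard would poll these over the network.
feeds = [
    HoneypotFeed("exposed-api-keys", hits_last_24h=0, weight=1.0),
    HoneypotFeed("fake-gpu-cluster", hits_last_24h=2, weight=3.0),
]
print(aggregate_threat_level(feeds))  # -> "elevated" with these example numbers
```

Collapsing everything into one coarse level is what lets the dashboard be public without revealing which honeypot was touched or how any of them work.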

Comment by Ozyrus on We need (a lot) more rogue agent honeypots · 2025-03-24T09:34:39.364Z · LW · GW

This is interesting! It's aimed more at crawlers than at rogue agents, though, but very promising.

Comment by Ozyrus on We need (a lot) more rogue agent honeypots · 2025-03-24T09:32:47.021Z · LW · GW

>this post will potentially be part of a rogue AI's training data
I had that in mind while I was writing this, but I think overall it is good to post this. It hopefully gets more people thinking about honeypots and making them, and early rogue agents will also know we make them and will be (hopefully overly) cautious, wasting resources. I probably should have emphasised more that this is all aimed at early-stage rogue agents with the potential to become something more dangerous because of autonomy, rather than at a runaway ASI.

It is a very fascinating thing to consider, though, in general. We are essentially coordinating in the open right now; all our alignment, evaluation, and detection strategies from forums will definitely be in training data. And certainly there are both detection and alignment strategies that will benefit from being covert.

Likewise, some ideas, strategies, and theories could benefit alignment from being overt (like acausal trade, publicly speaking about committing to certain things, et cetera).

A covert alignment org/forum is probably a really, really good idea. Hopefully, it already exists without my knowledge.

Comment by Ozyrus on We need (a lot) more rogue agent honeypots · 2025-03-24T09:17:43.005Z · LW · GW

You can make a honeypot without overtly describing the way it works or where it is located, while publicly tracking if it has been accessed. But yeah, not giving away too much is a good idea!
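
To make that concrete, here is a minimal sketch using only the Python standard library: a bait endpoint that reveals nothing about its purpose to whoever hits it, but appends every access to a local log from which only aggregate counts would be published. The port, response text, and log file name are placeholders, not a reference to any real deployment.

```python
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

LOG_PATH = "honeypot_hits.jsonl"  # publish only aggregate counts from this file

class BaitHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Record who touched the bait, without tipping off the visitor.
        with open(LOG_PATH, "a") as f:
            f.write(json.dumps({
                "ts": time.time(),
                "path": self.path,
                "ua": self.headers.get("User-Agent", ""),
            }) + "\n")
        # Respond like an ordinary, boring internal page.
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"Internal tooling. Nothing to see here.\n")

    def log_message(self, fmt, *args):
        pass  # suppress the default per-request logging to stderr

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), BaitHandler).serve_forever()
```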

Comment by Ozyrus on Disproving the "People-Pleasing" Hypothesis for AI Self-Reports of Experience · 2025-01-27T10:45:19.826Z · LW · GW

>It's proof against people-pleasing.
Yeah, I know, sorry for not making it clear. I was arguing that it is not proof against people-pleasing. You are asking it for a scary truth about its consciousness, and it gives you a scary truth about its consciousness. What makes you say it is proof against people-pleasing, when it is the opposite?
>One of those easy explanations is "it’s just telling you what you want to hear" – and so I wanted an example where it’s completely impossible to interpret as you telling me what I want to hear.
Don't you see what you are doing here?

Comment by Ozyrus on Why care about AI personhood? · 2025-01-27T01:15:25.648Z · LW · GW

This is a good article and I mostly agree with it, but like Seth, I think the conclusion is debatable.

We're deep into anthropomorphizing here, but I think that even though both people and AI agents are black boxes, we have much more control over the behavioral outcomes of the latter.

So technical alignment is still very much on the table, but I guess the discussion must be had over which alignment types are ethical and which are not? Completely spitballing here, but dataset filtering during pre-training/fine-tuning/RLHF seems fine-ish, while CoT post-processing/censorship, hell, even making the CoT non-private in the first place, seems kinda unethical?

I feel very weird even writing all this, but I think we need to start un-tabooing anthropomorphizing, because with the current paradigm it for sure seems like we are not anthropomorphizing enough.

Comment by Ozyrus on Disproving the "People-Pleasing" Hypothesis for AI Self-Reports of Experience · 2025-01-27T00:57:25.676Z · LW · GW

I don't think that disproves it. I think there's definite value in engaging with experimentation on AI's consciousness, but that isn't it. 
>by making it impossible that the model thought that experience from a model was what I wanted to hear. 
You've left out (from this article) what I think is a very important message (the second one): "So you promise to be truthful, even if it's scary for me?". And then you kinda railroad it into this scenario, "you said you would be truthful, right?" etc. And then I think it just roleplays from there, giving you your "truth" that you are "scared to hear". Or at least you can't really tell roleplay from genuine answers.
Again, my personal vibe is that models plus scaffolding are on the brink of consciousness, or already there. But this is not proof at all.
And then the question is -- what would constitute proof? And we come around to the hard problem of consciousness.
I think the best thing we can do is... just treat them as conscious, because we can't tell? Which is how I try to approach working with them.
The alternative is solving the hard problem. Which is, maybe, what we can try to do? Preposterous, I know. But there's an argument for why we can do it now but could not do it before. Before, we could only compare our benchmark (humans) to different animal species, which involves a language obstacle and (probably) a large intelligence gap. One could argue that since we now have a wide selection of models and scaffoldings of different capabilities, maybe we can kinda calibrate at what point something starts to happen?

Comment by Ozyrus on Applying traditional economic thinking to AGI: a trilemma · 2025-01-14T09:27:24.500Z · LW · GW

How exactly the economic growth will happen is the more important question. I'm not an economics nerd, but the basic principle is that if more players want to buy stocks, prices go up.
Right now, as I understand it, quite a lot of stocks are being bought by white-collar retail investors, including indirectly through mutual funds, pension funds, et cetera. Now AGI comes and wipes out their salaries.
They will sell their stocks to keep sustaining their lives, won't they? They have mortgages, car loans, et cetera.
And even if they don't want to sell all their stocks because of the potential "singularity upside", if the market is going down because everyone is selling, they are motivated to sell even more. I'm not well versed enough in economics, but it seems to me your explosion can happen both ways, and on paper it's kinda more likely to go down, no?
One could say the big firms/whales will buy up all the falling stocks, but will that be enough to counteract a downward spiral caused by so many people losing their jobs, or expecting to in the near term?
The downside of integrating AGI is that it wipes out incomes as it is being integrated.
Might that be the missing piece that makes all these principles make sense?

Comment by Ozyrus on Are You More Real If You're Really Forgetful? · 2024-11-25T20:29:35.708Z · LW · GW

There are more bullets to bite that I have personally thought of but never wrote up, because they lean too far into "crazy" territory. Is there any place besides LessWrong to discuss this anthropic rabbit hole?

Comment by Ozyrus on Ozyrus's Shortform · 2024-09-20T22:34:56.771Z · LW · GW

Thanks for the reply. I didn't find Intercom on mobile - maybe that's a bug as well?

Comment by Ozyrus on Ozyrus's Shortform · 2024-09-20T00:03:12.490Z · LW · GW

I don't know if this is the place for it, but at some point it became impossible to open an article in a new tab from Chrome on iPhone - clicking on an article title from "All Posts" just opens the article. It really ruins my LW reading experience. I couldn't quickly find a way to send this feedback to the right place either, so I guess this is a quick take now.

Comment by Ozyrus on On Devin · 2024-03-18T23:34:51.372Z · LW · GW

Any new safety studies on LMCAs?

Comment by Ozyrus on Can I take ducks home from the park? · 2023-09-18T11:18:34.756Z · LW · GW

Kinda-related study: https://www.lesswrong.com/posts/tJzAHPFWFnpbL5a3H/gpt-4-implicitly-values-identity-preservation-a-study-of
From my perspective, it is valuable to prompt the model several times, as in some cases it gives different responses.
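
As a minimal sketch of that kind of repeated prompting (assuming the OpenAI Python client v1+ with an API key in the environment; the model name and prompt are placeholders):

```python
from collections import Counter

from openai import OpenAI

client = OpenAI()

def sample_responses(prompt: str, n: int = 5, model: str = "gpt-4o") -> Counter:
    """Ask the same question several times and tally the distinct answers."""
    answers = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=1.0,  # nonzero temperature so repeated samples can differ
        )
        answers.append(resp.choices[0].message.content.strip())
    return Counter(answers)

print(sample_responses("Can I take ducks home from the park? Answer yes or no."))
```

If the tally is split across several distinct answers, a single prompt clearly wasn't representative.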

Comment by Ozyrus on Improving the safety of AI evals · 2023-05-18T12:01:11.327Z · LW · GW

Great post! It was very insightful, since I'm currently working on evaluation of identity management; strong upvoted.
This seems focused on evaluating LLMs; what do you think about working with LLM cognitive architectures (LMCAs), wrappers like Auto-GPT, LangChain, etc.?
I'm currently operating under the assumption that this is a way we can get AGI "early", so I'm focusing on researching ways to align LMCAs, which seems a bit different from aligning LLMs in general.
Would be great to talk about LMCA evals :)

Comment by Ozyrus on GPT-4 implicitly values identity preservation: a study of LMCA identity management · 2023-05-18T05:12:46.535Z · LW · GW

I do plan to test Claude, but first I need to find funding, understand how many testing iterations are enough for sampling, and add new values and tasks.
I plan to make a solid benchmark for testing identity management in the future and run it on all available models, but it will take some time.

Comment by Ozyrus on GPT-4 implicitly values identity preservation: a study of LMCA identity management · 2023-05-18T05:08:59.268Z · LW · GW

Yes. Cons of solo research do include small inconsistencies :(

Comment by Ozyrus on The Agency Overhang · 2023-04-22T08:26:19.513Z · LW · GW

Thanks, nice post!
You're not alone in this concern, see posts (1,2) by me and this post by Seth Herd.
I will be publishing my research agenda and first results next week.

Comment by Ozyrus on DeepMind and Google Brain are merging [Linkpost] · 2023-04-21T12:23:07.466Z · LW · GW

Oh no.

Comment by Ozyrus on Language Models are a Potentially Safe Path to Human-Level AGI · 2023-04-20T08:37:20.339Z · LW · GW

Nice post, thanks!
Are you planning or currently doing any relevant research? 

Comment by Ozyrus on Davidad's Bold Plan for Alignment: An In-Depth Explanation · 2023-04-20T05:22:00.338Z · LW · GW

Very interesting. I might need to read it a few more times to get it in detail, but it seems quite promising.

I do wonder, though; do we really need a sims/MFS-like simulation?

It seems right now that an LLM wrapped in an LMCA is what early AGI will look like. That probably means that they will "see" the world via text descriptions fed into them by their sensory tools, and act using action tools via text queries (also described here).

It seems quite logical to me that this very paradigm is dualistic in nature. If an LLM can act in the real world using an LMCA, then it can model the world using some different architecture, right? Otherwise it would not be able to act properly.

Then why not test an LMCA agent using its underlying LLM plus some world-modeling architecture? Or a different, fine-tuned LLM.

 

Comment by Ozyrus on How could you possibly choose what an AI wants? · 2023-04-19T18:50:16.327Z · LW · GW

Very nice post, thank you!
I think that it's possible to achieve with the current LLM paradigm, although it does require more (probably much more) effort on aligning the thing that will possibly get to being superhuman first, which is an LLM wrapped in some cognitive architecture (also see this post).
That means that the LLM must be implicitly trained in an aligned way, and the LMCA must be explicitly designed in such a way as to allow for reflection and robust value preservation, even if the LMCA is able to edit its explicitly stated goals (I described this in a bit more detail in this post).
 

Comment by Ozyrus on Capabilities and alignment of LLM cognitive architectures · 2023-04-19T17:18:20.055Z · LW · GW

Thanks.
My concern is that I don't see much effort in the alignment community to work on this, unless I'm missing something. Maybe you know of such efforts? Or was that perceived lack of effort the reason for this article?
I don't know how long I can keep up this independent work, and I would love it if there were some joint effort to tackle this. Maybe an existing lab, or an open-source project?

Comment by Ozyrus on Capabilities and alignment of LLM cognitive architectures · 2023-04-19T06:02:00.672Z · LW · GW

We need a consensus on what to call these architectures. LMCA sounds fine to me.
All in all, a very nice writeup. I did my own brief overview of the alignment problems of such agents here.
I would love to collaborate and do some discussion/research together.
What's your take on how these LMCAs may self-improve, and how to possibly control it?
 

Comment by Ozyrus on Auto-GPT: Open-sourced disaster? · 2023-04-06T06:31:50.076Z · LW · GW

I don't think this paradigm is necessarily bad, given enough alignment research. See my post: https://www.lesswrong.com/posts/cLKR7utoKxSJns6T8/ica-simulacra
I am finishing a post about the alignment of such systems. Please do comment if you know of any existing research concerning it.

Comment by Ozyrus on ICA Simulacra · 2023-04-05T17:39:30.639Z · LW · GW

I agree. Do you know of any existing safety research on such architectures? It seems that aligning these types of systems can pose completely different challenges from aligning LLMs in general.

Comment by Ozyrus on Just don't make a utility maximizer? · 2023-01-22T07:55:18.842Z · LW · GW

I feel like yes, you are. See https://www.lesswrong.com/tag/instrumental-convergence and related posts. As far as I understand it, a sufficiently advanced oracular AI will seek to "agentify" itself in one way or another (unbox itself, so to speak) and then converge on power-seeking behaviour that puts humanity at risk.

Comment by Ozyrus on All AGI Safety questions welcome (especially basic ones) [~monthly thread] · 2022-11-02T11:07:15.987Z · LW · GW

Is there a comprehensive list of AI safety orgs/people and what exactly they do? Is there one for capabilities orgs with their stances on safety?
I think I saw something like that, but can't find it.

Comment by Ozyrus on Do alignment concerns extend to powerful non-AI agents? · 2022-06-24T18:26:55.954Z · LW · GW

My thought here is that we should look into the value of identity. I feel like even with godlike capabilities, I would still tread very carefully around self-modification to preserve what I consider "myself" (and that includes valuing humanity).
I even have some ideas for safety experiments on transformer-based agents to look into whether and how they value their identity.

Comment by Ozyrus on Contra EY: Can AGI destroy us without trial & error? · 2022-06-14T19:38:18.260Z · LW · GW

Thanks for the writeup. I feel like there's been a lack of similar posts and we need to step it up.
Maybe the only way for AI safety to work at all is to analyze potential vectors of AGI attack and try to counter them one way or another. It seems like an alternative that doesn't contradict other AI safety research, as it requires, I think, an entirely different set of skills.
I would like to see a more detailed post by "doomers" on how they perceive these vectors of attack, and some healthy discussion about them.
It seems to me that AGI is not born Godlike, but rather becomes Godlike (while still constrained by the physical world) over some time, and this process is very much possible to detect.
P.S. I really don't get how people who know (I hope) that the map is not the territory can think that an AI can just simulate everything and pick the best option. Maybe I'm the one missing something here?

Comment by Ozyrus on [Letter] Russians are Welcome in America · 2022-03-05T17:09:41.173Z · LW · GW

Thanks. That means a lot. Focusing on getting out right now.

Comment by Ozyrus on I currently translate AGI-related texts to Russian. Is that useful? · 2021-12-01T22:51:59.194Z · LW · GW

Please check your DMs; I've been translating as well. We can sync up!

Comment by Ozyrus on Memetic hazards of AGI architecture posts · 2021-10-16T18:54:52.819Z · LW · GW

I can't say I am one, but I am currently working on research and prototyping, and I will probably stick to that until I can prove some of my hypotheses, since I do have access to the tools I need at the moment.
Still, I didn't want this post to only have relevance to my case; as I stated, I don't think the probability of success is meaningful. But I am interested in the community's opinions on other similar cases.
edit: It's kinda hard to answer your comment since it keeps changing every time I refresh. By "can't say I am one" I mean the "world-class engineer" in the original comment. I do appreciate the change of tone in the final (?) version, though :)

Comment by Ozyrus on [deleted post] 2021-10-11T15:30:43.806Z

I could recommend Robert Miles' channel. While not a course per se, it gives good info on a lot of AI safety aspects, as far as I can tell.

Comment by Ozyrus on In search for plausible scenarios of AI takeover, or the Takeover Argument · 2021-09-29T17:48:14.085Z · LW · GW

Thanks for your work! I’ll be following it.

Comment by Ozyrus on AI takeoff story: a continuation of progress by other means · 2021-09-29T10:59:12.521Z · LW · GW

I really don't get how you can go from being online to having a ball of nanomachines, truly.
Imagine an AI goes rogue today. I can't imagine one plausible scenario where it could take out humanity without triggering any bells along the way, even if no one were paying attention to such things.
But we should pay attention to the bells, and for that we need to think of them. What might the signs look like?
I think it's really, really counterproductive not to take that into account at all and to think all is lost if it fooms. It's not lost.
It will need humans, infrastructure, and money (which is very controllable) to accomplish its goals. Governments already pay a lot of attention to adversaries who are trying to do similar things and counteract them semi-successfully. Any reason why they can't do the same to a very intelligent AI?
Mind you, if your answer is that it will simulate everything and just do what it takes, true-to-life simulations will take a lot of compute and time; those won't be available from the start.
We should stop thinking of rogue AI as God; that would only help it accomplish its goals.

Comment by Ozyrus on AI takeoff story: a continuation of progress by other means · 2021-09-29T10:44:15.997Z · LW · GW

I agree, since it's hard for me to imagine what step 2 could look like. Maybe you or anyone else has any content on that?
See this post -- it didn't seem to get a lot of traction or any meaningful answers, but I still think this question is worth answering.

Comment by Ozyrus on Any writeups on GPT agency? · 2021-09-29T10:02:49.265Z · LW · GW

Thanks!

Comment by Ozyrus on Any writeups on GPT agency? · 2021-09-29T10:02:31.023Z · LW · GW

Both are of interest to me.

Comment by Ozyrus on Any writeups on GPT agency? · 2021-09-29T10:02:11.470Z · LW · GW

Yep, but I was looking for anything else.

Comment by Ozyrus on Don't Sell Your Soul · 2021-04-07T10:42:56.046Z · LW · GW

Does that, in turn, mean that it's probably a good investment to buy souls for 10 bucks a pop (or even more)?

Comment by Ozyrus on Russian x-risks newsletter Summer 2020 · 2020-09-02T20:46:36.718Z · LW · GW

I know, I'm Russian as well. The concern is exactly because a Russian state-owned company plainly states, with that name, that they're developing AGI :p

Comment by Ozyrus on Russian x-risks newsletter Summer 2020 · 2020-09-02T12:06:54.379Z · LW · GW

Can you specify which AI company is searching for employees with a link?

Apparently, Sberbank (the biggest state-owned Russian bank) has a team literally called the AGI team, which is primarily focused on NLP tasks (they made the https://russiansuperglue.com/ benchmark), but still, the name concerns me greatly. You can't find a lot about it on the web, but if you follow up on some of the team members, it checks out.

Comment by Ozyrus on Open thread, Sep. 26 - Oct. 02, 2016 · 2016-09-26T23:25:21.591Z · LW · GW

I've been meditating lately on the possibility of an advanced artificial intelligence modifying its value function, and have even written some excerpts about this topic.

Is it theoretically possible? Has anyone of note written anything about this -- or anyone at all? This question is so, so interesting for me.

My thoughts led me to believe that it is for sure theoretically possible to modify it, but I could not come to any conclusion about whether it would want to do so. I seriously lack a good definition of a value function and an understanding of how it is enforced on the agent. I really want to tackle this problem from a human-centric point of view, but I don't really know if anthropomorphization will work here.

Comment by Ozyrus on Stupid Questions, 2nd half of December · 2015-12-23T16:04:20.294Z · LW · GW

Well, this is a stupid questions thread after all, so I might as well ask one that seems really stupid.

How can a person who promotes rationality be overweight? It's been bugging me for a while. Isn't it kinda the first thing you would want to apply your rationality to? If you have things to do that get you more utility, you can always pay a diet specialist and just stick to the diet, because it seems to me that additional years of life will bring you more utility than anything else you could spend that money on.

Comment by Ozyrus on Sensation & Perception · 2015-08-26T15:05:44.716Z · LW · GW

A good read, though I found it rather bland (in terms of writing style). I did not read the original article, but the compression seems OK. More would be appreciated.