roko

"You're correct—none of the studies cited have used a strict GAN-like architecture with a generator and discriminator trained simultaneously in lockstep to prevent AI deception.

Most existing AI safety research, such as Redwood Research or Anthropic’s work, employs sequential or iterative adversarial methods: typically, a model generates potentially deceptive outputs, these outputs are collected, and separately a discriminator or classifier is trained to catch deception. These processes are usually iterative but asynchronous, rather than the simultaneous, competitive co-training seen in GAN architectures.

Your original concept—training generator (task-performing AI) and discriminator (deception detector) networks simultaneously in a true GAN-style loop—does not appear explicitly explored in AI safety literature so far."

Comment by Roko on Roko's Shortform · 2025-03-16T00:56:32.564Z · LW · GW

ChatGPT Deep Research produced this:

https://chatgpt.com/share/67d62105-7c6c-8002-8bbb-74982455839b

Comment by Roko on Roko's Shortform · 2025-03-15T21:48:56.474Z · LW · GW

Preventing deceptive AI misalignment via something like a GAN architecture seems fruitful - you have a generator network that performs some task T, with a score function T().

You then create a dataset of special tasks within the scope of T which have a deceptive answer, and an honest answer which scores lower according to T(). You split this deceptive alignment dataset into a train set and a test set.

Then you train both the generator network and a separate discriminator network - the discriminator is trained to spot deception using the training set and generator is rewarded for maximizing the score T() but penalized when it is caught cheating by the discriminator on the test set. Somehow you prevent the discriminator from memorizing the test set - e.g. you stream in a few examples per iteration and you don't reuse them. The discriminator may have access to the weights of the generator, or to its chain of thought.

There are some potential extensions of this line of work, e.g. you can have a holdout set of deceptive alignment examples that neither the generator nor the discriminator have seen, and you can use these to track progress and to see how the process is evolves over time.

Has anyone actually tried this?

Comment by Roko on Turing-Test-Passing AI implies Aligned AI · 2025-01-13T18:56:33.239Z · LW · GW

You can't guarantee whether it will stop acting that way in the future, which is what is predicted by deceptive alignment.

yes, that's true. But in fact if your AI is merely supposed to imitate a human it will be much easier to prevent deceptive alignment because you can find the minimal model that mimics a human, and that minimality excludes exotic behaviors.

This is essentially why machine learning works at all - you don't pick a random model that fits your training data well, you pick the smallest one.

Comment by Roko on Turing-Test-Passing AI implies Aligned AI · 2025-01-02T14:10:49.446Z · LW · GW

If one king-person

yes. But this is a very unusual arrangement.

Comment by Roko on Turing-Test-Passing AI implies Aligned AI · 2025-01-02T01:19:05.192Z · LW · GW

that's true, however I don't think it's necessary that the person is good.

Comment by Roko on Turing-Test-Passing AI implies Aligned AI · 2025-01-02T01:17:53.433Z · LW · GW

asking why inner alignment is hard

I don't think "inner alignment" is applicable here.

If the clone behaves indistinguishably from the human it is based on, then there is simply nothing more to say. It doesn't matter what is going on inside.

Comment by Roko on Turing-Test-Passing AI implies Aligned AI · 2025-01-01T20:29:54.404Z · LW · GW

The most important thing here is that we can at least achieve an outcome with AI that is equal to the outcome we would get without AI, and as far as I know nobody has suggested a system that has that property.

The famous "list of lethalities" (https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities) piece would consider that a strong success.

Comment by Roko on Turing-Test-Passing AI implies Aligned AI · 2025-01-01T13:49:34.002Z · LW · GW

just because it's possible in theory doesn't mean we are anywhere close to doing it

that's a good point, but then you have to explain why it would be hard to make a functional digital copy of a human given that we can make AIs like ChatGPT-o1 that are at 99th percentile human performance on most short-term tasks. What is the blocker?

Of course this question can be settled empirically....

Comment by Roko on Turing-Test-Passing AI implies Aligned AI · 2025-01-01T08:35:42.361Z · LW · GW

All three of these are hard, and all three fail catastrophically.

I would be very surprised if all three of these are equally hard, and I suspect that (1) is the easiest and by a long shot.

Making a human imitator AI, once you already have weakly superhuman AI is a matter of cutting down capabilities and I suspect that it can be achieved by distillation, i.e. using the weakly superhuman AI that we will soon have to make a controlled synthetic dataset for pretraining and finetuning and then a much larger and more thorough RLHF dataset.

Finally you'd need to make sure the model didn't have too many parameters.

Comment by Roko on Turing-Test-Passing AI implies Aligned AI · 2025-01-01T00:04:52.430Z · LW · GW

Perhaps you could rephrase this post as an implication:

IF you can make a machine that constructs human-imitator-AI systems,

THEN AI alignment in the technical sense is mostly trivialized and you just have the usual political human-politics problems plus the problem of preventing anyone else from making superintelligent black box systems.

So, out of these three problems which is the hard one?

(1) Make a machine that constructs human-imitator-AI systems

(2) Solve usual political human-politics problems

(3) Prevent anyone else from making superintelligent black box systems

Comment by Roko on Turing-Test-Passing AI implies Aligned AI · 2024-12-31T22:28:55.225Z · LW · GW

a misaligned AI might be incentivized to behave identically to a helpful human until it can safely pursue it's true objective

It could, but some humans might also do that. Indeed, humans do that kind of thing all the time.

AIs might behave similar to humans in typical situations but diverge from human norms when they become superintelligent.

But they wouldn't 'become' superintelligent because there would be no extra training once the AI had finished training. And OOD inputs won't produce different outputs if the underlying function is the same. Given a complexity prior and enough data, ML algos will converge on the same function as the human brain uses.

The AIs might be perfect human substitutes individually but result in unexpected emergent behavior that can't be easily forseen in advance when acting as a group. To use an analogy, adding grains of sand to a pile one by one seems stable until the pile collapses in a mini-avalanche.

The behavior will follow the same probability distribution since the distribution of outputs for a given AI is the same as for the human it is a functional copy of. Think of a thousand piles of sand from the same well-mixed batch - each of them is slightly different, but any one pile falls within the distribution.

Comment by Roko on Turing-Test-Passing AI implies Aligned AI · 2024-12-31T22:22:45.220Z · LW · GW

“the true Turing test is whether the AI kills us after we give it the chance, because this distinguishes it from a human”.

no, because a human might also kill you when you give them the chance. To pass the strong-form Turing Test it would have to make the same decision (probabilistically: have the same probability of doing it)

Of what use is this concept?

It is useful because we know what kind of outcomes happen when we put millions of humans together via human history, so "whether an AI will emulate human behavior under all circumstances" is useful.

Comment by Roko on Turing-Test-Passing AI implies Aligned AI · 2024-12-31T22:19:51.299Z · LW · GW

playing word games on the "Turing test" concept does not meaningfully add

It's not a word-game, it's a theorem based on a set of assumptions.

There is still the in-practice question of how you construct a functional digital copy of a human. But imagine trying to write a book about mechanics using the term "center of mass" and having people object to you because "the real center of mass doesn't exist until you tell me how to measure it exactly for the specific pile of materials I have right here!"

You have to have the concept.

Comment by Roko on Turing-Test-Passing AI implies Aligned AI · 2024-12-31T22:12:30.373Z · LW · GW

The whole point of a "test" is that it's something you do before it matters.

No, this is not something you 'do'. It's a purely mathematical criterion, like 'the center of mass of a building' or 'Planck's constant'.

A given AI either does or does not possess the quality of statistically passing for a particular human. If it doesn't under one circumstance, then it doesn't satisfy that criterion.

Comment by Roko on Turing-Test-Passing AI implies Aligned AI · 2024-12-31T21:28:01.716Z · LW · GW

If an AI cannot act the same way as a human under all circumstances (including when you're not looking, when it would benefit it, whatever), then it has failed the Turing Test.

Comment by Roko on Turing-Test-Passing AI implies Aligned AI · 2024-12-31T21:16:20.151Z · LW · GW

that does not mean it will continue to act indistuishable from a human when you are not looking

Then it failed the Turing Test because you successfully distinguished it from a human.

So, you must believe that it is impossible to make an AI that passes the Turing Test. I think this is wrong, but it is a consistent position.

Perhaps a strengthening of this position is that such Turing-Test-Passing AIs exist, but no technique we currently have or ever will have can actually produce them. I think this is wrong but it is a bit harder to show that.

Comment by Roko on Turing-Test-Passing AI implies Aligned AI · 2024-12-31T20:47:04.909Z · LW · GW

This is irrelevant, all that matters is that the AI is a sufficiently close replica of a human. If the human would "act the way the administrators of the test want", then the AI should do that. If not, then it should not.

If it fails to do the same thing that the human that it is supposed to be a copy of would do, then it has failed the Turing Test in this strong form.

For reasons laid out in the post, I think it is very unlikely that all possible AIs would fail to act the same way as the human (which of course may be to "act the way the administrators of the test want", or not, depending on who the human is and what their motivations are).

Comment by Roko on The Dissolution of AI Safety · 2024-12-18T05:20:06.252Z · LW · GW

How can we solve that coordination problem? I have yet to hear a workable idea.

This is my next project!

Comment by Roko on The Dissolution of AI Safety · 2024-12-18T05:19:33.416Z · LW · GW

some guy who was recently hyped about asking o1 for the solution to quantum gravity - it gave the user some gibberish

yes, but this is pretty typical for what a human would generate.

Comment by Roko on Is AI alignment a purely functional property? · 2024-12-18T05:18:06.597Z · LW · GW

There are plenty of systems where we rationally form beliefs about likely outputs from a system without a full understanding of how it works. Weather prediction is an example.

Comment by Roko on Is AI alignment a purely functional property? · 2024-12-16T01:25:20.463Z · LW · GW

I should have been clear: "doing things" is a form of input/output since the AI must output some tokens or other signals to get anything done

Comment by Roko on What is MIRI currently doing? · 2024-12-14T20:40:54.192Z · LW · GW

If you look at the answers there is an entire "hidden" section of the MIRI website doing technical governance!

Comment by Roko on What is MIRI currently doing? · 2024-12-14T09:20:18.798Z · LW · GW

Why is this work hidden from the main MIRI website?

Comment by Roko on What is MIRI currently doing? · 2024-12-14T09:02:05.665Z · LW · GW

nice!

Comment by Roko on What is MIRI currently doing? · 2024-12-14T05:05:22.021Z · LW · GW

"Our objective is to convince major powers to shut down the development of frontier AI systems worldwide"

This?

Comment by Roko on What is MIRI currently doing? · 2024-12-14T03:57:00.764Z · LW · GW

Who works on this?

Comment by Roko on The Dissolution of AI Safety · 2024-12-14T00:30:39.196Z · LW · GW

Re: (2) it will only impact output on the current generated output, once the output is over all that stuff will be reset and the only thing that remains is the model weights which were set in stone at train time.

re: (1) "a LLM might produce text for reasons that don't generalize like a sincere human answer would" it seems that current LLM systems are pretty good at generalizing like a human would and in some ways they are better due to being more honest, easier to monitor, etc

Comment by Roko on The Dissolution of AI Safety · 2024-12-13T19:56:49.435Z · LW · GW

But do you really think we're going to stop with tool AI, and not turn them into agents?

But if it is the case that agentic AI is an existential risk then if actors could choose not to develop it, which is a coordination problem not an alignment problem.

We already have aligned AGI, we can coordinate to not build misaligned AGI.

Comment by Roko on The Dissolution of AI Safety · 2024-12-13T03:20:25.003Z · LW · GW

ok but as a matter of terminology, is a "Satan reverser" misaligned because it contains a Satan?

Comment by Roko on The Dissolution of AI Safety · 2024-12-13T02:58:28.373Z · LW · GW

OK, imagine that I make an AI that works like this: a copy of Satan is instantiated and his preferences are extracted in percentiles, then sentences from Satan's 2nd-5th percentile of outputs are randomly sampled. Then that copy of Satan is destroyed.

Is the "Satan Reverser" AI misaligned?

Is it "inner misaligned"?

Comment by Roko on The Dissolution of AI Safety · 2024-12-13T02:11:57.381Z · LW · GW

So your definition of "aligned" would depend on the internals of a model, even if its measurable external behavior is always compliant and it has no memory/gets wiped after every inference?

Comment by Roko on The Dissolution of AI Safety · 2024-12-13T02:06:01.151Z · LW · GW

Further on the tech tree, alignment tax can end up motivating systematic uses that make LLMs a source of danger.

Sure, but you can say the same about humans. Enron was a thing. Obeying the law is not as profitable as disobeying it.

Comment by Roko on The Dissolution of AI Safety · 2024-12-13T01:43:34.580Z · LW · GW

maybe you should swap "understand ethics" for something like "follow ethics"/"display ethical behavior"

What is the difference between these two? This sounds like a distinction without a difference

Comment by Roko on The Dissolution of AI Safety · 2024-12-13T01:42:24.676Z · LW · GW

Any argument which features a "by definition"

What is your definition of "Aligned" for an LLM with no attached memory then?

Wouldn't it have to be

"The LLM outputs text which is compliant with the creator's ethical standards and intentions"?

Comment by Roko on The Dissolution of AI Safety · 2024-12-13T01:29:56.232Z · LW · GW

To add: I didn't expect this to be controversial but it is currently on -12 agreement karma!

Comment by Roko on The Dissolution of AI Safety · 2024-12-13T01:27:58.352Z · LW · GW

LLMs have plenty of internal state, the fact that it's usually thrown away is a contingent fact about how LLMs are currently used

yes, but then your "Aligned AI based on LLMs" is just a normal LLM used in the way it is currently used.

Relevant aspects of observable behavior screen off internal state that produced it.

Yes this is a good way of putting it.

Comment by Roko on The Dissolution of AI Safety · 2024-12-12T21:52:25.070Z · LW · GW

equivalence between LLMs understanding ethics and caring about ethics

I think you don't understand what an LLM is. When the LLM produces a text output like "Dogs are cute", it doesn't have some persistent hidden internal state that can decide that dogs are actually not cute but it should temporarily lie and say that they are cute.

The LLM is just a memoryless machine that produces text. If it says "dogs are cute" and that's the end of the output, then that's all there is to it. Nothing is saved, the weights are fixed at training time and not updated at inference time and the neuron activations are thrown away at the of the inference computation.

If you can get (using RLHF) an LLM to output text that consistently reflects human value judgements, then it is by definition "aligned". It really cares, in the only way it is possible for a text generator to care.

Comment by Roko on If far-UV is so great, why isn't it everywhere? · 2024-10-20T23:45:39.807Z · LW · GW

Yes, certain places like preschools might benefit even from an isolated install.

But that is kind of exceptional.

The world isn't an efficient market, especially because people are kind of set in their ways and like to stick to the defaults unless there is strong social pressure to change.

Comment by Roko on If far-UV is so great, why isn't it everywhere? · 2024-10-20T21:59:57.221Z · LW · GW

Far-UVC probably would have a large effect if a particular city or country installed it.

But if only a few buildings install it, then it has no effect because people just catch the bugs elsewhere.

Imagine the effect of just treating sewage from one house, and leaving all the untreated sewage from a million houses untreated in the river. There would be essentially no effect.

Comment by Roko on What actual bad outcome has "ethics-based" RLHF AI Alignment already prevented? · 2024-10-20T16:36:27.334Z · LW · GW

ok so from the looks of that it basically just went along with a fantasy he already had. But this is an interesting case and an example of the kind of thing I am looking for.

Comment by Roko on What actual bad outcome has "ethics-based" RLHF AI Alignment already prevented? · 2024-10-20T16:34:53.710Z · LW · GW

ok, but this is sort of circular reasoning because the only reason people freaked out is that they were worried about AI risk.

I am asking for a concrete bad outcome in the real world caused by a lack of RLHF-based ethics alignment, which isn't just people getting worried about AI risk.

Comment by Roko on What actual bad outcome has "ethics-based" RLHF AI Alignment already prevented? · 2024-10-20T16:33:29.330Z · LW · GW

alignment has always been about doing what the user/operator wants

Well it has often been about not doing what the user wants, actually.

Comment by Roko on The ELYSIUM Proposal - Extrapolated voLitions Yielding Separate Individualized Utopias for Mankind · 2024-10-19T22:25:22.689Z · LW · GW

giving each individual influence over the adoption (by any clever AI) of those preferences that refer to her.

Influence over preferences of a single entity is much more conflict-y.

Comment by Roko on The ELYSIUM Proposal - Extrapolated voLitions Yielding Separate Individualized Utopias for Mankind · 2024-10-19T22:24:25.837Z · LW · GW

Trying to give everyone overlapping control over everything that they care about in such spaces introduces contradictions.

The point of ELYSIUM is that people get control over non-overlapping places. There are some difficulties where people have preferences over the whole universe. But the real world shows us that those are a smaller thing than the direct, local preference to have your own volcano lair all to yourself.

Comment by Roko on The ELYSIUM Proposal - Extrapolated voLitions Yielding Separate Individualized Utopias for Mankind · 2024-10-19T20:17:30.238Z · LW · GW

catgirls are consensually participating in a universe that is not optimal for them because they are stuck in the harem of a loser nerd with no other males and no other purpose in life other than being a concubine to Reedspacer

And, the problem with saying "OK let's just ban the creation of catgirls" is that then maybe Reedspacer builds a volcano lair just for himself and plays video games in it, and the catgirls whose existence you prevented are going to scream bloody murder because you took away from them a very good existence that they would have enjoyed and also made Reedsapcer sad.

Comment by Roko on The ELYSIUM Proposal - Extrapolated voLitions Yielding Separate Individualized Utopias for Mankind · 2024-10-19T20:01:56.152Z · LW · GW

The question of what BPA wants to do to Steve, seems to me to be far more important for Steve's safety, than the question of what set of rules will constrain the actions of BPA.

BPA shouldn't be allowed to want anything for Steve. There shouldn't be a term in its world-model for Steve. This is the goal of cosmic blocking. The BPA can't even know that Steve exists.

I think the difficult part is when BPA looks at Bob's preferences (excluding, of course, references to most specific people) and sees preferences for inflicting harm on people-in-general that can be bent just enough to fit into the "not-torture" bucket, and so it synthetically generates some new people and starts inflicting some kind of marginal harm on them.

And I think that this will in fact be a binding constraint on utopia, because most humans will (given the resources) want to make a personal utopia full of other humans that forms a status hierarchy with them at the top. And 'being forced to participate in a status hierarchy that you are not at the top of' is a type of 'generalized consensual harm'.

Even the good old Reedspacer's Lower Bound fits this model. Reedspacer wants a volcano lair full of catgirls, but the catgirls are consensually participating in a universe that is not optimal for them because they are stuck in the harem of a loser nerd with no other males and no other purpose in life other than being a concubine to Reedspacer. Arguably, that is a form of consensual harm to the catgirls.

So I don't think there is a neat boundary here. The neatest boundary is informed consent, perhaps backed up by some lower-level tests about what proportion of an entity's existence is actually miserable.

If Reedspacer beats his catgirls, makes them feel sad all the time, that matters. But maybe if one of them feels a little bit sad for a short moment that is acceptable.

Comment by Roko on The ELYSIUM Proposal - Extrapolated voLitions Yielding Separate Individualized Utopias for Mankind · 2024-10-19T19:29:06.106Z · LW · GW

Steve will never become aware of what Bob is doing to OldSteve

But how would Bob know that he wanted to create OldSteve, if Steve has been deleted from his memory via a cosmic block?

I suppose perhaps Bob could create OldEve. Eve is in a similar but not identical point in personality space to Steve and the desire to harm people who are like Eve is really the same desire as the desire to harm people like Steve. So Bob's Extrapolated Volition could create OldEve, who somehow consents to being mistreated in a way that doesn't trigger your torture detection test.

This kind of 'marginal case of consensual torture' has popped up in other similar discussions. E.g. In Yvain's (Scott Alexander's) article on Archipelago there's this section:

"""A child who is abused may be too young to know that escape is an option, or may be brainwashed into thinking they are evil, or guilted into believing they are betraying their families to opt out. And although there is no perfect, elegant solution here, the practical solution is that UniGov enforces some pretty strict laws on child-rearing, and every child, no matter what other education they receive, also has to receive a class taught by a UniGov representative in which they learn about the other communities in the Archipelago, receive a basic non-brainwashed view of the world, and are given directions to their nearest UniGov representative who they can give their opt-out request to"""

So Scott Alexander's solution to OldSteve is that OldSteve must get a non-brainwashed education about how ELYSIUM/Archipelago works and be given the option to opt out.

I think the issue here is that "people who unwisely consent to torture even after being told about it" and "people who are willing and consenting submissives" is not actually a hard boundary.

Comment by Roko on The ELYSIUM Proposal - Extrapolated voLitions Yielding Separate Individualized Utopias for Mankind · 2024-10-19T19:10:48.088Z · LW · GW

a 55 percent majority (that does not have a lot of resource needs) burning 90 percent of all resources in ELYSIUM to fully disenfranchise everyone else. And then using the remaining resources to hurt the minority.

If there is an agent that controls 55% of the resources in the universe and are prepared to use 90% of that 55% to kill/destroy everyone else, then assuming that ELYSIUM forbids them to do that, their rational move is to use their resources to prevent ELYSIUM from being built.

And since they control 55% of the resources in the universe and are prepared to use 90% of that 55% to kill/destroy everyone who was trying to actually create ELYSIUM, they would likely succeed and ELYSIUM wouldn't happen.

Re:threats, see my other comment.

Comment by Roko on The ELYSIUM Proposal - Extrapolated voLitions Yielding Separate Individualized Utopias for Mankind · 2024-10-19T18:55:05.364Z · LW · GW

Especially if they like the idea of killing someone for refusing to modify the way that she lives her life. They can do this with person after person, until they have run into 9 people that prefers death to compliance. Doing this costs them basically nothing.

This assumes that threats are allowed. If you allow threats within your system you are losing out on most of the value of trying to create an artificial utopia because you will recreate most of the bad dynamics of real history which ultimately revolve around threats of force in order to acquire resources. So, the ability to prevent entities from issuing threats that they then do not follow through on is crucial.

Improving the equilibria of a game is often about removing strategic options; in this case the goal is to remove the option of running what is essentially organized crime.

In the real world there are various mechanisms that prevent organized crime and protection rackets. If you threaten to use force on someone in exchange for resources, the mere threat of force is itself illegal at least within most countries and is punished by a loss of resources far greater than the threat could win.

People can still engage in various forms of protest that are mutually destructive of resources (AKA civil disobedience).

The ability to have civil disobedience without protection rackets does seem kind of crucial.

User info

Posts

Comments