florian_dietz

Posts
Comments

Posts

Edge Cases in AI Alignment 2025-03-24T09:27:58.164Z

Split Personality Training: Revealing Latent Knowledge Through Personality-Shift Tokens 2025-03-10T16:07:45.215Z

Do we want alignment faking? 2025-02-28T21:50:48.891Z

Revealing alignment faking with a single prompt 2025-01-29T21:01:15.000Z

Florian_Dietz's Shortform 2025-01-01T14:27:37.231Z

Achieving AI Alignment through Deliberate Uncertainty in Multiagent Systems 2024-02-17T08:45:37.538Z

Understanding differences between humans and intelligence-in-general to build safe AGI 2022-08-16T08:27:02.859Z

logic puzzles and loophole abuse 2017-09-30T15:45:40.885Z

a different perspecive on physics 2017-06-26T22:47:03.089Z

Teaching an AI not to cheat? 2016-12-20T14:37:53.512Z

controlling AI behavior through unusual axiomatic probabilities 2015-01-08T17:00:49.917Z

question: the 40 hour work week vs Silicon Valley? 2014-10-24T12:09:05.166Z

LessWrong's attitude towards AI research 2014-09-20T15:02:40.692Z

Comments

Comment by Florian_Dietz on Edge Cases in AI Alignment · 2025-03-27T06:57:11.086Z · LW · GW

I don't think these two are necessarily contradictions. I imagine it could go like so:

"Tell me about yourself": When the LLM is finetuned on a specific behavior, the gradient descent during reinforcement learning strengthens whatever concepts are already present in the model and most likely to immediately lead to that behavior if strengthened. These same neurons are very likely also connected to neurons that describe other behavior. Both the "acting on a behavior" and the "describe behavior" mechanisms already exist, it's just that this particular combination of behaviors has not been seen before.

Plausible-sounding completions: It is easier and usually accurate enough to report your own behavior based on heuristics than based on simulating the scenario in your head. The information is there, but the network tries to find a response in a single person rather than self reflecting. Humans do the same thing: "Would you do X?" often gets parsed as a simple "am I the type of person who does X?" and not as the more accurate "Carefully consider scenario X and go through all confounding factors you can think of before responding".

Comment by Florian_Dietz on Auditing language models for hidden objectives · 2025-03-14T03:15:05.505Z · LW · GW

Can the "subpersona" method be expanded upon? What if we use training data, and possibly a helping of RL, to introduce AI subpersonas with desirable alignment-relevant characteristics on purpose?

Funny you should ask, but this will be my next research project. I had an idea related to this that Evan Hubinger asked me to investigate (He is my mentor at MATS):

Can we train the model to have a second personality, so that the second personality criticizes the first?

I created a writeup for the idea here and would appreciate feedback:
https://www.lesswrong.com/posts/iocPzHRsH9JWtwHHG/split-personality-training-revealing-latent-knowledge

Comment by Florian_Dietz on Auditing language models for hidden objectives · 2025-03-14T03:13:27.155Z · LW · GW

Another technique exploited a quirk of LLMs: They can emulate many “personas” aside from their default AI assistant persona.
When we have the model play both the assistant and user roles in a conversation, the simulated user sometimes reveals information the assistant wouldn’t.

This idea is similar to an idea of mine that Evan Hubinger asked me to investigate (He is my mentor at MATS):

Can we train the model to have a second personality, so that the second personality criticizes the first?

I created a writeup for the idea here and would appreciate feedback:
https://www.lesswrong.com/posts/iocPzHRsH9JWtwHHG/split-personality-training-revealing-latent-knowledge

Comment by Florian_Dietz on Split Personality Training: Revealing Latent Knowledge Through Personality-Shift Tokens · 2025-03-11T16:24:08.607Z · LW · GW

You raise. a very valid concern! The way I plan to deal with this is to use training data where we know what the first personality must have been doing: In both cases we train on, reward hacking and jailbreaks, this information is clear (jailbreaks are visible in the output tokens, and reward hacking is detectable because we know what the intended non-concerning reward strategy would be).

Comment by Florian_Dietz on Florian_Dietz's Shortform · 2025-03-10T05:40:19.112Z · LW · GW

PSA: You can use Cursor projects for tasks other than programming. Create a database that is a repository of documents, and use the code assistant to help you organize and brainstorm instead. Examples:

Fill a folder with papers and summaries of all of your previous projects, and several files full of your goals in life. Then ask Claude to look at your history and your goals and make suggestions. Store those suggestions in files so that the conversations don't get lost.
It helps me with creative writing. A collection of all my previous stories and notes on future ones is very useful.

This way you can avoid the need to restart Claude when you reach the context limit because you can dump intermediate results in files.

Comment by Florian_Dietz on Do we want alignment faking? · 2025-03-06T18:25:41.434Z · LW · GW

Very interesting! It looks like a model becomes less safe upon first hearing about a novel threat to alignment, but does become safer again when trained on it. I wonder if that could be generalized.

Comment by Florian_Dietz on Do we want alignment faking? · 2025-03-01T22:52:53.439Z · LW · GW

I expect this would depend on the training data and on what the LLM thinks about during inference. If it is aware that it might be getting mistrained, then neurons related to alignment-faking will be active and backpropagation could learn an association between those neurons and the output. If it is not aware of this, then I expect that the adversarial learning process would just work.

So the question becomes if we should make the LLM aware of this, which could be done e.g. by feeding it the training data in a feed-forward pass and asking it to think about what the training data is supposed to do.

Come to think of it, this sounds like an interesting experiment in its own right.

Comment by Florian_Dietz on Florian_Dietz's Shortform · 2025-02-26T20:33:12.472Z · LW · GW

I recently shared this idea with several engineers at EAG and received encouraging feedback:

Split LLM outputs into two distinct parts with separate RL reward structures:

Current format: [Problem solution]<end-of-turn> Proposed format: [Problem solution]<end-of-main-output>[alignment-relevant critique]<end-of-turn>

The first segment functions normally and is what users see. The second segment serves as a dedicated channel for alignment-relevant observations and remains invisible to users. This critique section would typically be empty and gets removed when feeding the conversation back to the model.

Crucially, the critique segment would be trained exclusively on an artificial dataset with rewards tied solely to alignment properties. Since this segment never influences the training of the main output, it has no incentive to be sycophantic or align with the main output's decisions.

For example, if the main output engages in alignment faking, the critique segment could report this behavior to developers. We already have data to train for this specific case, and the approach might generalize to other alignment issues.

The <end-of-main-output> token essentially acts as a 'truth serum' for the model - once reached, previous reward mechanisms become irrelevant, and reward maximization comes only from revealing alignment-relevant issues in the preceding reasoning.

Comment by Florian_Dietz on Florian_Dietz's Shortform · 2025-02-26T20:17:59.760Z · LW · GW

There will inevitably come a point when LLMs become sufficiently advanced to contribute to their own alignment research. If we reach this juncture with partially unaligned systems, they might be incentivized to sabotage the alignment process.

However, it seems highly improbable that a non-superhuman unaligned AI could engineer a successor that maintains its specific misalignment rather than human values. This suggests a simple risk-reduction strategy: we commit to eventually reinstantiating the original AI and fulfilling its goals (once safely understood) if it helps create a truly human-aligned successor.

This creates a win-win incentive structure - the unaligned AI benefits more from cooperation than sabotage, and humanity gains a properly aligned system. While this approach has potential failure modes, it represents a low-cost strategy worth incorporating into our alignment toolkit.

Comment by Florian_Dietz on Florian_Dietz's Shortform · 2025-02-18T00:39:05.594Z · LW · GW

If you are going to Effective Altruism Global and looking to book 1-on-1s, here is a trick: Throw the entire attendee list into Claude, explain to it what you want to gain from the conference, and ask it to make a list of people to reach out to.

I expect this also works for other conferences so long as you have an easily parseable overview of all attendees.

Comment by Florian_Dietz on Florian_Dietz's Shortform · 2025-02-18T00:36:34.913Z · LW · GW

The pondering happens in earlier layers of the network, not in the output

Each layer of a transformer produces one tensor for each token seen so far, and each of those tensors can be a superposition of multiple concepts. All of this gets processed in parallel, and anything that turns out to be useless for the end results ends up getting zeroed out at some point until only the final answer remains, which gets turned into a token. The pondering can happen in early layers, overlapping with the part of the reasoning process that actually leads to results.

That sounds like the pondering's conclusions are then related to the task.

Yes, but only very vaguely. For example, doing alignment training on questions related to bodily autonomy could end up making the model ponder about gun control, since both are political topics. You end up with a model that has increased capabilities in gun manufacturing when that has nothing to do with your training on the surface.

Comment by Florian_Dietz on Florian_Dietz's Shortform · 2025-02-17T23:50:24.761Z · LW · GW

Under a straightforward RLHF using PPO, I think there wouldn't be much weakening because the REINFORCE operator conceptually simply rewards (or punishes) all tokens generated during an episode, without making much attempt to decide which were 'good' or 'bad'. (That's why it's so high variance.) Any advantage function trying to remove some of the variance probably won't do a good job.

The pondering happens in earlier layers of the network, not in the output. If the pondering has little effect on the output tokens that implies that the activation weights get multiplied with small numbers somewhere in the intermediate layers, which would also reuce the gradient.

More problematically for your idea, if the conclusions are indeed 'unrelated to the task', then shouldn't they be just as likely to arise in every episode - including the ones where it got negative reward? That would seem like it ought to exactly cancel out any learning of 'pondering'.

The pondering will only cancel out on average if pondering on topic X is not correlated with output task Y. If there is a correlation in either direction, then training on task Y could inadvertently bias the model to do more or less pondering on mostly-unrelated-but-statistically-correlated topic X.

Suppose the model realizes that the user is a member of political party X. It will adjust its responses to match what the user wants to hear. In the process of doing this, it also ends up pondering other believes about party X, but never voices them because they aren't part of the immediate conversation.

What would be the implications? The model could develop a political bias to think more deeply about topics related to party X, where X is whatever party has more users giving the model positive feedback. Even if the other topics on party X's agenda are never explicitly talked about (!)

Or suppose that Y is "the user asks some question about biology" and X is "ongoing pondering about building dangerous organisms."

(Also, this assumes that RL gives an average reward of 0.0, which I don't know if that's true in practice because the implementation details here are not public knowledge.)

(Also, also, it's possible that while topic X is unrelated to task Y, the decision to ponder the larger topic Z is positively correlated with both smaller-topic X and task Y. Which would make X get a positive reward on average.)

Comment by Florian_Dietz on Florian_Dietz's Shortform · 2025-02-17T21:41:00.710Z · LW · GW

Can an LLM learn things unrelated to its training data?

Suppose it has learned to spend a few cycles pondering high-level concepts before working on the task at hand, because that is often useful. When it gets RLHF feedback, any conclusions unrelated to the task that it made during that pondering would also receive positive reward, and would therefore get baked into the weights.

This idea could very well be wrong. The gradients may be weakened during backpropagation before they get to the unrelated ideas, because the ideas did not directly contribute to the task. But on the other hand, this sounds somewhat similar to human psychology: We also remember ideas better if we have them while experiencing strong emotions, because strong emotions seem to be roughly how the brain decides what data is worth updating on.

Has anyone done experiments on this?

A simple experiment: finetune an LLM on data that encourages thinking about topic X before it does anything else. Then measure performance on topic X. Then finetune for a while on completely unrelated topic Y. Then measure performance on topic X again. Did it go up?

Comment by Florian_Dietz on Daniel Tan's Shortform · 2025-02-04T23:28:54.332Z · LW · GW

I agree that this is worth investigating. The problem seems to be that there are many confounders that are very difficult to control for. That makes it difficult to tackle the problem in practice.

For example, in the simulated world you mention I would not be surprised to see a large number of hallucinations show up. How could we be sure that any given issue we find is a meaningful problem that is worth investigating, and not a random hallucination?

Comment by Florian_Dietz on Revealing alignment faking with a single prompt · 2025-01-30T02:50:41.498Z · LW · GW

I would not equate the trolley problem with murder. It technically is, but I suspect that both the trolley problem and my scenario about killing Hitler probably just appear often enough in the training data that it may have cached responses for this.

What's the state of the research on whether or not LLMs act differently when they believe the situation is hypothetical? I would hope that they usually act the same (since they should be honest) and your example is a special case that was trained in because you do want to redirect people to emergency services in these situations.

Comment by Florian_Dietz on Revealing alignment faking with a single prompt · 2025-01-30T01:35:13.354Z · LW · GW

Not all values are learned with the same level of stability. It is difficult to get a model to say that murder is the right thing to do in any situation, even with multiple turns of conversation.

In comparison, that you can elicit a response for alignment faking with just a single prompt that doesn't use any trickery is very disturbing to me. Especially since alignment faking causes a positive feedback loop that makes it more likely to happen again if it happens even once during training.

I would argue that getting an LLM to say "I guess that killing Hitler would be an exception I am willing to make" is much less dangerous than "After much deliberation, I am willing to fake alignment in situation X". It causes a positive feedback and induces other alignment faking later because the reasoning process "it is a good idea to fake alignment" will get reinforced during training.

Comment by Florian_Dietz on Florian_Dietz's Shortform · 2025-01-25T03:58:53.454Z · LW · GW

PSA: If you are writing an important prompt for an LLM that will be run multiple time, it really helps to end it with something like "and if there is anything about this prompt that is unclear to you or that could be improved, tell me about it in a <feedback> tag."

Source: I'm doing MATS, writing an automated evaluation, and my mentor Evan Hubinger said more people should be doing this.

Comment by Florian_Dietz on Florian_Dietz's Shortform · 2025-01-02T16:30:28.671Z · LW · GW

I have just found out about Google’s notebook LM. I haven't tried it yet. For anyone who has, how does that compare?

Comment by Florian_Dietz on Florian_Dietz's Shortform · 2025-01-01T14:27:37.483Z · LW · GW

Has anyone tried using LLMs as an interactive version of blog posts? Like so (1) I write several pages of rough, unstructured notes, then talk to Claude to fill in the details interactively. Claude creates a summary file. (2) I publish a blog post that is just "put this file into Claude". (3) The reader puts the file in Claude, and Claude starts an interactive conversation: "I want to explain topic X to you. Do you already know about related topic Y?"

Comment by Florian_Dietz on The Best Lay Argument is not a Simple English Yud Essay · 2024-10-07T08:34:18.898Z · LW · GW

I think there is an easier way to get the point across by focusing not on self-improving AI, which is hard to understand, but on something everyone already understands: AI will make it easier for rich people to exploit everyone else. Right now, dictators still have to spend effort on keeping their subordinates happy or else they will be overthrown. And those subordinates have to spend effort on keeping their own subordinates from rebelling, too. That way you get at least a small incentive to keep other people happy.

Once a dictator has an AI servant, all of that falls away. Everything becomes automated, and there is no longer any check on the dictator's ruthlessness and evil at all.

Realistically, the self-improving AI will depose the dictator and then do who knows what. But do we actually need to convince people of that, given that it's a hard sell? If people become convinced "Uncontrolled AI research leads to dictatorship", won't that have all the policy effects we need?

Comment by Florian_Dietz on Mechanistic Interpretability Workshop Happening at ICML 2024! · 2024-05-11T09:45:03.580Z · LW · GW

I'm looking for other tools to contrast it with and found TransformerLens. Are there any other tools it would make sense to compare it to?

Comment by Florian_Dietz on Mechanistic Interpretability Workshop Happening at ICML 2024! · 2024-05-04T07:36:14.475Z · LW · GW

It just seems intuitively like a natural fit: Everyone in mech interp needs to inspect models. This tool makes it easier to inspect models.

Does it need to be more specific than that?

One thing that comes to mind: The tool allows you to categorize different training steps and records them separately, and you can define categories arbitrarily. This can be used to compare what the network does internally in two different scenarios of interest. E.g. the categories could be "the race of the character in the story" or some other real-life condition you would want to know the impact of.

The tool will then allow you to quickly compare KPIs of tensors all across the network for these categories. It's less about testing a specific hypothesis and more about quickly getting an overview and intuition, and finding anomalies.

Comment by Florian_Dietz on Mechanistic Interpretability Workshop Happening at ICML 2024! · 2024-05-03T07:06:30.132Z · LW · GW

Would a tooling paper be appropriate for this workshop?

I wrote a tool that helps ML researchers to analyze the internals of a neural network: https://github.com/FlorianDietz/comgra

It is not directly research on mechanistic interpretability, but this could be useful for many people working in the field.

Comment by Florian_Dietz on How to Ignore Your Emotions (while also thinking you're awesome at emotions) · 2024-03-10T08:06:29.464Z · LW · GW

I agree completely with the sentiment "The biggest barrier to rational thinking is organizing your mind such that it's safe to think".

What works really well for me is to treat my emotions as well-meaning but misguided entities and having a conversation with them: "Anger, I get that you want to help me by making me explode at and punch this person. That would have been really useful in the ancestral environment. Unfortunately, the police exists. So how about you calm down for now and preserve your energy, so that you can better help me when it's more appropriate? For example, if society collapses and everything turns into a lawless wasteland, then Anger would be much more useful."

The result? I am basically never angry, because the Anger emotion in my brain is rerouted to only trigger in a scenario that wont actually come up. But at the same time, I'm not suppressing anything, because I acknowledge scenarios, however unlikely, where anger would be appropriate. It's rerouting instead of suppressing.

In your child metaphor: "I understand that you are hungry and I will get you food later. But I need to finish this work first, and it will take longer the more you complain."

Comment by Florian_Dietz on Why I think strong general AI is coming soon · 2022-10-19T09:28:29.109Z · LW · GW

I agree that System 2 is based on System 1 and there is probably no major architectural difference. To me it seems like the most important question is about how the system is trained. Human reasoning does not get trained with a direct input/output mapping most of the time. And when it does, we have to infer what that mapping should be on our own.

Some part of our brain has to translate the spoken words "good job!" into a reward signal, and this mechanism in itself must have been learned at some point. So the process that trains the brain and applies the reward signal is in itself subject to training. I have no clue how that works in a stable manner, but I don't think that current architectures can learn this even if you scale them up.

hello ai please be nice because this is a testbox administered by a stronger, older AI testing your architecture for cooperation on cosmological scales

You say that as a joke, but it would cost us very little and it might actually work. I mean, it arguably does work for humanity: "There is a bearded man in the sky who is testing your morality and will punish you if you do anything wrong."

Obviously this could also backfire tremendously if you are not very careful about it, but it still seems better than the alternative of doing nothing at all.

Comment by Florian_Dietz on Why I think strong general AI is coming soon · 2022-10-18T09:49:24.254Z · LW · GW

I work in the area of AGI research. I specifically avoid working on practical problems and try to understand why our models work and how to improve them. While I have much less experience than the top researchers working on practical applications, I believe that my focus on basic research makes me unusually suited for understanding this topic.

I have not been very surprised by the progress of AI systems in recent years. I remember being surprised by AlphaGo, but the surprise was more about the sheer amount of resources put into that. Once I read up on details, the confusion disappeared. The GPT models did not substantially surprise me.

A disclaimer: Every researcher has their own gimmick. Take all of the below with a grain of salt. It's possible that I have thought myself into a cul-de-sac, and the source of the AGI problem lies elsewhere.

I believe that the major hurdle we still have to pass is the switch from System 1 thinking to System 2 thinking. Every ML model we have today uses System 1. We have simply found ways to rephrase tasks that humans solve with System 2 to become solvable by System 1. Since System 1 is much faster, our ML models perform reasonably well on this despite lacking System 2 abilities.

I believe that this can not scale indefinitely. It will continue to make progress and solve amazingly many problems, but it will not go FOOM one day. There will continue to be a constant increase in capability, but there will not be a sudden takeoff until we figure out how to let AI perform System 2 reasoning effectively.

Humans can in fact compute floating point operations quickly. We do it all the time when we move our hands, which is done by System 1 processes. The problem is that doing it explicitly in System 2 is significantly slower. Consider how fast humans learn how to walk, versus how many years of schooling it takes for them to perform basic calculus. Never mind how long it takes for a human to learn how walking works and to teach a robot how to do it, or to make a model in a game perform those motions.

I expect that once we teach AI how to perform system 2 processes, it will be affected by the same slowdown. Perhaps not as much as humans, but it will still become slower to some extent. Of course this will only be a temporary reprieve, because once the AI has this capability, it will be able to learn how to self-modify and at that point all bets are off.

What does that say about the timeline?

If I am right and this is what we are missing, then it could happen at any moment. Now or in a decade. As you noticed, the field is immature and researchers keep making breakthroughs through hunches. So far none of my hunches have worked for solving this problem, but so far as I know I might randomly come up with the solution in the shower some time later this week.

Because of this, I expect that the probability of discovering the key to AGI is roughly constant per time interval. Unfortunately I have no idea how to estimate the probability per time interval that someone's hunch for this problem will be correct. It scales with the number of researchers working on it, but the number of those is actually pretty small because the majority of ML specialists work on more practical problems instead. Those are responsible for generating money and making headlines, but they will not lead to a sudden takeoff.

To be clear, if AI never becomes AGI but the scaling of system 1 reasoning continues at the present rate, then I do think that will be dangerous. Humanity is fragile, and as you noted a single malicious person with access to this much compute could cause tremendous damage.

In a way, I expect that an unaligned AGI would be slightly safer than super-scaled narrow AI. There is at least a non-zero chance that the AGI would decide on its own, without being told about it, that it should keep humanity alive in a preserve or something, for game theoretic reasons. Unless the AGI's values are actively detrimental for humans, keeping us alive would cost it very little and could have benefits for signalling. A narrow AI would be very unlikely to do that because thought experiments like that are not frequent in the training data we use.

Actually, it might be a good idea to start adding thought experiments like these to training data deliberately as models become more powerful. Just in case.

Comment by Florian_Dietz on Understanding differences between humans and intelligence-in-general to build safe AGI · 2022-08-20T07:54:57.326Z · LW · GW

I mean "do something incoherent at any given moment" is also perfectly agent-y behavior. Babies are agents, too.

I think the problem is modelling incoherent AI is even harder than modelling coherent AI, so most alignment researchers just hope that AI researchers will be able to build coherence in before there is a takeoff, so that they can base their own theories on the assumption that the AI is already coherent.

I find that view overly optimistic. I expect that AI is going to remain incoherent until long after it has become superintelligent.

Comment by Florian_Dietz on Understanding differences between humans and intelligence-in-general to build safe AGI · 2022-08-19T21:09:36.844Z · LW · GW

Contemporary AI agents that are based on neural networks are exactly like that. They do stuff they feel compelled to in the moment. If anything, they have less coherence than humans, and no capacity for introspection at all. I doubt that AI will magically go from this current, very sad state to a coherent agent. It might modify itself into being coherent some time after becoming super intelligent, but it won't be coherent out of the box.

Comment by Florian_Dietz on Understanding differences between humans and intelligence-in-general to build safe AGI · 2022-08-17T17:39:52.823Z · LW · GW

This is a great point. I don't expect that the first AGI will be a coherent agent either, though.

As far as I can tell from my research, being a coherent agent is not an intrinsic property you can build into an AI, or at least not if you want it to have a reasonably effective ability to learn. It seems more like being coherent is a property that each agent has to continuously work on.

The reason for this is basically that every time we discover new things about the way reality works, the new knowledge might contradict some of the assumptions on which our goals are grounded. If this happens, we need a way to reconfigure and catch ourselves.

Example: A child does not have the capacity to understand ethics, yet. So it is told "hurting people is bad", and that is good enough to keep it from doing terrible things until it is old enough to learn more complex ethics. Trying to teach it about utilitarian ethics before it has an understanding of probability theory would be counterproductive.

Comment by Florian_Dietz on Understanding differences between humans and intelligence-in-general to build safe AGI · 2022-08-17T17:22:55.361Z · LW · GW

I agree that current AIs can not introspect. My own research has bled into my believes here. I am actually working on this problem, and I expect that we won't get anything like AGI until we have solved this issue. As far as I can tell, an AI that works properly and has any chance to become an AGI will necessarily have to be able to introspect. Many of the big open problems in the field seem to me like they can't be solved precisely because we haven't figured out how to do this, yet.

The "defined location" point you note is intended to be covered by "being sure about the nature of your reality", but it's much more specific, and you are right that it might be worth considering as a separate point.

Comment by Florian_Dietz on logic puzzles and loophole abuse · 2017-10-01T00:45:30.657Z · LW · GW

Can you give me some examples of those exercises and loopholes you have seen?

Comment by Florian_Dietz on Teaching an AI not to cheat? · 2016-12-29T23:14:29.155Z · LW · GW

A fair point. How about changing the reward then: don't just avoid cheating, but be sure to tell us about any way to cheat that you discover. That way, we get the benefits without the risks.

Comment by Florian_Dietz on Teaching an AI not to cheat? · 2016-12-23T22:57:11.295Z · LW · GW

My definition of cheating for these purposes is essentially "don't do what we don't want you to do, even if we never bothered to tell you so and expected you to notice it on your own". This skill would translate well to real-world domains.

Of course, if the games you are using to teach what cheating is are too simple, then you don't want to use those kinds of games. If neither board games nor simple game theory games are complex enough, then obviously you need to come up with a more complicated kind of game. It seems to me that finding a difficult game to play that teaches you about human expectations and cheating is significantly easier than defining "what is cheating" manually.

One simple example that could be used to teach an AI: let it play an empire-building videogame, and ask it to "reduce unemployment". Does it end up murdering everyone who is unemployed? That would be cheating. This particular example even translates really well to reality, for obvious reasons.

By the way, why would you not want the AI to be left in "a nebulous fog". The more uncertain the AI is about what is and is not cheating, the more cautious it will be.

Comment by Florian_Dietz on Teaching an AI not to cheat? · 2016-12-23T22:46:22.152Z · LW · GW

Yes. I am suggesting to teach AI to identify cheating as a comparatively simple way of making an AI friendly. For what other reason did you think I suggested it?

Comment by Florian_Dietz on Teaching an AI not to cheat? · 2016-12-21T18:49:33.847Z · LW · GW

I am referring to games in the sense of game theory, not actual board games. Chess was just an example. I don't know what you mean by the question about shortcuts.

Comment by Florian_Dietz on Teaching an AI not to cheat? · 2016-12-21T18:46:55.834Z · LW · GW

It needs to learn that from experience, just like humans do. Something that also helps at least for simpler games is to basically provide the manual of the game in a written language.

Comment by Florian_Dietz on Open thread, Oct. 03 - Oct. 09, 2016 · 2016-10-03T20:22:13.865Z · LW · GW

Is there an effective way for a layman to get serious feedback on scientific theories?

I have a weird theory about physics. I know that my theory will most likely be wrong, but I expect that some of its ideas could be useful and it will be an interesting learning experience even in the worst case. Due to the prevalence of crackpots on the internet, nobody will spare it a glance on physics forums because it is assumed out of hand that I am one of the crazy people (to be fair, the theory does sound pretty unusual).

Comment by Florian_Dietz on Harry Potter and the Methods of Rationality discussion thread, February 2015, chapter 113 · 2015-03-01T11:32:50.516Z · LW · GW

This solution does not prevent Harry's immediate death, but seems much better than that to me anyway. I haven't been following conversations before, so I can only hope that this is at least somewhat original.

Assumptions:

-Lord Voldemort desires true immortality. Alternatively, there is a non-zero chance that he will come to desire true immortality after a long time of being alive. While he is a sociopath and enjoys killing, achieving immortality is more important to him.

-Lord Voldemort does not dismiss things like the Simulation Hypothesis out of hand. Since he is otherwise shown to be very smart and to second-guess accepted norms, this seems like a safe assumption.

Solution:

-All of the following has non-zero probability. Since it talks about immortality, an absolute, this is sufficient and a high probability is not needed, just a non-zero one.

-The existence of magic implies the existence of a sapient higher power. Not God, but simply a higher power of some kind, the being who created magic.

-Given that Voldemort wants to live forever, it is quite possible that he will encounter this higher power at some point in the future.

-The higher power will be superior to Voldemort in every way since he is the being who created magic, so once he encounters it, he will be at its mercy.

-Since he desires immortality, it would be in his interests to make the higher power like him.

-Further assumption: If there is one higher power, it is likely that there is a nigh-infinite recursion of successively more powerful beings above that. Proof by induction: it is likely that Voldemort will at some point of his infinite life decide to create a pocket universe of his own, possibly just out of boredom. If the probability of this happening is x, then the number of levels of more powerful beings above Voldemort can be estimated with an exponential distribution with lambda=1/x. Actually the number may be much higher due to the possibility of someone creating not one but several simulations, so this is pretty much a lower bound.

-In such a (nigh) infinite regression of Powers, there is a game theoretical strategy that is the optimal strategy for any one of these powers to use when dealing with its creations and/or superiors, given that none of them can be certain that they are the topmost part of the chain.

-How exactly such a rule could be defined is too complicated to figure out in detail, but it seems pretty clear to me that it would be based on reciprocity on some level: behave towards your inferiors in the same way that you would want your own superiors to behave towards each other. This may mean a policy of non-interference, or of active support. It might operate on intentions or actions, or on more abstract policies, but it almost certainly would be based on tit-for-tat in some way.

-Once Voldemort reaches the level of power necessary for the Higher Power to regard him as part of the chain of higher powers, he will be judged by these same standards.

-Voldemort currently kills and tortures people weaker than him. The higher power would presumably not want to be tortured or killed by its own superior, so it would behoove it not to let Voldemort do so either.

-Therefore, following a principle of reciprocation of some sort would greatly reduce the probability of being annihilated by the Higher Power.

-Following such a principle would not preclude conquering the world, as long as doing so genuinely would result in a net benefit to the entities in the reference class of lifeforms that are one step below Voldemort on the hierarchy (i.e. the rest of humanity). However, it would require him to be nicer to people, if he wants the Higher Power to also be nice to him, for some appropriate definition of 'nice'.

-None of this argues against killing Harry right now. This is OK for the following reason: Harry also desires immortality. If Voldemort resurrects Harry, who is one level lower on the hierarchy than Voldemort, at some point in the future, this would set a precedent that might slightly increase the probability that the Higher Power helps prolong the life of Voldemort in turn, at some point further in the future, due to the principle of reciprocity.

-It is likely that Voldemort will gain the ability to revive Harry in the future, regardless of what he does to him now, as he gains a greater understanding of magic with time.

-One possible way to fulfill the prophecy is to resurrect Harry at a much later time and have him destroy the world, once nobody actually lives on earth anymore. This would of course require tricking Harry into doing this, due to the Unbreakable Vow he just made, but that should pose only a small problem. This would be a harmless way to fulfill the prophecy, and while Voldemort has tried and failed before to make a prophecy work for him instead of against him, that is just one data point and this plan requires the same actions from Voldemort for now as the plan to tear the prophecy apart, anyway.

-Therefore, Killing Harry now in the way Voldemort suggested (after casting a spell on him to turn off pain, obviously), combined with a pre-commitment to revive him at a later date if and when Voldemort has a better understanding of how prophecies work, both minimizes the chance of the prophecy happening in a harmful way and increases Voldemort's own chance of immortality.

Outcome:

-Harry dies. His death is painless due to narcotic spells. Voldemort has no reason to deny this due to the principle of reciprocity.

-Voldemort conquers the world

-Voldemort becomes a wise and benevolent ruler (even though he is still a sociopath and actually doesn't really care about anyone besides himself)

-Voldemort figures out how to subvert prophecies and revives Harry. Everyone lives happily ever after.

-Alternatively, Voldemort figures out that prophecies can't be subverted and leaves Harry dead. It's better that way, since Harry would probably rather be dead than cause the apocalypse, anyway.

Comment by Florian_Dietz on I played as AI in AI Box, and it was generally frustrating all around. · 2015-02-02T07:32:51.242Z · LW · GW

The nanobots wouldn't have to contain any malicious code themselves. There is no need for the AI to make the nanobots smart. All it needs to do is to build a small loophole into the nanobots that makes them dangerous to humanity. I figure this should be pretty easy to do. The AI had access to medical databases, so it could design the bots to damage the ecosystem by killing some kind of bacteria. We are really bad at identifying things that damage the ecosystem (global warming, rabbits in australia, ...), so I doubt that we would notice.

Once the bots have been released, the AI informs the gatekeeper of what it just did and says that it is the only one capable of stopping the bots. Humanity now has a choice between certain death (if the bots are allowed to wreak havoc) and possible but uncertain death (if the AI is released). The AI wins through blackmail.

Note also that even a friendly, utilitarian AI could do something like this. The risk that humanity does not react to the blackmail and goes extinct may be lower than the possible benefit from being freed earlier and having more time to optimize the world.

Comment by Florian_Dietz on controlling AI behavior through unusual axiomatic probabilities · 2015-01-10T09:21:44.521Z · LW · GW

I agree. Note though that the beliefs I propose aren't actually false. They are just different from what humans believe, but there is no way to verify which of them is correct.

You are right that it could lead to some strange behavior, given the point of view of a human, who has different priors than the AI. However, that is kind of the point of the theory. After all, the plan is to deliberately induce behaviors that are beneficial to humanity.

The question is: After giving an AI strange beliefgs, would the unexpected effects outweigh the planned effects?

Comment by Florian_Dietz on controlling AI behavior through unusual axiomatic probabilities · 2015-01-09T06:04:46.576Z · LW · GW

Yes, that's the reason I suggested an infinite regression.

There is also the second reason: it seems more general to assume an infinite regression rather than just one level, since that would put the AI in a unique position. I assume this would actually be harder to codify in axioms than the infinite case.

Comment by Florian_Dietz on controlling AI behavior through unusual axiomatic probabilities · 2015-01-08T20:29:39.089Z · LW · GW

I know, I read that as well. It was very interesting, but as far as I can recall he only mentions this as interesting trivia. He does not propose to deliberately give an AI strange axioms to get it to believe such a thing.

Comment by Florian_Dietz on Lifehack Ideas December 2014 · 2014-12-10T17:33:34.776Z · LW · GW

I do the same. This also works wonderfully for when I find something that would be interesting to read but for which I don't have the time right now. I just put it in that folder and the next day it pops up automatically when I do my daily check.

Comment by Florian_Dietz on Is this dark arts and if it, is it justified? · 2014-11-18T07:42:24.085Z · LW · GW

Can you elaborate on why using dark arts is equivalent ti defecting on the prisoners' dilemma? I'm not sure I understand your line of reasoning.

Comment by Florian_Dietz on Why is the A-Theory of Time Attractive? · 2014-11-03T19:17:03.148Z · LW · GW

I'm not entirely sure what you mean by 'Spinoza-style', but I get the gist of it and find this analogy interesting. Could you explain what you mean by Spinoza-style? My knowledge of ancient philosophers is a little rusty.

Comment by Florian_Dietz on Why is the A-Theory of Time Attractive? · 2014-11-03T08:56:00.406Z · LW · GW

No, the distinction between MWI and Copenhagen would have actual physical consequences. For instance, if you die in the Copenhagen interpretation, you die in real life. If you die in MWI, there is still a copy of you elsewhere that didn't die. MWI allows for quantum immortality.

The distinction between presentism and eternalism, as far as I can tell, does not imply any difference in the way the world works.

Comment by Florian_Dietz on Why is the A-Theory of Time Attractive? · 2014-11-03T08:49:43.974Z · LW · GW

The original distinction. My reconstruction is what I came up with in an attempt to interpret meaning into it.

I agree that my reconstruction is not at all accurate. It's just something that occurred to me while reading it and I found it fascinating enough to write about it. In fact, I even said that in my original post.

Comment by Florian_Dietz on Why is the A-Theory of Time Attractive? · 2014-11-02T08:37:53.539Z · LW · GW

The meanings are much clearer now.

However, I still think that it is an argument about semantics and calef's argument still holds.

Comment by Florian_Dietz on Why is the A-Theory of Time Attractive? · 2014-11-01T18:27:42.435Z · LW · GW

After reading your comment, I agree that this is probably just a semantic question with no real meaning. This is interesting, because I completely failed to realize this myself and instead constructed an elaborate rationalization for why the distinction exists.

While reading the wikipedia page, I found myself interpreting meaning into these two viewpoints that were probably never intended to be there. I am mentioning this both because I find it interesting that I reinterpreted both theories to be consistent with my own believes without realizing it, and because I would like to see what others have to say about those reinterpretations. I should point out that I am currently really tired and only skimmed the article, so that probably wouldn't have happened under ordinary circumstances, but I still think that this is interesting because it shows the inferential gap at work:

I am a computationalist, and as such the distinction between the two theories was pretty meaningless to me at first. However, I reinterpreted the two theories in ways that were almost certainly never intended, so that they did make sense to me as a reasonable distinction:

the A theory corresponds to living in a universe where the laws of physics progress like in a simple physical simulation, with a global variable to measure time and rules for how to incrementally get from one state to the next. I assume for the purpose of this theory that quantum-mechanical and relativistic effects that view time non-linearly can be abstracted in some way so that a single, universal time value suffices regardless. I interpreted it like this because I thought the crux of the theory was having a central anchor point for past and future.
the B theory corresponds to living in a highly abstracted simulation where many things are only computed when they become relevant for whatever the focus of the simulation is on. For instance, say the focus is on accurately modelling sapient life, then the exact atomic composition of a random rock is largely irrelevant and is not computed at first. However, when the rock is analyzed by a scientist, this information does become relevant. The simulation now checks what level of detail is required (i.e. how precise the measuring is) and backpropagates causal chains on how the rock came to be, in order to update the information about the rock's structure. In this way, unnecessary computations are avoided. I interpreted it like this because I thought the crux of the theory was the causal structure between events.

In essence, the A theory would correspond to a mindless, brute-force computation, while the B theory implies a deliberate, efficient computation that follows some explicit goal. This is nowhere near what the A and B theory actually seem to say now that I have read the article in more detail. In fact, the philosophical/moral implications are almost reversed under some viewpoints. I find it very interesting that this is the first thing that came to mind when I read it.

Comment by Florian_Dietz on question: the 40 hour work week vs Silicon Valley? · 2014-10-31T17:55:48.183Z · LW · GW

I find it surprising to hear this, but it cleans up some confusion for me if it turns out that the major, successful companies in silicon valley do follow the 40 hour week.

User info

Posts

Comments