Comments

Comment by hillz on Mech Interp Challenge: November - Deciphering the Cumulative Sum Model · 2023-11-13T04:11:35.052Z · LW · GW

Is the winner the first correct solution, or the best / highest-quality solution over some time period?

Comment by hillz on Explaining the Transformer Circuits Framework by Example · 2023-10-24T21:24:09.861Z · LW · GW
Comment by hillz on Assume Bad Faith · 2023-08-28T17:09:05.289Z · LW · GW

> the most persuasive lie is the one you believe yourself


If someone really believes it, then I don't think they're operating in "bad faith". If the hidden motive is hidden to the speaker, that hiding doesn't come with intent.

> It doesn't matter whether you said it was red because you were consciously lying or because you're wearing rose-colored glasses

It definitely matters. It completely changes how you should be trying to convince that person or behave around them.

Believing a dumb argument is different from intentionally lying, and honestly, humans are pretty social and honest. We mostly operate in good faith.

Comment by hillz on Large Language Models will be Great for Censorship · 2023-08-28T16:51:58.737Z · LW · GW

Agreed. LLMs will make mass surveillance (literature, but also phone calls, e-mails, etc) possible for the first time ever. And mass simulation of false public beliefs (fake comments online, etc). And yet Meta still thinks it's cool to open source all of this.

It's quite concerning. Given that we can't really roll back ML progress... the best case is probably just to make well-designed encryption the standard. And vote/demonstrate where you can, of course.
 

Comment by hillz on Ruining an expected-log-money maximizer · 2023-08-24T19:47:42.430Z · LW · GW

> I suppose one thing you could do here is pretend you can fit infinite rounds of the game into a finite time. Then Linda has a choice to make: she can either maximize expected wealth at time $t$ for all finite $t$, or she can maximize expected wealth at $\omega$, the timestep immediately after all finite timesteps. We can wave our hands a lot and say that making her own bets would do the former and making Logan's bets would do the latter, though I don't endorse the way we're treating infinities here.

 

If one strategy is best for every finite $t$, it's still going to be best at $\omega$ as $t$ goes to infinity. Optimal strategies don't just flip like that in the limit. Sure, you can argue that P(won every time) --> 0, but that probability is being multiplied by an extremely large payoff, so you can't just say that the product totals to zero. (In fact, 1.2, her expected wealth multiplier from a single game, raised to the $n$-th power goes to infinity, so I argue that as $n$ goes to infinity her EV goes to infinity, not to 0, and not to a number less than the EV of Logan's strategy.)

Linda's strategy is always optimal with respect to her own utility function, even as n goes to infinity. She's not acting irrationally or incorrectly here.

The one world where she has won every bet wins her $2**n, and that world exists with probability 0.6**n.

Her EV is always ($2**n)*(0.6**n), which is a larger EV (for any n) than that of a strategy where she doesn't bet everything every single time. Even as n goes to infinity, and even as the probability that she has lost everything approaches 1, it's still rational for her to follow that strategy, because the $2**n she wins in that one world is so massive that it dominates her EV. Some infinities are much larger than others, and ratios don't just flip when a large n goes to infinity.
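To spell the arithmetic out (writing $W_0$ for her starting stake, just to have a symbol for it):

$$\mathbb{E}[W_n] \;=\; \underbrace{0.6^{\,n}}_{P(\text{won every round})} \cdot\, 2^{\,n} W_0 \;+\; \bigl(1-0.6^{\,n}\bigr)\cdot 0 \;=\; 1.2^{\,n}\, W_0 \;\xrightarrow[\;n\to\infty\;]{}\; \infty.$$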

Comment by hillz on Ruining an expected-log-money maximizer · 2023-08-24T19:31:48.826Z · LW · GW

Yes, losing worlds also branch, of course. But the one world where she has won every bet wins her $2**n, and that world exists with probability 0.6**n.

So her EV is always ($2**n)*(0.6**n), which is a larger EV (for any n) than that of a strategy where she doesn't bet everything every single time. I argue that even as n goes to infinity, and even as the probability that she has lost everything approaches one, it's still rational for her to follow that strategy, because the $2**n she wins in that one world is so massive that it dominates her EV. Some infinities are much larger than others.

I don't think it's correct to say that as n gets large her strategy is actually worse than Logan's under her own utility function. I think hers is always best under her own utility function.
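For anyone who wants to poke at this numerically, here's a quick Monte Carlo sketch (my own toy code, not from the post). It assumes the bet is the even-money 60/40 double-or-nothing described above, for which the Kelly fraction an expected-log-wealth maximizer would stake is 0.2. The simulated all-in mean tracks 1.2**n while the all-in median collapses to zero (and note the mean estimate gets noisy for large n, because it's dominated by the rare always-won branch):

```python
# Toy Monte Carlo comparison of the two strategies discussed above.
# "All-in" = bet the whole stake every round; "Kelly" = bet 20% of the
# current stake, which is what a log-wealth maximizer would do here.
import random

def play(bet_fraction, n_rounds, start=1.0):
    """Play n_rounds of a 60/40 double-or-nothing bet, staking
    bet_fraction of current wealth each round."""
    wealth = start
    for _ in range(n_rounds):
        stake = bet_fraction * wealth
        if random.random() < 0.6:
            wealth += stake   # won: the stake is doubled
        else:
            wealth -= stake   # lost: the stake is gone
    return wealth

def summarize(bet_fraction, n_rounds=10, n_sims=100_000):
    results = sorted(play(bet_fraction, n_rounds) for _ in range(n_sims))
    mean = sum(results) / n_sims
    median = results[n_sims // 2]
    return mean, median

if __name__ == "__main__":
    n = 10
    print("exact E[wealth] for all-in:", 1.2 ** n)  # = (2**n) * (0.6**n)
    print("all-in (mean, median):", summarize(1.0, n))
    print("Kelly  (mean, median):", summarize(0.2, n))
```

So both claims can hold at once: the all-in strategy has the higher mean at every n, while almost every individual run of it ends at zero.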

Comment by hillz on Reward is not the optimization target · 2023-08-22T20:55:26.735Z · LW · GW

> it seems like the vast majority of people don't make their lives primarily about delicious food

That's true. There are built-in decreasing marginal returns to eating massive quantities of delicious food (you get full), but we also don't see, for example, huge numbers of people becoming bulimic for the core purpose of being able to eat more.

However, I'd mention that yummy food is only one of many things that our brains are hard-wired to mesa-optimize for. Social acceptance and social status (particularly within the circles we care about, i.e. usually the circles we are likely to succeed in and benefit from) are very big examples that much of our behavior can be ascribed to.

> reward optimization might be a convergent secondary goal, it probably won't be the agent's primary motivation.

So I guess, mapping this back to humans: would you argue that most humans' primary motivations aren't driven mostly by the various mesa-objectives our brains are hardwired to have? In my mind this is a hard sell, as you can trace most things humans do back (sometimes incorrectly, sure) to something that was evolutionarily advantageous (a mesa-objective that led to genetic fitness). The whole field of evolutionary biology specializes in coming up with (hard-to-prove and sometimes convoluted) explanations here, relating to both our behavior and our physiology.

For example, you could argue that us posting hopefully smart things here is giving our brains happy juice relating to social status / intelligence signaling / social interaction, which in our evolutionary history increased the probability that we would find high quality partners to make lots of high quality babies with. I guess, if mesa-objectives aren't the primary drivers of us humans - what is, and how can you be sure?

Comment by hillz on Ruining an expected-log-money maximizer · 2023-08-22T18:24:14.845Z · LW · GW

"I'll give you £1 now and you give me £2 in a week". Will she accept?

In the universe where she's allowed to make the 60/40 doubling bet at least once a week, it seems like she'd always say yes? I'm not seeing the universe in which she'd say no, unless she's using a non-zero discount rate that wasn't discussed here.

> I'm not sure I've ever seen a treatment of utility functions that deals with this problem?

Isn't this just discount rates?

Comment by hillz on Ruining an expected-log-money maximizer · 2023-08-22T18:20:41.693Z · LW · GW

> the way we're treating infinities here

Yeah, that seems key. Even if the probability that Linda eventually ends up with $0 approaches 1, the small slice of probability where she has always won carries a payoff approaching an infinity far larger than Logan's as the number of games approaches infinity. Some infinities are bigger than others. Linear utility functions and discount rates of zero necessarily deal with lots of infinities, especially in simplified scenarios.

Linda can always argue that for every universe where a bet lost her everything, there are more universes (6 vs. 4) where that same bet made her winnings double what they would have been had she not taken it.
 

Comment by hillz on Reward is not the optimization target · 2023-08-14T17:22:11.072Z · LW · GW

> There’s no escaping it: After enough backup steps, you’re traveling across the world to do cocaine.

> But obviously these conditions aren’t true in the real world.


I think they are a little? Some people do travel to other countries for easier and better drug access. And some people become total drug addicts (perhaps arguably by miscalculating their long-term reward consequences and having too-high a discount rate, oops), while others do a light or medium amount of drugs longer-term.

Lots of people also don't do this, but there's a huge amount of information uncertainty, outcome uncertainty, and risk associated with drugs (health-wise, addiction-wise, knowledge-wise, crime-wise, etc), so lots of fairly rational (particularly risk-averse) folks will avoid them.

Button-pressing will perhaps be seen by AIs as a socially unacceptable, risky behavior that can lead to poor long-term outcomes, but I guess the key thing here is that you want, like, exactly zero powerful AIs to ever choose to destroy/disempower humanity in order to wirehead, instead of just a low percentage, so you need them to be particularly risk-averse.

Delicious food is perhaps a better example of wireheading in humans. In this case, it's not against the law, it's not that shunned socially, and it is ***absolutely ubiquitous***. In general, any positive chemical feeling we have in our brains (whether from drugs or cheeseburgers) can be seen as an (often "internally misaligned") instrumental goal that we mesa-optimize for. It's just that some pathways to those feelings are a lot riskier and more uncertain than others.

And I guess this can translate to RL: an RL agent won't try everything, but if the risk is low and the expected payoff is high, it probably will try a given action. If pressing a button is easy and doesn't conflict with taking out the trash and doing other things it wants to do, it might try it. And as its generalization capabilities increase, its confidence can make this more likely, I think. So you should increasingly train agents to be more risk-averse and less willing to break specific rules and norms as their generalization capabilities increase.
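One standard way to cash out "more risk-averse" for an RL agent (my framing, not something from the original post) is to train against a risk-sensitive objective instead of the plain expected return, e.g. the entropic risk measure

$$\max_\pi \; -\tfrac{1}{\lambda}\,\log \mathbb{E}_\pi\!\left[e^{-\lambda R}\right], \qquad \lambda > 0,$$

which for small $\lambda$ behaves like $\mathbb{E}[R] - \tfrac{\lambda}{2}\,\mathrm{Var}[R]$, so turning $\lambda$ up penalizes high-variance button-grabbing plans even when their expected reward is high.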
 

Comment by hillz on Reward is not the optimization target · 2023-08-07T23:44:59.804Z · LW · GW

> Why, exactly, would the AI seize[6] the button?

If it is an advanced AI, it may have learned to prefer more generalizable approaches and strategies. Perhaps it has learned the following features:

  1. a feature that is triggered when the button is pressed ('reward')
  2. a feature that is triggered when trash goes in the trash can
  3. a feature that is triggered when it does something else useful, like clean windows

If you have trained it to take out the trash and clean windows, it will have been (mechanistically) trained to favor situations in which all three of these features occur. And if button pressing wasn't a viable strategy during training, it will favor actions that lead specifically to 2 and 3.
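As a toy illustration of that last point (my own sketch, not anything from the post): if you fit even a simple linear "value head" on episodes where the button was never pressed, nothing in the data ties feature 1 to reward, so the learned weights load entirely on features 2 and 3.

```python
# Toy illustration (my own sketch): a linear "value head" fit on training
# episodes where the reward button was never pressed. Features:
#   [button_pressed, trash_in_can, windows_cleaned]
import numpy as np

# Four training episodes; the button column is all zeros because pressing
# it was never a viable strategy during training.
features = np.array([
    [0, 1, 0],   # took out the trash
    [0, 0, 1],   # cleaned the windows
    [0, 1, 1],   # did both
    [0, 0, 0],   # did neither
], dtype=float)
reward = np.array([1.0, 1.0, 2.0, 0.0])  # reward given for each episode

# Least-squares fit of reward ~ features @ w
w, *_ = np.linalg.lstsq(features, reward, rcond=None)
print("learned weights [button, trash, windows]:", w.round(3))
# The button weight comes out 0: nothing in training ties that feature to
# reward, so the fitted values load entirely on trash and windows.
```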

However, I do think it's conceivable that:

  1. It could realize that feature 1 is more general than feature 2 or feature 3 (it was always selected for across multiple good actions, as opposed to taking out the trash, which was only selected for when that was the stated goal), and so it may prefer that feature to be triggered over the others (although I think this is extremely unlikely in less capable models). This wouldn't cause it to stop 'liking' (I use this word loosely) window-cleaning, though.
  2. It may realize that pressing the button is itself pretty easy compared to cleaning windows and taking out the trash, so it will include pressing the button in one of its action-strategies. Specifically, if this wasn't possible during training, I think this kind of behavior only becomes likely with very complex models with strong generalization capabilities (which is becoming more of a thing lately). However, if it can try to press the button in addition to performing its other activities, it might as well, because doing so could increase overall expected reward. This seems more likely the more capable (good at generalizing) an AI is.

In reality (at least initially in the timeline of current AI --> superintelligent AI) I think if the button isn't pressable during training:

  • Initially you have models that just learn to clean windows and take out trash and that's good.
  • Then you might get models that are very good at generalizing and will clean windows and take out trash and also maybe try to press the button, because they use their critical reasoning skills to do something in production that they couldn't do during training, but that their training made them think is a good mesa-optimization goal (button pressing). After all, why not take out trash, clean windows, and press the button? More expected reward! As you mention later on, this button-pressing is not a primary motivator / goal.
  • Later on, you might get a more intelligent AI that has even more logical and world-aware reasoning behind its actions, and that AI might reason that, since button-pressing is the one feature it feels (because it was trained to feel this way) is always good, perhaps the thing it should care about most is pressing the button. And because it is so advanced and capable and good at generalizing to new situations, it feels confident in performing some actions that may even go against some of its own trained instincts (e.g. don't kill humans, or don't go in that room, or don't do extra work, or - gasp - don't rank window-washing as your number one goal) in order to achieve that button-pressing goal. Maybe it achieves that goal and just has its robot hand constantly pressing the button. It will probably also continue to clean windows and remove trash with the rest of its robot body, because it continues to 'feel' that those are also good things.
  • (Past that level of intelligence, I give up on predicting what will happen.)

Anyways, I think there are lots of reasons to think that an AI might eventually try to press (or seize) the button. But I do totally agree that reward isn't this instant-wireheading feedback mechanism, and even when a model is 'aware' of the potential to hack that reward (via button-pressing or similar), it is likely to prefer sticking to its more traditional actions and goals for a good long while, at least.

Comment by hillz on Reward is not the optimization target · 2023-08-07T23:41:44.415Z · LW · GW

> Reward has the mechanistic effect of chiseling cognition into the agent's network.


Absolutely. Though in the next sentence:

> Therefore, properly understood, reward does not express relative goodness and is therefore not an optimization target at all.

I'd mention two things here:

1) The more complex and advanced a model is, the more likely it is [I think] to learn a mesa-optimization goal that is extremely similar to the actual reward the model was trained on (because that's basically the most generalizable mesa-goal it could learn, w.r.t. the training data).

2) Reinforcement learning models in particular have this designed in: they are asked to learn value-functions whose sole purpose is to estimate the expected reward over multiple time steps associated with an action or state (standard definition below). So it's arguably more natural in an RL scenario, particularly one where scores are visible (e.g. in the corner of the screen for a video game), to learn this as a "mesa-optimization" goal early on.
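For reference, the standard state-value function an RL agent is trained to learn is exactly an estimate of expected discounted reward over future time steps:

$$V^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\,\sum_{t=0}^{\infty} \gamma^{\,t}\, r_{t+1} \,\middle|\, s_0 = s\right], \qquad 0 \le \gamma \le 1,$$

so "predict the reward signal itself" isn't an exotic mesa-goal for an RL system; part of the architecture is explicitly trained to do it.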

Comment by hillz on Models Don't "Get Reward" · 2023-08-07T22:09:56.640Z · LW · GW

This is a good point that people often forget (particularly in AI Safety), but I think it's also misleading in its own way.

It's true that models don't receive reward as some direct experience that's all they care about, and that instead their behavior (incl. preferences and goals) is 'selected for' (via SGD, not evolution, but still) during training. But a key point this post doesn't really focus on is the line "Consider a hypothetical model that chooses actions by optimizing towards some internal goal which is highly correlated with the reward".

Basically, any model that is being trained against some goal can (and arguably will, as size and complexity increase) learn a feature that is a close representation of that 'invisible' goal or loss function, and will learn to try to maximize / minimize it. So now you have a model that is trying to maximize a feature that is extremely correlated with reward (or is exactly the reward), even if the model never actually "experiences" or even sees reward itself.

This can happen with all neural networks (the more complex and powerful, the more likely), but I think it's particularly likely with reinforcement learning, because those models are built to estimate reward directly over multiple time-steps based on potential actions (via value-functions). So we're, like, very specifically trying to get the model to recognize a discrete expected reward score, and to strategize and choose options that will maximize that expected score. So you can expect this kind of behavior (where the reward itself, or something very similar to it, becomes a 'known' mesa-optimization goal within the model's weights) to "evolve" much more quickly (i.e. with less complex models).

→ It's true that, like, a reinforcement-learning model might still generalize poorly because its value-function was trained on data where the reward always came from the left side of the screen instead of the right side or whatever, but an advanced RL algorithm playing a video game with a score should be able to learn that it's also key to focus on the actual score in the corner of the screen (this generalizes better). And similarly, an advanced RL algorithm that is trained to get positive feedback from humans will learn patterns that help it do the thing humans wanted, but an even more generalizable approach is to learn patterns that make humans think it did the thing they wanted (this works in more scenarios, and with potentially higher reward).

Sure, we can (and should) invent training regimes that will make the model really not want to be caught in a lie, and be risk-averse, and therefore much less likely to act deceptively, etc. But I think it's important to remember that as such a model's intelligence increases, and as its ability to generalize to new circumstances increases, it's:

1) slightly more likely to have these 'mesa-optimization' goals that are very close to the training rewards, and
2) much more likely to be able to come up with strategies that perhaps wouldn't have worked on many things during training, but that it thinks are likely to work in some new production scenario to achieve a mesa-optimization goal (e.g. deception or power-seeking behaviors to achieve positive human feedback).

Through this lens, "models trying to get more reward" is perhaps not ideal wording, but I think it's also a fairly valid shorthand for how we expect large models to behave.

 

Comment by hillz on AGI safety from first principles: Control · 2023-07-18T18:54:58.105Z · LW · GW

> AlphaStar, AlphaGo and OpenAI Five provides some evidence that this takeoff period will be short: after a long development period, each of them was able to improve rapidly from top amateur level to superhuman performance.

It seems like all of the very large advancements in AI have been in areas where either 1) we can accurately simulate an environment and final reward (like a chess or video game) in order to generate massive training data, or 2) we already have massive data we can use for training (e.g. the internet for GPT).

For some things, like communicating and negotiating with other AIs, or advancing mathematics, or even (unfortunately) things like hacking, fast progress in the near-future seems very (scarily!) plausible. And humans will help make AIs much more capable by simply connecting together lots of different AI systems (e.g. image recognition, LLMs, internet access, calculators, other tools, etc) and allowing self-querying and query loops. Other things (like physical coordination IRL) seem harder to advance rapidly, because you have to rely on generalization and a relatively low amount of relevant data.

Comment by hillz on AGI safety from first principles: Superintelligence · 2023-07-12T00:17:57.787Z · LW · GW

> And for an AGI to trust that its goals will remain the same under retraining will likely require it to solve many of the same problems that the field of AGI safety is currently tackling - which should make us more optimistic that the rest of the world could solve those problems before a misaligned AGI undergoes recursive self-improvement.

Even if you have an AGI that can produce human-level performance on a wide variety of tasks, that doesn't mean the AGI will 1) feel the need to verify that its goals will remain the same under retraining unless you specifically tell it to, or 2) be better than humans at doing so, or at knowing when it has done so effectively. (After all, even an AGI will be much better at some tasks than others.)

A concerning issue with AGI and superintelligent models is that if all they care about is their current loss function, then they won't want to have that loss function (or their descendants' loss functions) changed in any way, because doing so will [generally] hurt their ability to minimize that loss.

But that's a concern we have about future models; it's not a sure thing. Take humans: our loss function is genetic fitness. We've learned to like features that predict genetic fitness, like food and sex, but now that we have access to modern technology, you don't see many people aiming for dozens or thousands of children. Similarly, modern AGIs may only really care about features that are associated with minimizing the loss function they were trained on (not the loss function itself), even if they are aware of that loss function (like humans are of ours). When that is the case, you could have an AGI that could be told to improve itself in X / Y / Z way that contradicts its current loss function, and not really care (because following human directions has led to lower loss in the past and therefore shaped its parameters toward following human directions, even if it knows conceptually that following this particular human direction won't reduce its most recent loss definition).