Posts

Reflections on my first year of AI safety research 2024-01-08T07:49:08.147Z
Features and Adversaries in MemoryDT 2023-10-20T07:32:21.091Z
Spreadsheet for 200 Concrete Problems In Interpretability 2023-03-29T06:51:46.114Z
Reflections on my 5-month alignment upskilling grant 2022-12-27T10:51:49.872Z
Deep Q-Networks Explained 2022-09-13T12:01:08.033Z
Jay Bailey's Shortform 2022-08-01T02:05:42.506Z

Comments

Comment by Jay Bailey on Deep Q-Networks Explained · 2024-09-11T18:24:23.288Z · LW · GW

Hi Giorgi,

Not an expert on this, but I believe the idea is that over time the agent will learn to assign negligible probabilities to actions that don't do anything. For instance, imagine a game where the agent can move in four directions, but if there's a wall in front of it, moving forward does nothing. The agent will eventually learn to stop moving forward in this circumstance. So you could probably just make it work, even if it's a bit less efficient, if you just had the environment do nothing if an invalid action was selected.
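A minimal sketch of the "environment does nothing on an invalid action" idea. The `Corridor` class, its action constants, and the reward scheme are hypothetical, made up purely for illustration and not taken from any particular RL library:

```python
from dataclasses import dataclass

# Hypothetical 1-D corridor: positions 0..length-1, with a wall at each end.
LEFT, RIGHT = 0, 1


@dataclass
class Corridor:
    length: int = 5
    pos: int = 0

    def valid_actions(self):
        """Actions that would actually change the agent's position."""
        actions = []
        if self.pos > 0:
            actions.append(LEFT)
        if self.pos < self.length - 1:
            actions.append(RIGHT)
        return actions

    def step(self, action):
        """Apply the action; an invalid action leaves the state unchanged."""
        if action in self.valid_actions():
            self.pos += 1 if action == RIGHT else -1
        done = self.pos == self.length - 1  # goal is the right-hand end
        reward = 1.0 if done else 0.0
        return self.pos, reward, done
```

Walking into the wall here just wastes a timestep, so with discounting the agent has a reason to learn to avoid it, at the cost of some sample efficiency compared with masking invalid actions outright.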

Comment by Jay Bailey on Deep Q-Networks Explained · 2024-05-17T19:35:33.626Z · LW · GW

Thanks for this! I've changed the sentence to:

The target network gets to see one more step than the Q-network does, and thus is a better predictor.

Hopefully this prevents others from the same confusion :)
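For anyone who prefers code, here is a rough sketch of what "seeing one more step" means. The function name and the plain-list representation of Q-values are my own simplifications, not code from the post:

```python
def td_target(reward, next_q_values, gamma=0.99, done=False):
    """One-step TD target: r + gamma * max over a' of Q_target(s', a').

    `next_q_values` stands in for the target network's outputs at the
    next state s'. The Q-network's own prediction Q(s, a) was made
    before the reward and s' were observed, so the target is built
    from one extra step of real experience.
    """
    if done:
        # No future value after a terminal state.
        return reward
    return reward + gamma * max(next_q_values)
```

The Q-network's prediction is then regressed towards this target, e.g. by minimising the squared difference between the two.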

Comment by Jay Bailey on How do I get better at D&D Sci? · 2024-05-12T12:54:38.181Z · LW · GW

pandas is a good library for this - it takes CSV files and turns them into Python objects you can manipulate. plotly / matplotlib lets you visualise data, which is also useful. GPT-4 / Claude could help you with this. I would recommend starting by getting a language model to help you create plots of the data according to relevant subsets. Like if you think that the season matters for how much gold is collected, give the model a couple of examples of the data format and simply ask it to write a script to plot gold per season.
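As a rough illustration of the "gold per season" summary, here is a sketch that uses only the standard library so it runs without installing anything. The column names and sample rows are invented; real D&D.Sci data will look different. In pandas the same aggregation would be roughly `pd.read_csv(path).groupby("season")["gold"].mean()`, and you could then plot the result.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Made-up sample in a hypothetical format.
SAMPLE_CSV = """season,gold
spring,10
summer,30
spring,20
summer,50
winter,5
"""


def mean_gold_per_season(csv_text):
    """Group rows by season and average the gold column."""
    by_season = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        by_season[row["season"]].append(float(row["gold"]))
    return {season: mean(values) for season, values in by_season.items()}
```

Calling `mean_gold_per_season(SAMPLE_CSV)` gives one average per season, which is the table you would hand to a plotting library.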

Comment by Jay Bailey on How do I get better at D&D Sci? · 2024-05-11T19:48:56.982Z · LW · GW

To provide the obvious advice first:

  • Attempt a puzzle.
  • If you didn't get the answer, check the comments of those who did.
  • Ask yourself how you could have thought of that, or what common principle that answer has. (e.g., I should check for X and Y)
  • Repeat.

I assume you have some programming experience here - if not, that seems like a prerequisite to learn. Or maybe you can get away with using LLMs to write the Python for you.

Comment by Jay Bailey on Ethics and prospects of AI related jobs? · 2024-05-11T16:30:56.568Z · LW · GW

I don't know about the first one - I think you'll have to analyse each job and decide about that. I suspect the answer to your second question is "Basically nil". I think that unless you are working on state-of-the-art advances in:

  A) Frontier models
  B) Agent scaffolds, maybe.

You are not speeding up the knowledge required to automate people.

Comment by Jay Bailey on Ethics and prospects of AI related jobs? · 2024-05-11T12:08:22.131Z · LW · GW

I guess my way of thinking of it is - you can automate tasks, jobs, or people.

Automating tasks seems probably good. You're able to remove busywork from people, but their job is comprised of many more things than that task, so people aren't at risk of losing their jobs. (Unless you only need 10 units of productivity, and each person is now producing 1.25 units so you end up with 8 people instead of 10 - but a lot of teams could also put 12.5 units of productivity to good use.)

Automating jobs is...contentious. It's basically the tradeoff I talked about above.

Automating people is bad right now. Not only are you eliminating someone's job, you're eliminating most other things this person could do at all. This person has had society pass them by, and I think we should either not do that or make sure this person still has sufficient resources and social value to thrive in society despite being automated out of an economic position. (If I was confident society would do this, I might change my tune about automating people)

So, I would ask myself - what type of automation am I doing? Am I removing busywork, replacing jobs entirely, or replacing entire skillsets? (Note: You are probably not doing the last one. Very few, if any, are. The tech does not seem there atm. But maybe the company is setting themselves up to do so as soon as it is, or something)

And when you figure out what type you're doing, you can ask how you feel about that.

Comment by Jay Bailey on Ethics and prospects of AI related jobs? · 2024-05-11T11:05:39.917Z · LW · GW

I think that there are two questions one could ask here:

  • Is this job bad for x-risk reasons? I would say that the answer to this is "probably not" - if you're not pushing the frontier but are only commercialising already available technology, your contribution to x-risk is negligible at best. Maybe you're very slightly adding to the generative AI hype, but that ship's somewhat sailed at this point.

  • Is this job bad for other reasons? That seems like something you'd have to answer for yourself based on the particulars of the job. It also involves some philosophical/political priors that are probably pretty specific to you. Like - is automating away jobs good most of the time? Argument for yes - it frees up people to do other work, it advances the amount of stuff society can do in general. Argument for no - it takes away people's jobs, disrupts lives, some people can't adapt to the change.

I'll avoid giving my personal answer to the above, since I don't want to bias you. I think you should ask how you feel about this category of thing in general, and then decide how picky or not you should be about these AI jobs based on that category of thing. If they're mostly good, you can just avoid particularly scummy fields and other than that, go for it. If they're mostly bad, you shouldn't take one unless you have a particularly ethical area you can contribute to.

Comment by Jay Bailey on Speedrun ruiner research idea · 2024-04-14T20:38:59.445Z · LW · GW

It seems to me that either:

  • RLHF can't train a system to approximate human intuition on fuzzy categories. This includes glitches, and this plan doesn't work.

  • RLHF can train a system to approximate human intuition on fuzzy categories. This means you don't need the glitch hunter, just apply RLHF to the system you want to train directly. All the glitch hunter does is make it cheaper.

Comment by Jay Bailey on LessWrong's (first) album: I Have Been A Good Bing · 2024-04-01T17:32:54.704Z · LW · GW

I was about 50/50 on it being AI-made, but then when I saw the title "Thought That Faster" was a song, I became much more sure, because that was a post that happened only a couple weeks ago I believe, and if it was human-made I assume it would take longer to go from post to full song. Then I read this post.

Comment by Jay Bailey on Ten Minutes with Sam Altman · 2024-03-29T19:39:21.965Z · LW · GW

In Soviet Russia, there used to be something called a Coke party. You saved up money for days to buy a single can of contraband Coca-Cola. You got all of your friends together and poured each of them a single shot. It tasted like freedom.

I know this isn't the point of the piece, but this got to me. However much I appreciate my existence, it never quite seems to be enough to be calibrated to things like this. I suddenly feel both a deep appreciation and vague guilt. Though it does give me a new gratitude exercise - imagine the item I am about to enjoy is forbidden in my country and I have acquired a small sample at great expense.

Comment by Jay Bailey on Failures in Kindness · 2024-03-29T18:41:07.869Z · LW · GW

I notice that this is a standard pattern I use and had forgotten how non-obvious it is, since you do have to imagine yourself in someone else's perspective. If you're a man dating women on dating apps, you also have to imagine a very different perspective than your own - women tend to have many more options of significantly lower average quality. It's unlikely you'd imagine yourself giving up on a conversation because it required mild effort to continue, since you have fewer of them in the first place and invest more effort in each one.

The level above that one, by the way, is going from being "easy to respond to" to "actively intriguing", where your messages contain some sort of hook that is not only an easy conversation-continuer, but actually makes them want to either find out more (because you're interesting) or keep talking (because the topic is interesting).

Worth noting is that I don't have enough samples of this strategy to know how good it is. However, it is also worth noting that I don't have enough samples because I wound up saturated on new relationships a couple of weeks after starting this strategy, so for a small n it was definitely quite useful.

Comment by Jay Bailey on [deleted post] 2024-01-16T00:11:17.497Z

What I'm curious about is how you balance this with the art of examining your assumptions.

Puzzle games are a good way of examining how my own mind works, and I often find that I go through an algorithm like:

  • Do I see the obvious answer?
  • What are a few straightforward things I could try?

Then Step 3 I see as similar to your maze-solving method:

  • What are the required steps to solve this? What elements constrain the search space?

But I often find that for difficult puzzles, a fourth step is required:

  • What assumptions am I making, that would lead me to overlook the correct answer if the assumption was false?

For instance, I may think a lever can only be pulled, and not pushed - or I may be operating under a much harder to understand assumption, like "In this maze, the only things that matter are visual elements" when it turns out the solution to this puzzle actually involved auditory cues.

Comment by Jay Bailey on Reflections on my first year of AI safety research · 2024-01-11T00:10:09.578Z · LW · GW

Concrete feedback signals I've received:

  • I don't find myself excited about the work. I've never been properly nerd-sniped by a mechanistic interpretability problem, and I find the day-to-day work to be more drudgery than exciting, even though the overall goal of the field seems like a good one.

  • When left to do largely independent work, after doing the obvious first thing or two ("obvious" at the level of "These techniques are in Neel's demos") I find it hard to figure out what to do next, and hard to motivate myself to do more things if I do think of them because of the above drudgery.

  • I find myself having difficulty backchaining from the larger goal to the smaller one. I think this is a combination of a motivational issue and having less grasp on the concepts.

By contrast, in evaluations, none of this is true. I am able to solve problems more effectively, I find myself actively interested in problems (the ones I'm working on and ones I'm not), and I find myself more able to solve problems and reason about how they matter for the bigger picture.

I'm not sure how much of each is a contributor, but I suspect that if I was sufficiently excited about the day-to-day work, all the other problems would be much more fixable. There's a sense of reluctance, a sense of burden, that saps a lot of energy when it comes to doing this kind of work.

As for #2, I guess I should clarify what I mean, since there's two ways you could view "not suited".

  1. I will never be able to become good enough at this for my funding to be net-positive. There are fundamental limitations to my ability to succeed in this field.

  2. I should not be in this field. The amount of resources required to make me competitive in this field is significantly larger than other people who would do equally good work, and this is not true for other subfields in alignment.

I view my use of "I'm not suited" more like 2 than 1. I think there's a reasonable chance that, given enough time with proper effort and mentorship in a proper organisational setting (being in a setting like this is important for me to reliably complete work that doesn't excite me), I could eventually do okay at this field. But I also think that there are other people who would do better, faster, and be a better use of an organisation's money than me.

This doesn't feel like the case in evals. I feel like I can meaningfully contribute immediately, and I'm sufficiently motivated and knowledgeable that I can understand the difference between my job and my mission (making AI go well) and feel confident that I can take actions to succeed in both of them.

If Omega came down from the sky and said "Mechanistic interpretability is the only way you will have any impact on AI alignment - it's this or nothing" I might try anyway. But I'm not in that position, and I'm actually very glad I'm not.

Comment by Jay Bailey on Most People Don't Realize We Have No Idea How Our AIs Work · 2023-12-22T01:02:03.103Z · LW · GW

Anecdotally I have also noticed this - when I tell people what I do, the thing they are frequently surprised by is that we don't know how these things work.

As you implied, if you don't understand how NNs work, your natural closest analogue to ChatGPT is conventional software, which is at least understood by its programmers. This isn't even people being dumb about it, it's just a lack of knowledge about a specific piece of technology, and a lack of knowledge that there is something to know - that NNs are in fact qualitatively different from other programs.

Comment by Jay Bailey on [deactivated]'s Shortform · 2023-11-18T14:35:26.430Z · LW · GW

Yes, this is an argument people have made. Longtermists tend to reject it. First off, applying a discount rate on the moral value of lives in order to account for the uncertainty of the future is...not a good idea. These two things are totally different, and shouldn't be conflated like that imo. If you want to apply a discount rate to account for the uncertainty of the future, just do that directly. So, for the rest of the post I'll assume a discount rate on moral value actually applies to moral value.

So, that leaves us with the moral argument.

A fairly good argument, and the one I subscribe to, is this:

  • Let's say we apply a conservative discount rate, say, 1% per year, to the moral value of future lives.
  • Given that, one life now is worth approximately 500 million lives two millennia from now. (0.99^2000 = approximately 2e-9)
  • But would that have been reasonably true in the past? Would it have been morally correct to save a life in 0 BC at the cost of 500 million lives today?
  • If the answer is "no" to that, it should also be considered "no" in the present.
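The arithmetic in the second bullet is easy to check. A quick sketch (the function name is just for illustration):

```python
def discount_factor(rate, years):
    """Weight given to a life `years` from now under a constant annual discount."""
    return (1.0 - rate) ** years


# 1% per year, compounded over two millennia.
factor = discount_factor(0.01, 2000)

# How many future lives are weighted the same as one life today.
lives_equivalent = 1.0 / factor
```

Here `factor` comes out a little under 2e-9, so `lives_equivalent` is a bit over 500 million, in line with the rounded figures above.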

This is, again, different from a discount rate on future lives based on uncertainty. It's entirely reasonable to say "If there's only a 50% chance this person ever exists, I should treat it as 50% as valuable." I think that this is a position that wouldn't be controversial among longtermists.

Comment by Jay Bailey on Apply to the Constellation Visiting Researcher Program and Astra Fellowship, in Berkeley this Winter · 2023-10-31T01:30:17.989Z · LW · GW

For the Astra Fellowship, what considerations do you think people should be thinking about when deciding to apply for SERI MATS, Astra Fellowship, or both? Why would someone prefer one over the other, given they're both happening at similar times?

Comment by Jay Bailey on Features and Adversaries in MemoryDT · 2023-10-22T20:16:55.540Z · LW · GW

The agent's context includes the reward-to-go, state (i.e., an observation of the agent's view of the world) and action taken for nine timesteps. So, R1, S1, A1, .... R9, S9, A9. (Figure 2 explains this a bit more) If the agent hasn't made nine steps yet, some of the S's are blank. So S5 is the state at the fifth timestep. Why is this important?

If the agent has made four steps so far, S5 is the initial state, which lets it see the instruction. Four is the number of steps it takes to reach the corridor where the agent has to make the decision to go left or right. This is the key decision for the agent to make, and the agent only sees the instruction at S5, so S5 is important for this reason.

Figure 1 visually shows this process - the static images in this figure show possible S5's, whereas S9 is animation_frame=4 in the GIF - it's fast, so it's hard to see, but it's the step before the agent turns.
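To make the indexing concrete, here is a small sketch of how I'm describing the state slots. It assumes the context is left-padded so the newest state always sits in S9; the function and the string labels are illustrative only, not the actual MemoryDT code:

```python
CONTEXT_LEN = 9  # timesteps of (reward-to-go, state, action) in the context


def state_slots(states_so_far, context_len=CONTEXT_LEN):
    """Left-pad the observed states so the newest one lands in the last slot.

    Returns a list of length `context_len`; index 0 is S1 and index 8 is S9.
    Blank slots are represented as None.
    """
    recent = states_so_far[-context_len:]
    padding = [None] * (context_len - len(recent))
    return padding + recent


# After four steps the agent has seen five states: the initial one plus four more.
slots = state_slots(["s0", "s1", "s2", "s3", "s4"])
```

With this layout `slots[4]` (that is, S5) holds the initial state containing the instruction, and `slots[8]` (S9) holds the state at the decision point.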

Comment by Jay Bailey on Holly Elmore and Rob Miles dialogue on AI Safety Advocacy · 2023-10-21T01:44:01.846Z · LW · GW

I think there’s an aesthetic clash here somewhere. I have an intuition or like... an aesthetic impulse, telling me basically… “advocacy is dumb”. Whenever I see anybody Doing An Activism, they're usually… saying a bunch of... obviously false things? They're holding a sign with a slogan that's too simple to possibly be the truth, and yelling this obviously oversimplified thing as loudly as they possibly can? It feels like the archetype of overconfidence.

This is exactly the same thing that I have felt in the past. Extremely well said. It is worth pointing out explicitly that this is not a rational thought - it's an Ugh Field around advocacy, and even if the thought is true, that doesn't mean all advocacy has to be this way.

Comment by Jay Bailey on Boost your productivity, happiness and health with this one weird trick · 2023-10-20T04:42:06.934Z · LW · GW

I find this interesting but confusing. Do you have an idea for what mechanism allowed this? E.g: Are you getting more done per hour now than your best hours working full-time? Did the full-time hours fall off fast at a certain point? Was there only 15 hours a week of useful work for you to do and the rest was mostly padding?

Comment by Jay Bailey on Fertility Roundup #2 · 2023-10-18T04:52:33.471Z · LW · GW

I think this makes a lot of sense. While I think you can make the case for "fertility crisis purely as a means of preventing economic slowdown and increasing innovation" I think your arguments are good that people don't actually often make this argument, and a lot of it does stem from "more people = good".

But I think if you start from "more people = good", you don't actually have motivated reasoning as much as you suspect re: innovation argument. I think it's more that the innovation argument actually does just work if you accept that more people = good. Because if more people = good, that means more people were good before penicillin and then are even more good afterwards, and these two don't actually cancel each other out.

In summary, I don't think that "more people = good" motivates the "Life is generally good to have, actually" argument - I think if anything it's the other way around. People who think life is good tend to be more likely to think it's a moral good to give it to others. The argument doesn't say it's "axiomatically good" to add more people, it's "axiomatically good conditional on life being net positive".

As for understanding why people might feel that way - my best argument is this.

Let's say you could choose to give birth to a child who would be born with a terribly painful and crippling disease. Would it be a bad thing to do that? Many people would say yes.

Now, let's say you could choose to give birth to a child who would live a happy and healthy positive life? Would that be a good thing? It seems that, logically, if giving birth to a child who suffers is bad, giving birth to a child who enjoys life is good.

That, imo, is the best argument for being in favor of more people if you think life is positive.

Note that I don't think this means people should be forced to have kids or that you're a monster for choosing not to, even if those arguments were true. You can save a life for 5k USD after all, and raising a kid yourself takes far more resources than that. Realistically, if my vasectomy makes me a bad person then I'm also a bad person for not donating every spare dollar to the AMF instead of merely 10%, and if that's a "bad person" then the word has no meaning.

Comment by Jay Bailey on Fertility Roundup #2 · 2023-10-17T23:42:52.274Z · LW · GW

Okay, I think I see several of the cruxes here.

Here's my understanding of your viewpoint:

"It's utterly bizarre to worry about fertility. Lack of fertility is not going to be an x-risk anytime soon. We already have too many people and if anything a voluntary population reduction is a good thing in the relative near-term. (i.e., a few decades or so) We've had explosive growth over the last century in terms of population, it's already unstable, why do we want to keep going?"

In a synchronous discussion I would now pause to see if I had your view right. Because that would take too much time in an asynchronous discussion, I'll reply to the imaginary view I have in my head, while hoping it's not too inaccurate. Would welcome corrections.

If this view of yours seems roughly right, here's what I think are the viewpoint differences:

I think people who worry about fertility would agree with you that fertility is not an existential threat.

I think the intrinsic value of having more people is not an important crux - it is possible to have your view on Point 3 and still worry about fertility.

I think the "fertility crisis" is more about replacement than continued increase. It is possible that many of the people who worry about fertility would also welcome still more people, but I don't think they would consider it a crisis if we were only at replacement rates, or close to it.

I think people who care about speed of innovation don't just care about imposed population deadlines looming, but also about quality of life - if we had invented penicillin a century earlier, many people would have lived much longer, happier lives, for example. One could frame technological progress as a moral imperative this way. I'm not sure if this is a major crux, but I think there are people with a general "More people = good" viewpoint for this reason, even ignoring population ethics. You are right that we could use the people we have better, but I don't see this as a mutually exclusive situation.

I think the people who worry about the fertility crisis would disagree with you about Point 4. I don't think it's obvious that "tech to deal with an older population" is actually easier than "tech to deal with a larger population". It might be! Might not be.

While you may not agree with these ideas, I hope I've presented them reasonably and accurately enough that it makes the other side merely different, rather than bizarre and impossible to understand.

Comment by Jay Bailey on Fertility Roundup #2 · 2023-10-17T15:11:46.440Z · LW · GW

I would suggest responding with your points (Top 3-5, if you have too many to easily list) on why this is incredibly obviously not a problem, seeing where you get pushback if anywhere, and iterating from there. Don't be afraid to point out "incredibly obvious" things - it might not be incredibly obvious to other people. And if you're genuinely unsure why anyone could think this is a problem, the responses to your incredibly obvious points should give you a better idea.

Comment by Jay Bailey on Truthseeking when your disagreements lie in moral philosophy · 2023-10-11T09:10:14.367Z · LW · GW

I think Tristan is totally right, and it puts an intuition I've had into words. I'm not vegan - I am sympathetic to the idea of having this deep emotional dislike of eating animals, I feel like the version of me who has this is a better person, and I don't have it. From a utilitarian perspective I could easily justify just donating a few bucks to animal charities...but veganism isn't about being optimally utilitarian. I see it as more of a virtue ethics thing. It's not even so much that I want to be vegan, but I want to be the kind of person who chooses it. But I'm not a sufficiently good person to actually do it, which does make me feel somewhat guilty at times. As a salve to my conscience, I've recently decided to try giving up chicken entirely, which seems like a solid step forward that is still pretty easy to make.

Comment by Jay Bailey on Jay Bailey's Shortform · 2023-09-27T01:09:56.417Z · LW · GW

One of the core problems of AI alignment is that we don't know how to reliably get goals into the AI - there are many possible goals that are sufficiently correlated with doing well on training data that the AI could wind up optimising for a whole bunch of different things.

Instrumental convergence claims that a wide variety of goals will lead to convergent subgoals such that the agent will end up wanting to seek power, acquire resources, avoid death, etc.

These claims do seem a bit...contradictory. If goals are really that inscrutable, why do we strongly expect instrumental convergence? Why won't we get some weird thing that happens to correlate with "don't die, keep your options open" on the training data, but falls apart out of distribution?

Comment by Jay Bailey on GPT-4 for personal productivity: online distraction blocker · 2023-09-27T00:54:24.796Z · LW · GW

I found an error in the application - when removing the last item from the blacklist, every page not whitelisted is claimed to be blacklisted. Adding an item back to the blacklist fixes this. Other than that, it looks good!

Comment by Jay Bailey on Understanding strategic deception and deceptive alignment · 2023-09-26T08:28:23.907Z · LW · GW

Interesting. That does give me an idea for a potentially useful experiment! We could finetune GPT-4 (or RLHF an open source LLM that isn't finetuned, if there's one capable enough and not a huge infra pain to get running, but this seems a lot harder) on a "helpful, harmless, honest" directive, but change the data so that one particular topic or area contains clearly false information. For instance, Canada is located in Asia.

Does the model then:

  • Deeply internalise this new information? (I suspect not, but if it does, this would be a good sign for scalable oversight and the HHH generalisation hypothesis)
  • Score worse on honesty in general even in unrelated topics? (I also suspect not, but I could see this going either way - this would be a bad sign for scalable oversight. It would be a good sign for the HHH generalisation hypothesis, but not a good sign that this will continue to hold with smarter AIs)

One hard part is that it's difficult to disentangle "Competently lies about the location of Canada" and "Actually believes, insomuch as a language model believes anything, that Canada is in Asia now", but if the model is very robustly confident about Canada being in Asia in this experiment, trying to catch it out feels like the kind of thing Apollo may want to get good at anyway.

Comment by Jay Bailey on Understanding strategic deception and deceptive alignment · 2023-09-26T04:24:40.647Z · LW · GW

In current user-facing LLMs like ChatGPT or Claude, the closest approximation to goals may be being helpful, harmless, and honest.

According to my understanding of RLHF, the goal-approximation it trains for is "Write a response that is likely to be rated as positive". In ChatGPT / Claude, this is indeed highly correlated with being helpful, harmless, and honest, since the model's best strategy for getting high ratings is to be those things. If models are smarter than us, this may cease to be the case, as being maximally honest may begin to conflict with the real goal of getting a positive rating. (e.g., if the model knows something the raters don't, it will be penalised for telling the truth, which may optimise for deceptive qualities) Does this seem right?

Comment by Jay Bailey on Would You Work Harder In The Least Convenient Possible World? · 2023-09-23T02:29:22.262Z · LW · GW

I don't really understand how your central point applies here. The idea of "money saves lives" is not supposed to be a general rule of society, but rather a local point about Alice and Bob - namely, donating ~5k will save a life. That doesn't need to be always true under all circumstances, there just needs to be some repeatable action that Alice and Bob can take (e.g., donating to the AMF) that costs 5k for them that reliably results in a life being saved. (Your point about prolonging life is true, but since the people dying of malaria are generally under 5, the amount of QALYs produced is pretty close to an entire human lifetime)

It doesn't really matter, for the rest of the argument, how this causal relationship works. It could be that donating 5k causes more bednets to be distributed, it could be that donating 5k allows for effective lobbying to improve economic growth to the value of one life, or it could be that the money is burnt in a sacrificial pyre to the God of Charitable Sacrifices, who then descends from the heavens and miraculously cures a child dying of malaria. From the point of view of Alice and Bob, the mechanism isn't important if you're talking on the level of individual donations.

In other words, Alice and Bob are talking on the margins here, and on the margin, 5k spent equals one life saved, at least for now.

Comment by Jay Bailey on MakoYass's Shortform · 2023-08-27T06:27:35.472Z · LW · GW

Not quite, in my opinion. In practice, humans tend to be wrong in predictable ways (what we call a "bias") and so picking the best option isn't easy.

What we call "rationality" tends to be the techniques / thought patterns that make us more likely to pick the best option when comparing alternatives.

Comment by Jay Bailey on [deleted post] 2023-08-25T08:55:36.134Z

How about "AI-assisted post"? Shouldn't clash with anything else, and should be clear what it means on seeing the tag.

Comment by Jay Bailey on All AGI Safety questions welcome (especially basic ones) [July 2023] · 2023-08-04T11:41:30.973Z · LW · GW

"Reward" in the context of reinforcement learning is the "goal" we're training the program to maximise, rather than a literal dopamine hit. For instance, AlphaGo's reward is winning games of Go. When it wins a game, it adjusts itself to do more of what won it the game, and the other way when it loses. It's less like the reward a human gets from eating ice-cream, and more like the feedback a coach might give you on your tennis swing that lets you adjust and make better shots. We have no reason to suspect there's any human analogue to feeling good.

Comment by Jay Bailey on NAMSI: A promising approach to alignment · 2023-07-30T13:10:27.305Z · LW · GW

I think intelligence is a lot easier than morality, here. There are agreed upon moral principles like not lying, not stealing, and not hurting others, sure...but even those aren't always stable across time. For instance, standard Western morality held that it was acceptable to hit your children a couple of generations ago, now standard Western morality says it's not. If an AI trained to be moral said that actually, hitting children in some circumstances is a worthwhile tradeoff, that could mean that the AI is more moral than we are and we overcorrected, or it could mean that the AI is less moral than we are and is simply wrong.

And that's just for the same values! What about how values change over the decades? If our moral AI says that a Confucian obedience to parental authority is just, and that us Westerners are actually wrong about this, how do we know whether it's correct?

Intelligence tests tend to have a quick feedback loop. The answer is right or wrong. If a Go-playing AI makes a move that looks bizarre but then wins the game, that's indicative that it's superior. Morality is more like long-term planning - if a policy-making AI suggests a strange policy, we have no immediate way to judge whether this is good or not, because we don't have access to the ground truth of whether or not it works for a long time.

Similar with alignment. How do we know that a superhuman alignment solution would look reasonable to us instead of weird? (Also, for that matter, why would a more moral agent have better alignment solutions? Do you think that the blocker for good alignment solutions is that current alignment researchers are insufficiently virtuous to come up with correct solutions?)

Comment by Jay Bailey on NAMSI: A promising approach to alignment · 2023-07-29T12:31:30.990Z · LW · GW

If NAMSI achieved a superhuman level of expertise in morality, how would we know? I consider our society to be morally superior to the one we had in 1960. People in 1960 would not agree with this assessment upon looking. If NAMSI agrees with us about everything, it's not superhuman. So how do we determine whether its possibly-superhuman morality is superior or inferior?

Comment by Jay Bailey on The Weight of the Future (Why The Apocalypse Can Be A Relief) · 2023-06-27T23:10:20.615Z · LW · GW

That may explain why these scenarios have never been all that appealing to me, because I do think about the future in these hypothetical scenarios. I ask myself "Okay, what would the plan be in five years, when the scavenged food has long since run out?" and that feels scary and overwhelming. (Admittedly, rollercoaster scary, since it's a fantasy, but I find myself spending just as much time asking how the hell I'd learn to recreate agriculture and how miserable day-to-day farming would be as I do imagining myself as a badass hero who saves someone from zombies - and that's assuming I survive at all, which is a pretty big if!)

Comment by Jay Bailey on Why "AI alignment" would better be renamed into "Artificial Intention research" · 2023-06-16T02:27:39.218Z · LW · GW

""AI alignment" has the application, the agenda, less charitably the activism, right in the name."

This seems like a feature, not a bug. "AI alignment" is not a neutral idea. We're not just researching how these models behave or how minds might be built neutrally out of pure scientific curiosity. It has a specific purpose in mind - to align AI's. Why would we not want this agenda to be part of the name?

Comment by Jay Bailey on NicholasKross's Shortform · 2023-06-06T23:29:32.368Z · LW · GW

What are the best ones you've got?

Comment by Jay Bailey on What's the consensus on porn? · 2023-05-31T05:59:10.303Z · LW · GW

I don't think this is a good metric. It is very plausible that porn is net bad, but living under the type of government that would outlaw it is worse. In which case your best bet would be to support its legality but avoid it yourself.

I'm not saying that IS the case, but it certainly could be. I definitely think there are plenty of things that are net-negative to society but nowhere near bad enough to outlaw.

Comment by Jay Bailey on All AGI Safety questions welcome (especially basic ones) [May 2023] · 2023-05-09T00:53:13.894Z · LW · GW

An AGI that can answer questions accurately, such as "What would this agentic AGI do in this situation" will, if powerful enough, learn what agency is by default since this is useful to predict such things. So you can't just train an AGI with little agency. You would need to do one of:

  • Train the AGI with the capabilities of agency, and train it not to use them for anything other than answering questions.
  • Train the AGI such that it did not develop agency despite being pushed by gradient descent to do so, and accept the loss in performance.

Both of these seem like difficult problems - if we could solve either (especially the first) this would be a very useful thing, but the first especially seems like a big part of the problem already.

Comment by Jay Bailey on Ngo and Yudkowsky on scientific reasoning and pivotal acts · 2023-04-28T02:05:08.851Z · LW · GW

Late response but I figure people will continue to read these posts over time: Wedding-cake multiplication is the way they teach multiplication in elementary school, i.e., to multiply 706 x 265, you do 706 x 5, then 706 x 60, then 706 x 200 and add all the results together. I imagine it is called that because the result is tiered like a wedding cake.
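
The 706 x 265 example can be checked with a short sketch. The function name `wedding_cake_multiply` is made up for illustration, and it assumes non-negative integers.

```python
def wedding_cake_multiply(a, b):
    """Multiply a by b using one partial product per decimal digit of b."""
    partials = []
    place = 1  # 1 for the ones digit, 10 for the tens digit, and so on
    while b > 0:
        b, digit = divmod(b, 10)
        partials.append(a * digit * place)
        place *= 10
    return partials, sum(partials)


tiers, total = wedding_cake_multiply(706, 265)
print(tiers)  # [3530, 42360, 141200]
print(total)  # 187090
```

Each entry in `tiers` is one layer of the "cake": 706 x 5, 706 x 60, and 706 x 200.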

Comment by Jay Bailey on What Piles Up Must Pile Down · 2023-04-09T22:38:42.498Z · LW · GW

One of the easiest ways to automate this is to have some sort of setup where you are not allowed to let things grow past a certain threshold, a threshold which is immediately obvious and ideally has some physical or digital prevention mechanism attached.

Examples:

Set up a Chrome extension that doesn't let you have more than 10 tabs at a time. (I did this)

Have some number of drawers / closet space. If your clothes cannot fit into this space, you're not allowed to keep them. If you buy something new, something else has to come out.

Comment by Jay Bailey on Parable of the Dammed · 2023-04-05T22:35:22.915Z · LW · GW

I know this is two years later, but I just wanted to say thank you for this comment. It is clear, correct, and well-written, and if I had seen this comment when it was written, it could have saved me a lot of problems at the time.

I've now resolved this issue to my satisfaction, but once bitten twice shy, so I'll try to remember this if it happens again!

Comment by Jay Bailey on Deep Deceptiveness · 2023-04-03T08:42:12.344Z · LW · GW

Sorry it took me a while to get to this.

Intuitively, as a human, you get MUCH better results on a thing X if your goal is to do thing X, rather than thing X being applied as a condition for you to do what you actually want. For example, if your goal is to understand the importance of security mindset in order to avoid your company suffering security breaches, you will learn much more than if you are forced to go through mandatory security training. In the latter, you are probably putting in the bare minimum of effort to pass the course and go back to whatever your actual job is. You are unlikely to learn security this way, and if you had a way to press a button and instantly "pass" the course, you would.

I have in fact made a divide between some things and some other things, in my above post. I suppose I would call those things "goals" (the things you really want for their own sake) and "conditions" (the things you need to do for some external reason).

My inner MIRI says - we can only train conditions into the AI, not goals. We have no idea how to put a goal in the AI, and the problem is that if you train a very smart system with conditions only, and it picks up some arbitrary goal along the way, you end up not getting what you wanted. It seems that if we could get the AI to care about corrigibility and non-deception robustly, at the goal level, we would have solved a lot of the problem that MIRI is worried about.

Comment by Jay Bailey on Deep folding docs site? · 2023-03-28T09:22:50.777Z · LW · GW

Are you thinking of Dynalist? I know Neel Nanda's interpretability explainer is written in it.

Comment by Jay Bailey on Steelmanning OpenAI's Short-Timelines Slow-Takeoff Goal · 2023-03-27T22:04:16.626Z · LW · GW

I think what the OP was saying was that in, say, 2013, there's no way we could have predicted the type of agent that LLM's are and that they would be the most powerful AI's available. So, nobody was saying "What if we get to the 2020s and it turns out all the powerful AI are LLM's?" back then. Therefore, that raises a question on the value of the alignment work done before then.

If we extend that to the future, we would expect most good alignment research to happen within a few years of AGI, when it becomes clear what type of agent we're going to get. Alignment research is much harder if, ten years from now, the thing that becomes AGI is as unexpected to us as LLM's were ten years ago.

Thus, there's not really that much difference, goes the argument, if we get AGI in five years with LLM's or fifteen years with God only knows what, since it's the last few years that matter.

A hardware overhang, on the other hand, would be bad. Imagine we had 2030's hardware when LLM's came onto the scene. You'd have Vaswani et al. coming out in 2032 and by 2035 you'd have GPT-8. That would be terrible.

Therefore, says the steelman, the best scenario is if we are currently in a slow takeoff that gives us time. Hardware overhang is never going to be lower again than it is now, and that ensures we are bumping up against not only conceptual understanding or engineering requirements but also the fundamental limits of compute, which limits how fast we can scale the LLM paradigm. This may not happen if we get a new type of agent in ten years.

Comment by Jay Bailey on Deep Deceptiveness · 2023-03-26T13:18:28.876Z · LW · GW

We don't. Humans lie constantly when we can get away with it. It is generally expected in society that humans will lie to preserve people's feelings, lie to avoid awkwardness, and commit small deceptions for personal gain (though this third one is less often said out loud). Some humans do much worse than this.

What keeps it in check is that very few humans have the ability to destroy large parts of the world, and no human has the ability to destroy everyone else in the world and still have a world where they can survive and optimally pursue their goals afterwards. If there is no plan that can achieve this for a human, humans being able to lie doesn't make it worse.

Comment by Jay Bailey on Deep Deceptiveness · 2023-03-26T07:14:45.440Z · LW · GW

I think this post, more than anything else, has helped me understand the set of things MIRI is getting at. (Though, to be fair, I've also been going through the 2021 MIRI dialogues, so perhaps that's been building some understanding under the surface)

After a few days' reflection on this post, and a couple weeks after reading the first part of the dialogues, this is my current understanding of the model:

In our world, there are certain broadly useful patterns of thought that reliably achieve outcomes. The biggest one here is "optimisation". We can think of this as aiming towards a goal and steering towards it - moving the world closer to what you want it to be. These aren't things we train for - we don't even know how to train for them, or against them. They're just the way the world works. If you want to build a power plant, you need some way of getting energy to turn into electricity. If you want to achieve a task, you need some way of selecting a target and then navigating towards it, whether it be walking across the room to grab a cookie, or creating a Dyson sphere.

With gradient descent, maybe you can learn enough to train your AI for things like "corrigibility" or "not being deceptive", but really what you're training for is "Don't optimise for the goal in ways that violate these particular conditions". This does not stop it from being an optimisation problem. The AI will then naturally, with no prompting, attempt to find the best path that gets around these limitations. This probably means you end up with a path that gets the AI the things it wanted from the useful properties of deception or non-corrigibility while obeying the strict letter of the law. (After all, if deception / non-corrigibility wasn't useful, if it didn't help achieve something, you would not spontaneously get an agent that did this, without training it to do so) Again, this is an entirely natural thing. The shortest path between two points is a straight line. If you add an obstacle in the way, the shortest path between those two points is now to skirt arbitrarily close to the obstacle. No malice is required, any more than you are maliciously circumventing architects when you walk close to (but not walking into!) walls.

Basically - if the output of an optimisation process is dangerous, it doesn't STOP being dangerous by changing it into a slightly different optimisation process of "Achieve X thing (which is dangerous) without doing Y (which is supposed to trigger on dangerous things)". You just end up getting X through Y' instead, as long as you're still enacting the basic pattern - which you will be, because an AI that can't optimise things can't do anything at all. If you couldn't apply a general optimisation process, you'd be unable to walk across the room and get a cookie, let alone do all the things you do in your day-to-day life. Same with the AI.

I'd be interested in whether someone who understands MIRI's worldview decently well thinks I've gotten this right. I'm holding off on trying to figure out what I think about that worldview for now - I'm still in the understanding phase.

Comment by Jay Bailey on AI Capabilities vs. AI Products · 2023-03-25T22:32:36.067Z · LW · GW

I like this dichotomy. I've been saying for a bit that I don't think "companies that only commercialise existing models and don't do anything that pushes forward the frontier" are meaningfully increasing x-risk. This is a long and unwieldy statement - I prefer "AI product companies" as a shorthand.

For a concrete example, I think that working on AI capabilities as an upskilling method for alignment is a bad idea, but working on AI products as an upskilling method for alignment would be fine.

Comment by Jay Bailey on How to convince someone AGI is coming soon? · 2023-03-23T03:01:05.825Z · LW · GW

Based on the language you've used in this post, it seems like you've tried several arguments in succession, none of them have worked, and you're not sure why.

One possibility might be to first focus on understanding his belief as well as possible, and then once you understand his conclusions and why he's reached them, you might have more luck. Maybe taking a look at Street Epistemology for some tips on this style of inquiry would help.

(It is also worth turning this lens upon yourself, and asking why it is so important to you that your friend believes that AGI is imminent. Then you can decide whether it's worth continuing to try to persuade him.)

Comment by Jay Bailey on The Waluigi Effect (mega-post) · 2023-03-20T22:30:40.027Z · LW · GW

If anyone writes this up I would love to know about it - my local AI safety group is going to be doing a reading + hackathon of this in three weeks, attempting to use the ideas on language models in practice. It would be nice to have this version for a couple of people who aren't experienced with AI who will be attending, though it's hardly gamebreaking for the event if we don't have this.

Comment by Jay Bailey on What do you think is wrong with rationalist culture? · 2023-03-11T11:45:15.080Z · LW · GW

So, I notice that still doesn't answer the actual question of what my probability should actually be. To make things simple, let's assume that, if the sun exploded, I would die instantly. In practice it would have to take at least eight minutes, but as a simplifying assumption, let's assume it's instantaneous.

In the absence of relevant evidence, it seems to me like Laplace's Law of Succession would say the odds of the sun exploding in the next hour is 1/2. But I could also make that argument to say the odds of the sun exploding in the next year is also 1/2, which is nonsensical. So...what's my actual probability, here, if I know nothing about how the sun works except that it has not yet exploded, the sun is very old (which shouldn't matter, if I understand you correctly) and that if it exploded, we would all die?
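
For reference, the rule of succession estimates the chance of an event on the next trial as (s + 1) / (n + 2), where s is the number of past occurrences out of n observed trials. A minimal sketch is below; the function name and the trial counts are hypothetical, chosen only to show how the estimate behaves, and it does not settle what the right way to count trials is.

```python
from fractions import Fraction


def rule_of_succession(occurrences, trials):
    """Laplace's estimate of the chance the event occurs on the next trial."""
    return Fraction(occurrences + 1, trials + 2)


# With no observed trials at all, the estimate is 1/2 whatever the interval is.
print(rule_of_succession(0, 0))    # 1/2

# Counting past intervals in which the event did not occur lowers the estimate.
print(rule_of_succession(0, 100))  # 1/102
```

With zero observed trials the formula returns 1/2 for an hour and for a year alike, which is the oddity described above; the answer only moves once past explosion-free intervals are counted as trials, and it then depends on how those intervals are sliced.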