Posts

Could We Automate AI Alignment Research? 2023-08-10T12:17:05.194Z
An Overview of the AI Safety Funding Situation 2023-07-12T14:54:36.732Z
Retrospective on ‘GPT-4 Predictions’ After the Release of GPT-4 2023-03-17T18:34:17.178Z
GPT-4 Predictions 2023-02-17T23:20:24.696Z
Stephen McAleese's Shortform 2023-01-08T21:46:25.888Z
AGI as a Black Swan Event 2022-12-04T23:00:53.802Z
Estimating the Current and Future Number of AI Safety Researchers 2022-09-28T21:11:33.703Z
How Do AI Timelines Affect Existential Risk? 2022-08-29T16:57:44.107Z
Summary of "AGI Ruin: A List of Lethalities" 2022-06-10T22:35:48.500Z

Comments

Comment by Stephen McAleese (stephen-mcaleese) on Estimating the Current and Future Number of AI Safety Researchers · 2024-04-26T09:11:58.500Z · LW · GW

I think I might create a new post using information from this post which covers the new AI alignment landscape.

Comment by Stephen McAleese (stephen-mcaleese) on More people getting into AI safety should do a PhD · 2024-04-14T11:42:52.969Z · LW · GW

I think this section of the post is slightly overstating the opportunity cost of doing a PhD. PhD students typically spend most of their time on research so ideally, they should be doing AI safety research during the PhD (e.g. like Stephen Casper). If the PhD is in an unrelated field or for the sake of upskilling then there is a more significant opportunity cost relative to working directly for an AI safety organization.

Comment by Stephen McAleese (stephen-mcaleese) on The theory of Proximal Policy Optimisation implementations · 2024-04-13T09:50:29.593Z · LW · GW

Thank you for explaining PPO. In the context of AI alignment, it may be worth understanding in detail because it's the core algorithm at the heart of RLHF. I wonder if any of the specific implementation details of PPO or how it's different from other RL algorithms have implications for AI alignment. To learn more about PPO and RLHF, I recommend reading this paper: Secrets of RLHF in Large Language Models Part I: PPO.
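For readers who want the gist, here is a minimal sketch of PPO's clipped surrogate objective, the part most RLHF implementations build on (variable names are illustrative, not from any particular codebase):

```python
import torch

def ppo_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Minimal PPO clipped surrogate objective (policy term only).

    logp_new:   log-probs of the taken actions under the current policy
    logp_old:   log-probs under the policy that collected the data
    advantages: advantage estimates (from GAE, or reward-model scores in RLHF)
    """
    ratio = torch.exp(logp_new - logp_old)  # importance-sampling ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Take the pessimistic (minimum) objective, then negate for gradient descent.
    return -torch.min(unclipped, clipped).mean()
```

The clipping is the main implementation detail that distinguishes PPO from vanilla policy gradients: it limits how far each update can push the policy away from the one that generated the data.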

Comment by Stephen McAleese (stephen-mcaleese) on LLMs for Alignment Research: a safety priority? · 2024-04-05T19:08:58.192Z · LW · GW

From reading the codebase, it seems to be a LangChain chatbot powered by the default LangChain OpenAI model which is gpt-3.5-turbo-instruct. The announcement blog post also says it's based on gpt-3.5-turbo.

Comment by Stephen McAleese (stephen-mcaleese) on LLMs for Alignment Research: a safety priority? · 2024-04-05T08:43:00.503Z · LW · GW

LLMs aren't that useful for alignment experts because it's a highly specialized field and there isn't much relevant training data. The AI Safety Chatbot partially solves this problem using retrieval-augmented generation (RAG) on a database of articles from https://aisafety.info. There also seem to be plans to fine-tune it on a dataset of alignment articles.
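A rough sketch of how such a RAG pipeline typically works; the function and object names here are hypothetical stand-ins, not the actual AI Safety Chatbot code:

```python
# Hypothetical sketch of a retrieval-augmented generation (RAG) loop.
# embed(), vector_store, and generate() stand in for whatever embedding model,
# vector database, and LLM the real chatbot uses.

def answer(question: str, vector_store, embed, generate, k: int = 5) -> str:
    query_vec = embed(question)                      # embed the user question
    docs = vector_store.search(query_vec, top_k=k)   # retrieve similar alignment articles
    context = "\n\n".join(d.text for d in docs)      # concatenate retrieved passages
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return generate(prompt)                          # LLM produces a grounded answer
```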

Comment by Stephen McAleese (stephen-mcaleese) on Reward is not the optimization target · 2024-04-04T22:30:41.927Z · LW · GW

OP says that this post is focused on RL policy gradient algorithms (e.g. PPO) where the RL signal is used by gradient descent to update the policy.

But what about Q-learning, which is another popular RL algorithm? My understanding of Q-learning is that the Q-network takes an observation as input, estimates the value (expected return) of each possible action in that state, and then the policy chooses the action with the highest value.
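A minimal sketch of that action-selection rule (illustrative PyTorch, not from any particular implementation):

```python
import torch

def greedy_action(q_network, observation):
    """Pick the action with the highest estimated Q-value (expected return)."""
    with torch.no_grad():
        q_values = q_network(observation)   # one value per possible action
    return int(torch.argmax(q_values))      # act greedily with respect to Q
```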

Does this mean that reward is not the optimization target for policy gradient algorithms but is for Q-learning algorithms?

Comment by Stephen McAleese (stephen-mcaleese) on Modern Transformers are AGI, and Human-Level · 2024-03-31T11:16:04.758Z · LW · GW

I agree. GPT-4 is an AGI for the kinds of tasks I care about, such as programming and writing. ChatGPT-4 in its current form (with the ability to write and execute code) seems to be at the expert human level in many technical and quantitative subjects such as statistics and programming.

For example, last year I was amazed when I gave ChatGPT4 one of my statistics past exam papers and it got all the questions right except for one which involved interpreting an image of a linear regression graph. The questions typically involve understanding the question, thinking of an appropriate statistical method, and doing calculations to find the right answer. Here's an example question:

Times (in minutes) for a sample of 8 players are presented in Table 1 below. Using an appropriate test at the 5% significance level, investigate whether there is evidence of a decrease in the players’ mean 5k time after the six weeks of training. State clearly your assumptions and conclusions, and report a p-value for your test statistic.

The solution to this question is a paired sample t-test.
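For illustration, here is roughly how that test could be run; the times below are made-up placeholder numbers, not the actual exam data:

```python
from scipy import stats

# Placeholder data: 5k times (minutes) for 8 players before and after training.
before = [23.1, 25.4, 22.8, 26.0, 24.5, 23.9, 25.1, 24.2]
after  = [22.6, 24.9, 22.9, 25.1, 23.8, 23.5, 24.6, 23.7]

# One-sided paired t-test: is the mean time lower after training?
t_stat, p_value = stats.ttest_rel(before, after, alternative="greater")
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")  # reject H0 at the 5% level if p < 0.05
```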

Sure, GPT-4 has probably seen similar questions before, but so have students, since they can practice on past papers.

This year, one of my professors designed his optimization assignment to be ChatGPT-proof but I found that it could still solve five out of six questions successfully. The questions involved converting natural language descriptions of optimization problems into mathematical formulations and solving them with a program.

One of the few times I've seen GPT-4 genuinely struggle with a task is when I asked it to solve a variant of the Zebra Puzzle, a challenging puzzle that involves updating a table based on limited clues and using logical reasoning and a process of elimination to find the correct answer.

Comment by Stephen McAleese (stephen-mcaleese) on Can we get an AI to do our alignment homework for us? · 2024-03-01T23:29:18.859Z · LW · GW

I wrote a blog post on whether AI alignment can be automated last year. The key takeaways:

  • There's a chicken-and-egg problem where you need the automated alignment researcher to create the alignment solution but the alignment solution is needed before you can safely create the automated alignment researcher. The solution to this dilemma is an iterative bootstrapping process where the AI's capabilities and alignment iteratively improve each other (a more aligned AI can be made more capable and a more capable AI can create a more aligned AI and so on).
  • Creating the automated alignment researcher only makes sense if it is less capable and general than a full-blown AGI. Otherwise, aligning it is just as hard as aligning AGI.

There's no clear answer to this question because it depends on your definition of "AI alignment" work. Some AI alignment work is already automated today such as generating datasets for evals, RL from AI feedback, and simple coding work. On the other hand, there are probably some AI alignment tasks that are AGI-complete such as deep, cross-domain, and highly creative alignment work.

The idea of the bootstrapping strategy is that as the automated alignment researcher is made more capable, it improves its own alignment strategies, which enables further capability and alignment improvements, and so on. So hopefully there is a virtuous feedback loop over time where more and more alignment tasks are automated.

However, this strategy relies on a robust feedback loop which could break down if the AI is deceptive, incorrigible, or undergoes recursive self-improvement and I think these risks increase with higher levels of capability.

I can't find the source but I remember reading somewhere on the MIRI website that MIRI aims to do work that can't easily be automated so Eliezer's pessimism makes sense in light of that information.

Further reading:

Comment by Stephen McAleese (stephen-mcaleese) on Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI · 2024-01-29T22:42:26.800Z · LW · GW

Strong upvote. I think this is an excellent, carefully written, and timely post. Explaining issues that may arise from current alignment methods is urgent and important. It provides a good explanation of the unidentifiability or inner alignment problem that could arise from advanced AI systems trained with current behavioral safety methods. It also highlights the difficulty of making AIs that can automate alignment research, which is part of OpenAI's current plan. I also liked the in-depth description of what advanced science AIs would be capable of as well as the difficulty of keeping humans in the loop.

Comment by Stephen McAleese (stephen-mcaleese) on Refusal mechanisms: initial experiments with Llama-2-7b-chat · 2024-01-03T17:49:55.009Z · LW · GW

Nice post! The part I found most striking was how you were able to use the mean difference between outputs on harmful and harmless prompts to steer the model into refusing or not. I also like the refusal metric which is simple to calculate but still very informative.
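For readers unfamiliar with the technique, here is a rough sketch of difference-in-means activation steering of the kind described; the names are illustrative and the post's actual code will differ:

```python
import torch

def steering_vector(harmful_acts: torch.Tensor, harmless_acts: torch.Tensor) -> torch.Tensor:
    """Mean difference of residual-stream activations at a chosen layer.

    harmful_acts / harmless_acts: [n_prompts, d_model] activations collected
    from harmful and harmless prompts respectively.
    """
    return harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)

def steer(activations: torch.Tensor, direction: torch.Tensor, alpha: float) -> torch.Tensor:
    """Add the direction (alpha > 0 pushes toward refusal, alpha < 0 away from it)."""
    return activations + alpha * direction
```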

Comment by Stephen McAleese (stephen-mcaleese) on OpenAI, DeepMind, Anthropic, etc. should shut down. · 2023-12-22T10:33:36.292Z · LW · GW

TL;DR: Private AI companies such as Anthropic, which have revenue-generating products and also invest heavily in AI safety, seem like the best type of organization for doing AI safety research today. They might not be the best option in an ideal world, and maybe not in the future, but right now I think they are.

I appreciate the idealism and I'm sure there is some possible universe where shutting down these labs would make sense but I'm quite unsure about whether doing so would actually be net-beneficial in our world and I think there's a good chance it would be net-negative in reality.

The most glaring constraint is finances. AI safety is funding-constrained, so this is worth mentioning. Companies like DeepMind and OpenAI spend hundreds of millions of dollars per year on staff and compute, and I doubt that would be possible in a non-profit. Most of the non-profits working on AI safety (e.g. Redwood Research) are small, with just a handful of people. OpenAI changed from a non-profit to a capped-profit company because they realized that remaining a non-profit would have been insufficient for scaling their company and spending. OpenAI now generates $1 billion in revenue and I think it's pretty implausible that a non-profit could generate that amount of income.

The other alternative apart from for-profit companies and philanthropic donations is government funding. It is true that governments fund a lot of science. For example, the US government funds 40% of basic science research. And a lot of successful big science projects such as CERN and the ITER fusion project seem to be mostly government-funded. However, I would expect a lot of government-funded academic AI safety grants to be wasted by professors skilled at putting "AI safety" in their grant applications so that they can fund whatever they were going to work on anyway. Also, the fact that the US government has secured voluntary commitments from AI labs to build AI safely gives me the impression that governments are either unwilling or incapable of working on AI safety and instead would prefer to delegate it to private companies. On the other hand, the UK has a new AI safety institute and a language model task force.

Another key point is research quality. In my opinion, the best AI safety research is done by the big labs. For example, Anthropic created constitutional AI and also seems to be a leader in interpretability research. Empirical AI safety work and AI capabilities work involve very similar skills (coding etc.), so it's not surprising that leading AI labs also do the best empirical AI safety work.

There are several other reasons why big AI labs do the best empirical AI safety work. One is talent: top labs have the money to pay high salaries, which attracts top talent. Work in big labs also seems more collaborative than in academia, which matters for large projects; many top projects have dozens of authors (e.g. the Llama 2 paper). Finally, there is compute. Right now, only big labs have the infrastructure necessary to do experiments on leading models. Doing experiments such as fine-tuning large models requires a lot of money and hardware. For example, this paper by DeepMind on reducing sycophancy apparently involved fine-tuning the 540B PaLM model, which is probably not possible for most independent and academic researchers right now, so they usually have to work with smaller models such as Llama-2-7b. However, the UK is investing in some new public AI supercomputers which will hopefully level the playing field somewhat.

If you think theoretical work (e.g. agent foundations) is more important than empirical work, then big labs have less of an advantage, though DeepMind is doing some of that too.

Comment by Stephen McAleese (stephen-mcaleese) on Dialogue on the Claim: "OpenAI's Firing of Sam Altman (And Shortly-Subsequent Events) On Net Reduced Existential Risk From AGI" · 2023-11-21T22:58:56.311Z · LW · GW

GPT-4 is the model that has been trained with the most training compute which suggests that compute is the most important factor for capabilities. If that wasn't true, we would see some other company training models with more compute but worse performance which doesn't seem to be happening.

Comment by Stephen McAleese (stephen-mcaleese) on The other side of the tidal wave · 2023-11-03T22:31:15.088Z · LW · GW

No offense but I sense status quo bias in this post.

If you replace "AI" with "industrial revolution" I don't think the meaning of the text changes much and I expect most people would rather live today than in the Middle Ages.

One thing that might be concerning is that older generations (us in the future) might not have the ability to adapt to a drastically different world in the same way that some old people today struggle to use the internet.

I personally don't expect to be overly nostalgic in the future because I'm not that impressed by the current state of the world: factory farming, the hedonic treadmill, physical and mental illness, wage slavery, aging, and ignorance are all problems that I hope are solved in the future.

Comment by Stephen McAleese (stephen-mcaleese) on My thoughts on the social response to AI risk · 2023-11-02T22:56:04.560Z · LW · GW

Although AI progress is occurring gradually right now where regulation can keep up, I do think a hard takeoff is still a possibility.

My understanding is that fast recursive self-improvement occurs once there is a closed loop of fully autonomous self-improving AI. AI is not capable enough for that yet and most of the important aspects of AI research are still done by humans but it could become a possibility in the future once AI agents are advanced and reliable enough.

In the future before an intelligence explosion, there could be a lot of regulation and awareness of AI relative to today. But if there's a fast takeoff, regulation would be unable to keep up with AI progress.

Comment by Stephen McAleese (stephen-mcaleese) on Intelligence Enhancement (Monthly Thread) 13 Oct 2023 · 2023-10-13T21:17:14.540Z · LW · GW

Recently I learned that the negative effect of sleep deprivation on cognitive performance seems to accumulate over several days. Five days of insufficient sleep can lower cognitive performance by up to 15 IQ points according to this source.

Comment by Stephen McAleese (stephen-mcaleese) on What's your standard for good work performance? · 2023-09-27T20:05:21.713Z · LW · GW

I personally use Toggl to track how much time I spend working per day. I usually aim for at least four hours of focused work per day.

Comment by Stephen McAleese (stephen-mcaleese) on There should be more AI safety orgs · 2023-09-22T12:12:34.949Z · LW · GW

Thanks for the post! I think it does a good job of describing key challenges in AI field-building and funding.

The talent gap section describes a lack of positions in industry organizations and independent research groups such as SERI MATS. However, there doesn't seem to be much content on the state of academic AI safety research groups. So I'd like to emphasize the current and potential importance of academia for doing AI safety research and absorbing talent. The 80,000 Hours AI risk page says that there are several academic groups working on AI safety including the Algorithmic Alignment Group at MIT, CHAI in Berkeley, the NYU Alignment Research Group, and David Krueger's group in Cambridge.

The AI field as a whole is already much larger than the AI safety field, so I think analyzing the AI field is useful from a field-building perspective. For example, about 60,000 researchers attended AI conferences worldwide in 2022. There's an excellent report on the state of AI research called Measuring Trends in Artificial Intelligence. The report says that about 75% of AI publications come from the 'education' sector, which is probably mostly universities, and the rest are published by non-profits, industry, and governments. Surprisingly, the top 9 institutions by annual AI publication count are all Chinese universities, with MIT in 10th place, though the US and industry are still far ahead in 'significant' or state-of-the-art ML systems such as PaLM and GPT-4.

What about the demographics of AI conference attendees? At NeurIPS 2021, the top institutions by publication count were Google, Stanford, MIT, CMU, UC Berkeley, and Microsoft which shows that both industry and academia play a large role in publishing papers at AI conferences.

Another way to get an idea of where people work in the AI field is to find out where AI PhD students go after graduating in the US. The number of AI PhD students going to industry jobs has increased over the past several years and 65% of PhD students now go into industry but 28% still go into academic jobs.

Only a few academic groups seem to be working on AI safety, and many of them are at highly selective universities, but AI safety could become more popular in academia in the near future. And if the breakdown of contributions and demographics in AI safety turns out to be like AI in general, then we should expect academia to play a major role in AI safety in the future. Long-term AI safety may actually be more academic than AI in general, since universities are the largest contributor to basic research whereas industry is the largest contributor to applied research.

So in addition to founding an industry org or facilitating independent research, another path to field-building is to increase the representation of AI safety in academia by founding a new research group, though this path may only be tractable for professors.

Comment by Stephen McAleese (stephen-mcaleese) on AI romantic partners will harm society if they go unregulated · 2023-09-06T12:58:05.654Z · LW · GW

Thanks for the post. It's great that people are discussing some of the less-frequently discussed potential impacts of AI.

I think a good example to bring up here is video games which seem to have similar risks. 

When you think about it, video games seem just as compelling as AI romantic partners. Many video games such as Call of Duty, Civilization, or League of Legends involve achieving virtual goals, leveling up, and improving skills in a way that's often more fulfilling than real life. Realistic 3D video games have been widespread since the 2000s but I don't think they have negatively impacted society all that much. Though some articles claim that video games are having a significant negative effect on young men.

Personally, I spent quite a lot of time playing video games during my childhood and teenage years but mostly stopped playing them once I went to college. But why replace an easy and fun way to achieve things with reality, which is usually less rewarding and more frustrating? My answer is that achievements in reality are usually much more real, persistent, and valuable than achievements in video games. You can achieve a lot in video games, but it's unlikely that those achievements will raise your status in the eyes of as many people, over as long a period of time, as achievements in real life can.

A relevant quote from the article I linked above:

"After a while I realized that becoming master of a fake world was not worth the dozens of hours a month it was costing me, and with profound regret I stashed my floppy disk of “Civilization” in a box and pushed it deep into my closet. I hope I never get addicted to anything like “Civilization” again."

Similarly, AI romantic partners could be competitive with real relationships in the short term, but I doubt it will be possible to have AI relationships that are as fulfilling and realistic as a marriage that lasts several decades.

And as with video games, status will probably favour real relationships, causing people to value them because they offer more status than virtual ones. One reason is that status depends on scarcity: just as being a real billionaire offers much more status than being a virtual one, having a real high-quality romantic partner will probably yield much more status than a virtual one, and as a result, people will be motivated to have real partners.

Comment by Stephen McAleese (stephen-mcaleese) on Could We Automate AI Alignment Research? · 2023-08-28T21:58:14.049Z · LW · GW

Some related posts on automating alignment research I discovered recently:

Comment by Stephen McAleese (stephen-mcaleese) on Could We Automate AI Alignment Research? · 2023-08-28T21:42:27.552Z · LW · GW

I agree that the difficulty of the alignment problem can be thought of as a diagonal line on the 2D chart above as you described.

This model may make having two axes instead of one unnecessary. If capabilities and alignment scale together predictably, then high alignment difficulty is associated with high capabilities, and therefore the capabilities axis could be unnecessary.

But I think there's value in having two axes. Another way to think about your AI alignment difficulty scale is like a vertical line in the 2D chart: for a given level of AI capability (e.g. pivotal AGI), there is uncertainty about how hard it would be to align such an AGI because the gradient of the diagonal line intersecting the vertical line is uncertain.

Instead of a single diagonal line, I now think the 2D model describes alignment difficulty in terms of the gradient of the line. An optimistic scenario is one where AI capabilities are scaled and few additional alignment problems arise or existing alignment problems do not become more severe because more capable AIs naturally follow human instructions and learn complex values. A highly optimistic possibility is that increased capabilities and alignment are almost perfectly correlated and arbitrarily capable AIs are no more difficult to align than current systems. Easy worlds correspond to lines in the 2D chart with low gradients and low-gradient lines intersect the vertical line corresponding to the 1D scale at a low point.

A pessimistic scenario can be represented in the chart as a steep line where alignment problems rapidly crop up as capabilities are increased. For example, in such hard worlds, increased capabilities could make deception and self-preservation much more likely to arise in AIs. Problems like goal misgeneralization might persist or worsen even in highly capable systems. Therefore, in hard worlds, AI alignment difficulty increases rapidly with capabilities and increased capabilities do not have helpful side effects such as the formation of natural abstractions that could curtail the increasing difficulty of the AI alignment problem. In hard worlds, since AI capability gains cause a rapid increase in alignment difficulty, the only way to ensure that alignment research keeps up with the rapidly increasing difficulty of the alignment problem is to limit progress in AI capabilities.

Comment by stephen-mcaleese on [deleted post] 2023-08-21T17:32:48.243Z

What you're describing above sounds like an aligned AI and I agree that convergence to the best-possible values over time seems like something an aligned AI would do.

But I think you're mixing up intelligence and values. Sure, maybe an ASI would converge on useful concepts in a way similar to humans. For example, AlphaZero rediscovered some human chess concepts. But because of the orthogonality thesis, intelligence and goals are more or less independent: you can increase the intelligence of a system without its goals changing.

The classic thought experiment illustrating this is Bostrom's paperclip maximizer which continues to value only paperclips even when it becomes superintelligent.

Also, I don't think neuromorphic AI would reliably lead to an aligned AI. Maybe an exact whole-brain emulation of some benevolent human would be aligned but otherwise, a neuromorphic AI could have a wide variety of possible goals and most of them wouldn't be aligned.

I suggest reading The Superintelligent Will to understand these concepts better.

Comment by Stephen McAleese (stephen-mcaleese) on Try to solve the hard parts of the alignment problem · 2023-08-19T14:45:05.158Z · LW · GW

If you don’t know where you’re going, it’s not helpful enough not to go somewhere that’s definitely not where you want to end up; you have to differentiate paths towards the destination from all other paths, or you fail.

I'm not exactly sure what you meant here but I don't think this claim is true in the case of RLHF because, in RLHF, labelers only need to choose which option is better or worse between two possibilities, and these choices are then used to train the reward model. A binary feedback style was chosen specifically because it's usually too difficult for labelers to choose between multiple options.

A similar idea is comparison sorting where the algorithms only need the ability to compare two numbers at a time to sort a list of numbers.
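For concreteness, here is a minimal sketch of the standard pairwise (Bradley-Terry-style) loss typically used to train a reward model from those binary choices; variable names are illustrative:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise loss: push the reward of the labeler-preferred response above the other.

    reward_chosen / reward_rejected: scalar rewards the model assigns to the
    preferred and dispreferred response in each comparison (shape [batch]).
    """
    # -log sigmoid(r_chosen - r_rejected): minimized when the chosen response outranks the rejected one.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```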

Comment by Stephen McAleese (stephen-mcaleese) on Could We Automate AI Alignment Research? · 2023-08-19T14:31:11.271Z · LW · GW

Thanks for the comment.

I think there's a possibility that there could be dangerous emergent dynamics from multiple interacting AIs but I'm not too worried about that problem because I don't think you can increase the capabilities of an AI much simply by running multiple copies of it. You can do more work this way but I don't think you can get qualitatively much better work.

OpenAI created GPT-4 by training a brand new model not by running multiple copies of GPT-3 together. Similarly, although human corporations can achieve more than a single person, I don't consider them to be superintelligent. I'd say GPT-4 is more capable and dangerous than 10 copies of GPT-3.

I think there's more evidence that emergent properties come from within the AI model itself and therefore I'm more worried about bigger models than problems that would occur from running many of them. If we could solve a task using multiple AIs rather than one highly capable AI, I think that would probably be safer and I think that's part of the idea behind iterated amplification and distillation.

There's value in running multiple AIs. For example, OpenAI used multiple AIs to summarize books recursively. But even if we don't run multiple AI models, I think a single AI running at high speed would also be highly valuable. For example, you can paste a long text into GPT-4 today and it will summarize it in less than a minute.

Comment by Stephen McAleese (stephen-mcaleese) on Against Almost Every Theory of Impact of Interpretability · 2023-08-18T10:25:44.565Z · LW · GW

In my opinion, much of the value of interpretability is not related to AI alignment but to AI capabilities evaluations instead.

For example, the Othello paper shows that a transformer trained on the next-word prediction of Othello moves learns a world model of the board rather than just statistics of the training text. This knowledge is useful because it suggests that transformer language models are more capable than they might initially seem.

Comment by Stephen McAleese (stephen-mcaleese) on AGI is easier than robotaxis · 2023-08-13T20:22:28.861Z · LW · GW

I highly recommend this interview with Yann LeCun which describes his view on self-driving cars and AGI.

Basically, he thinks that self-driving cars are possible with today's AI but would require immense amounts of engineering (e.g. hard-wired behavior for corner cases) because today's AI (e.g. CNNs) tends to be brittle and lacks an understanding of the world.

My understanding is that Yann thinks we basically need AGI to solve autonomous driving in a reliable and satisfying way because the car would need to understand the world like a human to drive reliably.

Comment by Stephen McAleese (stephen-mcaleese) on Could We Automate AI Alignment Research? · 2023-08-12T13:51:12.375Z · LW · GW

I think it's probably true today that LLMs are better at doing subtasks than doing work over long time horizons.

On the other hand, I think human researchers are also quite capable of incrementally advancing their own research agendas and there's a possibility that AI could be much more creative than humans at seeing patterns in a huge corpus of input text and coming up with new ideas from that.

Comment by Stephen McAleese (stephen-mcaleese) on What's A "Market"? · 2023-08-12T13:46:33.980Z · LW · GW

Thanks for the post. I like how it gives several examples and then aims to find what's in common between them.

Recently I've been thinking that research can be seen as a kind of market where researchers specialize in research they have a comparative advantage in and trade insights by publishing and reading other researchers' work.

Comment by Stephen McAleese (stephen-mcaleese) on Inflection.ai is a major AGI lab · 2023-08-10T13:57:42.505Z · LW · GW

10,000 teraFLOPS

Each H100 will be closer to 1,000 teraFLOPs or less. For reference, the A100 generally produces 150 teraFLOPs in real-world systems.

Comment by Stephen McAleese (stephen-mcaleese) on Alignment Grantmaking is Funding-Limited Right Now · 2023-08-07T13:28:57.553Z · LW · GW

Today I received a rejection email from Oliver Habryka on behalf of Lightspeed Grants. It says Lightspeed Grants received 600 applicants, ~$150M in default funding requests, and ~$350M in maximum funding requests. Since the original amount to be distributed was $5M, only ~3% of applications could be funded.

I knew they received more applicants than they could fund but I'm surprised by how much was requested and how large the gap was.

Comment by Stephen McAleese (stephen-mcaleese) on The Control Problem: Unsolved or Unsolvable? · 2023-08-02T20:25:50.561Z · LW · GW

I meant that I see most humans as aligned with human values such as happiness and avoiding suffering. The point I'm trying to make is that human minds are able to represent these concepts internally and act on them in a robust way and therefore it seems possible in principle that AIs could too.

I'm not sure whether humans are aligned with evolution. Many humans do want children but I don't think many are fitness maximizers who want as many children as possible.

Comment by Stephen McAleese (stephen-mcaleese) on The Control Problem: Unsolved or Unsolvable? · 2023-08-02T19:31:07.231Z · LW · GW

I think AI alignment is solvable for the same reason AGI is solvable: humans are an existence-proof for both alignment and general intelligence.

Comment by Stephen McAleese (stephen-mcaleese) on AXRP Episode 24 - Superalignment with Jan Leike · 2023-08-01T10:33:29.319Z · LW · GW

My summary of the podcast

Introduction

The superalignment team is OpenAI’s new team, co-led by Jan Leike and Ilya Sutskever, for solving alignment for superintelligent AI. One of their goals is to create a roughly human-level automated alignment researcher. The idea is that creating an automated alignment researcher is a kind of minimum viable alignment project where aligning it is much easier than aligning a more advanced superintelligent sovereign-style system. If the automated alignment researcher creates a solution to aligning superintelligence, OpenAI can use that solution for aligning superintelligence.

The automated alignment researcher

The automated alignment researcher is expected to be used for running and evaluating ML experiments, suggesting research directions, and helping with explaining conceptual ideas. The automated researcher needs two components: a model capable enough to do alignment research, which will probably be some kind of advanced language model, and alignment for the model. Initially, they’ll probably start with relatively weak systems aligned using relatively weak methods like RLHF, and then scale up both in a bootstrapping approach where the model improves its own alignment and can then be scaled further as it becomes more aligned. The end goal is to be able to convert compute into alignment research so that alignment research can be accelerated drastically. For example, if 99% of tasks were automated, research could be ~100 times faster.
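The ~100x figure follows from a simple back-of-the-envelope calculation (my own framing, assuming the automated 99% of tasks take negligible time):

```python
def research_speedup(fraction_automated: float) -> float:
    # If the automated fraction takes negligible time, total time is dominated
    # by the remaining human-only fraction of tasks.
    return 1 / (1 - fraction_automated)

print(research_speedup(0.99))  # 100.0
```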

The superalignment team

The automated alignment researcher will be built by the superalignment team which currently has 20 people and could have 30 people by the end of the year. OpenAI also plans to allocate 20% of its compute to the superalignment team. Apart from creating the automated alignment researcher, the superalignment team will continue doing research on feedback-based approaches and scalable oversight. Jan emphasizes that the superalignment team will still be needed even if there is an automated alignment researcher because he wants to keep humans in the loop. He also wants to avoid the risk of creating models that seek power, self-improve, deceive human overseers, or exfiltrate (escape).

Why Jan is optimistic

Jan is generally optimistic about the plan succeeding and estimates that it has an ~80% chance of succeeding even though Manifold only gives the project a 22% chance of success. He gives 5 reasons for being optimistic:

  • LLMs understand human intentions and morality much better than other kinds of agents such as RL game-playing agents. For example, often you can simply ask them to behave a certain way.
  • Seeing how well RLHF worked. For example, training agents to play Atari games from human feedback works almost as well as training on the true reward signal. RLHF-aligned LLMs are much more aligned than base models.
  • It’s possible to iterate and improve alignment solutions using experiments and randomized controlled trials.
  • Evaluating research is easier than generating it.
  • The last reason is a bet on language models. Jan thinks many alignment tasks can be formulated as text-in-text-out tasks.

Controversial statements/criticisms

Jan criticizing interpretability research:

I think interpretability is neither necessary, nor sufficient. I think there is a good chance that we could solve alignment purely behaviorally without actually understanding the models internally. And I think, also, it’s not sufficient where: if you solved interpretability, I don’t really have a good story of how that would solve superintelligence alignment, but I also think that any amount of non-trivial insight we can gain from interpretability will be super useful or could potentially be super useful because it gives us an avenue of attack.

Jan criticizing alignment theory research:

I think there’s actually a lot more scope for theory work than people are currently doing. And so I think for example, scalable oversight is actually a domain where you can do meaningful theory work, and you can say non-trivial things. I think generalization is probably also something where you can say… formally using math, you can make statements about what’s going on (although I think in a somewhat more limited sense). And I think historically there’s been a whole bunch of theory work in the alignment community, but very little was actually targeted at the empirical approaches we tend to be really excited [about] now. And it’s also a lot of… Theoretical work is generally hard because you have to… you’re usually either in the regime where it’s too hard to say anything meaningful, or the result requires a bunch of assumptions that don’t hold in practice. But I would love to see more people just try … And then at the very least, they’ll be good at evaluating the automated alignment researcher trying to do it.

Ideas for complementary research

Jan also gives some ideas that would be complementary to OpenAI’s alignment research agenda:

  • Creating mechanisms for eliciting values from society.
  • Solving current problems like hallucinations, jailbreaking, or mode-collapse (repetition) in RLHF-trained models.
  • Improving model evaluations and evals to measure capabilities and alignment.
  • Reward model interpretability.
  • Figuring out how models generalize. For example, figuring out how to generalize alignment for easy-to-supervised tasks to hard tasks.

Comment by Stephen McAleese (stephen-mcaleese) on Open Problems and Fundamental Limitations of RLHF · 2023-07-31T21:29:34.779Z · LW · GW

Thanks for writing the paper! I think it will be really impactful and I think it fills a big gap in the literature.

I've always wondered what problems RLHF has, but I've mostly seen only short, informal answers about how it incentivizes deception or how humans can't provide a scalable signal for superhuman tasks, which is odd given that it's one of the most commonly used AI alignment methods.

Before your paper, I think this post was the most in-depth analysis of problems with RLHF I've seen so I think your paper is now probably the best resource for problems with RLHF. Apart from that post, the List of Lethalities post has a few related sections and this post by John Wentworth has a section on RLHF.

I'm sure your paper will spark future research on improving RLHF because it lists several specific discrete problems that could be tackled!

Comment by Stephen McAleese (stephen-mcaleese) on Alignment Grantmaking is Funding-Limited Right Now · 2023-07-28T14:42:02.375Z · LW · GW

At first, I predicted you were going to say that public funding would accelerate capabilities research over alignment but it seems like the gist of your argument is that lots of public funding would muddy the water and sharply reduce the average quality of alignment research.

That might be true for theoretical AI alignment research but I'd imagine it's less of a problem for types of AI alignment research that have decent feedback loops like interpretability research and other kinds of empirical research like experiments on RL agents.

One reason that I'm skeptical is that there doesn't seem to be a similar problem in the field of ML which is huge and largely publicly funded to the best of my knowledge and still makes good progress. Possible reasons why the ML field is still effective despite its size include sufficient empirical feedback loops and the fact that top conferences reject most papers (~25% is a typical acceptance rate for papers at NeurIPS).

Comment by Stephen McAleese (stephen-mcaleese) on Think carefully before calling RL policies "agents" · 2023-07-27T20:41:07.642Z · LW · GW

According to the LessWrong concepts page for agents, an agent is an entity that perceives its environment and takes actions to maximize its utility.

Comment by Stephen McAleese (stephen-mcaleese) on Alignment Grantmaking is Funding-Limited Right Now · 2023-07-20T23:10:32.257Z · LW · GW

Context of the post: funding overhang

The post was written in 2021 and argued that there was a funding overhang in longtermist causes (e.g. AI safety) because the amount of funding had grown faster than the number of people working.

The amount of committed capital increased by ~37% per year and the amount of deployed funds increased by ~21% per year since 2015 whereas the number of engaged EAs only grew ~14% per year.

The introduction of the FTX Future Fund around 2022 caused a major increase in longtermist funding which further increased the funding overhang.

Benjamin linked a Twitter update in August 2022 saying that the total committed capital was down by half because of a stock market and crypto crash. Then FTX went bankrupt a few months later.

The current situation

The FTX Future Fund no longer exists and Open Phil AI safety spending seems to have been mostly flat for the past 2 years. The post mentions that Open Phil is doing this to evaluate impact and increase capacity before possibly scaling more.

My understanding (based on this spreadsheet) is that the current level of AI safety funding has been roughly the same for the past 2 years whereas the number of AI safety organizations and researchers has been increasing by ~15% and ~30% per year respectively. So the funding overhang could be gone by now or there could even be a funding underhang.

Comparing talent vs funding

The post compares talent and funding in two ways:

  • The lifetime value of a researcher (e.g. $5 million) vs total committed funding (e.g. $1 billion)
  • The annual cost of a researcher (e.g. $100k) vs annual deployed funding (e.g. $100 million)

A funding overhang occurs when the total committed funding is greater than the lifetime value of all the researchers or the annual amount of funding that could be deployed per year is greater than the annual cost of all researchers.

Then the post says:

“Personally, if given the choice between finding an extra person for one of these roles who’s a good fit or someone donating $X million per year, to think the two options were similarly valuable, X would typically need to be over three, and often over 10 (where this hugely depends on fit and the circumstances).”

I forgot to mention that this statement was applied to leadership roles like research leads, entrepreneurs, and grantmakers who can deploy large amounts of funds or have a large impact and therefore can have a large amount of value. Ordinary employees probably have less financial value.

Assuming there is no funding overhang in AI safety anymore, the marginal value of funding over more researchers is higher today than it was when the post was written.

The future

If total AI safety funding does not increase much in the near term, AI safety could continue to be funding-constrained or become more funding constrained as the number of people interested in working on AI safety increases.

However, the post explains some arguments for expecting EA funding to increase:

  • There’s some evidence that Open Philanthropy plans to scale up its spending over the next several years. For example, this post says, “We gave away over $400 million in 2021. We aim to double that number this year, and triple it by 2025”. Though the post was written in 2022 so it could be overoptimistic.
  • According to Metaculus, there is a ~50% chance of another Good Ventures / Open Philanthropy-sized fund being created by 2026 which could substantially increase funding for AI safety.

My mildly optimistic guess is that as AI safety becomes more mainstream there will be a symmetrical effect where both more talent and funding are attracted to the field.

Comment by Stephen McAleese (stephen-mcaleese) on Alignment Grantmaking is Funding-Limited Right Now · 2023-07-20T13:02:22.940Z · LW · GW

I have a similar story. I left my job at Amazon this year because there were layoffs there. Also, the release of GPT-4 in March made working on AI safety seem more urgent.

Comment by Stephen McAleese (stephen-mcaleese) on Alignment Grantmaking is Funding-Limited Right Now · 2023-07-20T12:11:45.204Z · LW · GW

In this 80,000 Hours post (written in 2021), Benjamin Todd says "I’d typically prefer someone in these roles to an additional person donating $400,000–$4 million per year (again, with huge variance depending on fit)." This seems like an argument against earning to give for most people.

On the other hand, this post emphasizes the value of small donors.
 

Comment by Stephen McAleese (stephen-mcaleese) on Alignment Grantmaking is Funding-Limited Right Now · 2023-07-19T22:02:11.481Z · LW · GW

That seems like a better split and there are outliers of course. But I think orgs are more likely to be well-known to grant-makers on average given that they tend to have a higher research output, more marketing, and the ability to organize events. An individual is like an organization with one employee.

Comment by Stephen McAleese (stephen-mcaleese) on Alignment Grantmaking is Funding-Limited Right Now · 2023-07-19T21:44:32.593Z · LW · GW

Based on what I've written here, my verdict is that AI safety seems more funding constrained for small projects and individuals than it is for organizations for the following reasons:
- The funds that fund smaller projects such as LTFF tend to have less money than other funds such as Open Phil which seems to be more focused on making larger grants to organizations (Open Phil spends 14x more per year on AI safety).
- Funding could be constrained by the throughput of grant-makers (the number of grants they can make per year). This seems to put funds like LTFF at a disadvantage since they tend to make a larger number of smaller grants so they are more constrained by throughput than the total amount of money available. Low throughput incentivizes making a small number of large grants which favors large existing organizations over smaller projects or individuals.
- Individuals or small projects tend to be less well-known than organizations so grants for them can be harder to evaluate or might be more likely to be rejected. On the other hand, smaller grants are less risky.
- The demand for funding for individuals or small projects seems like it could increase much faster than it could for organizations because new organizations take time to be created (though maybe organizations can be quickly scaled).

Some possible solutions:
- Move more money to smaller funds that tend to make smaller grants. For example, LTFF could ask for more money from Open Phil.
- Hire more grant evaluators or hire full-time grant evaluators so that there is a higher ceiling on the total number of grants that can be made per year.
- Demonstrate that smaller projects or individuals can be as effective as organizations to increase trust.
- Seek more funding: half of LTFF's funds come from direct donations so they could seek more direct donations.
- Existing organizations could hire more individuals rather than the individuals seeking funding themselves.
- Individuals (e.g. independent researchers) could form organizations to reduce the administrative load on grant-makers and increase their credibility.

Comment by Stephen McAleese (stephen-mcaleese) on Alignment Grantmaking is Funding-Limited Right Now · 2023-07-19T20:22:54.473Z · LW · GW

This sounds more or less correct to me. Open Philanthropy (Open Phil) is the largest AI safety grant maker and spent over $70 million on AI safety grants in 2022 whereas LTFF only spent ~$5 million. In 2022, the median Open Phil AI safety grant was $239k whereas the median LTFF AI safety grant was only $19k in 2022.

Open Phil and LTFF made 53 and 135 AI safety grants respectively in 2022. This means the average Open Phil AI safety grant in 2022 was ~$1.3 million whereas the average LTFF AI safety grant was only $38k. So the average Open Phil AI safety grant is ~30 times larger than the average LTFF grant.
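A quick check of those averages using the approximate figures quoted above:

```python
open_phil_total, open_phil_grants = 70_000_000, 53   # ~$70M across 53 grants (2022)
ltff_total, ltff_grants = 5_000_000, 135             # ~$5M across 135 grants (2022)

avg_open_phil = open_phil_total / open_phil_grants   # ~$1.3 million
avg_ltff = ltff_total / ltff_grants                   # ~$37k
print(avg_open_phil / avg_ltff)                       # ~35, i.e. roughly 30x larger
```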

These calculations imply that Open Phil and LTFF make a similar number of grants (LTFF actually makes more) and that Open Phil spends much more simply because its grants tend to be much larger (~30x larger). So it seems like funds may be more constrained by their ability to evaluate and fulfill grants rather than having a lack of funding. This is not surprising given that the LTFF grantmakers apparently work part-time.

Counterintuitively, it may be easier for an organization (e.g. Redwood Research) to get a $1 million grant from Open Phil than it is for an individual to get a $10k grant from LTFF. The reason is that both grants probably require a similar amount of administrative effort, and a well-known organization is more likely to be trusted to use the money well than an individual, so the decision is easier to make. This example illustrates how decision-making and grant-making processes are probably just as important as the total amount of money available.

LTFF specifically could be funding-constrained though given that it only spends ~$5 million per year on AI safety grants. Since ~40% of LTFF's funding comes from Open Phil and Open Phil has much more money than LTFF, one solution is for LTFF to simply ask for more money from Open Phil.

I don't know why Open Phil spends so much more on AI safety than LTFF (~14x more). Maybe it's simply because of some administrative hurdles that LTFF has when requesting money from Open Phil or maybe Open Phil would rather make grants directly.

Here is a spreadsheet comparing how much Open Phil, LTFF, and the Survival and Flourishing Fund (SFF) spend on AI safety per year.

Comment by Stephen McAleese (stephen-mcaleese) on Alignment Grantmaking is Funding-Limited Right Now · 2023-07-19T19:21:36.888Z · LW · GW

Plug: I recently published a long post on the EA Forum on AI safety funding: An Overview of the AI Safety Funding Situation.

Comment by Stephen McAleese (stephen-mcaleese) on Alignment Grantmaking is Funding-Limited Right Now · 2023-07-19T19:11:21.032Z · LW · GW

I created a Guesstimate model that estimates that $2-3 million (range: $25k - $25 million) in high-quality grants could have been requested for the Lightspeed grant ($5 million was available).

Comment by Stephen McAleese (stephen-mcaleese) on Alignment Megaprojects: You're Not Even Trying to Have Ideas · 2023-07-13T11:57:48.772Z · LW · GW

Thanks for the post. I think it's a valuable exercise to think about how AI safety could be accelerated with unlimited money.

I think the Manhattan Project idea is interesting but I see some problems with the analogy:

  • The Manhattan Project was originally a military project and to this day, the military is primarily funded and managed by the government. But most progress in AI today is made by companies such as OpenAI and Google and universities like the University of Toronto. I think a more relevant project is CERN because it's more recent and focused on the non-military development of science.
  • The Manhattan Project happened a long time ago and the world has changed a lot since then. The wealth and influence of tech companies and universities is probably much greater today than it was then.
  • It's not obvious that a highly centralized effort is needed. The Alignment Forum, open source developers, and the academic research community (e.g. the ML research community) are examples of decentralized research communities that seem to be highly effective at making progress. This probably wasn't possible in the past because the internet didn't exist.

I highly doubt that it's possible to recreate the Bay Area culture in a top-down way. I'm pretty sure China has tried this and I don't think they've succeeded.

Also, I think your description overemphasizes the importance of geniuses like Von Neumann because 130,000 other people worked on the Manhattan Project too. Something similar happens at Google today: Jeff Dean is revered, but in reality most progress at Google is made by its tens of thousands of smart-but-not-genius "dark matter" developers.

Anyway, let's assume that we have a giant AI alignment project that would cost billions. To fund this, we could:

  1. Expand EA funding substantially using community building.
  2. Ask the government to fund the project.

The government has a lot of money but it seems challenging to convince the government to fund AI alignment compared to getting funding from EA. So maybe some EAs with government expertise could work with the government to increase AI safety investment.

If the AI safety project gets EA funding, I think it needs to be cost-effective. The reality is that only ~12% of Open Phil's money is spent on AI safety. The reason is that there is a triage situation with other cause areas like biosecurity, farm animal welfare, and global health and development, so the goal is to find cost-effective ways to spend money on AI safety. The project needs to be competitive and have more value on the margin than other proposals.

In my opinion, the government projects that are most likely to succeed are those that build on or are similar to recent successful projects and are in the Overton window. For example:

My guess is that leveraging academia would be effective and scalable because you can build on the pre-existing talent, leadership, culture, and infrastructure. Alternatively, governments could create new regulations or laws to influence the behavior of companies (e.g. GDPR). Or they could found new think tanks or research institutes possibly in collaboration with universities or companies.

As for the school ideas, I've heard that Lee Sedol went to a Go school and, as you mentioned, Soviet chess was fueled by Soviet chess programs. China has intensive sports schools, but I doubt these kinds of schools would be considered acceptable in Western countries, which is an important consideration given that most AI safety work happens in Western countries like the US and UK.

In science fiction, there are even more extreme programs like the Spartan program in Halo where children were kidnapped and turned into super soldiers, or Star Wars where clone soldiers were grown and trained in special facilities.

I don't think these kinds of extreme programs would work. Advanced technologies like human cloning could take decades to develop and are illegal in many countries. Also, they sound highly unethical which is a major barrier to their success in modern developed countries like the US and especially EA-adjacent communities like AI safety.

I think a more realistic idea is something like the Atlas Fellowship or SERI MATS which are voluntary programs for aspiring researchers in their teens or twenties.

The geniuses I know of that were trained from an early age in Western-style countries are Mozart (music), Von Neumann (math), John Stuart Mill (philosophy), and Judit Polgár (chess). In all these cases, they were gifted children who lived in normal nuclear families and had ambitious parents and extra tutoring.

Comment by Stephen McAleese (stephen-mcaleese) on The virtue of determination · 2023-07-11T09:58:37.160Z · LW · GW

Thanks for writing this. I thought it was really enlightening.

I think longtermism and the idea that we could influence a vast future is a really interesting and important idea. As you said, future people will probably see us as both incredibly incompetent and influential. Maybe the most rational response to this situation is to be determined to make the future go well.

I also thought it was really insightful how you mentioned that discovering past truths must have been hard. We take science and technology for granted but much of it is the result of cumulative hard work over time by scientists and engineers.

Comment by Stephen McAleese (stephen-mcaleese) on Lessons On How To Get Things Right On The First Try · 2023-07-07T10:34:33.625Z · LW · GW

Some lessons from the exercise:
- All models and beliefs are wrong to some extent and the best way to find out how wrong they are is to put them to the test in experiments.  The map is not the territory.
- Be careful when applying existing knowledge and models to new problems or using analogies. The problem might require fresh thinking, new concepts, or a new paradigm.
- It's good to have a lot of different people working on a problem because each person can attack the problem in their own unique way from a different angle.  A lot of people may fail but some could succeed.
- Don't flinch away from anomalies.  A lot of scientific progress has resulted from changing models to account for anomalies (see The Structure of Scientific Revolutions).

Comment by Stephen McAleese (stephen-mcaleese) on When do "brains beat brawn" in Chess? An experiment · 2023-07-02T18:39:06.216Z · LW · GW

Thanks for the post! It was a good read. One point I don't think was brought up is the fact that chess is turn-based whereas real life is continuous.

Consequently, the huge speed advantage that AIs have is not that useful in chess because the AI still has to wait for you to make a move before it can move.

But since real life is continuous, if the AI is much faster than you, it could make 1000 'moves' for every move you make and therefore speed is a much bigger advantage in real life.

Comment by Stephen McAleese (stephen-mcaleese) on Lightcone Infrastructure/LessWrong is looking for funding · 2023-06-16T13:55:15.967Z · LW · GW

I like LessWrong and I visit the site quite often. I would be willing to pay a monthly subscription for the site especially if the subscription included extra features.

Maybe LessWrong could raise money in a similar way to Twitter: by offering a paid premium version.

A Fermi estimate of how much revenue LessWrong could generate:

  • As far as I know, the site gets about 100,000 monthly visitors.
  • If 10,000 of them signed up for the premium subscription at $10 per month, then LessWrong could generate $100,000 in revenue per month or $1.2 million per year which would cover 20-40% of the $3-6 million funding gap mentioned.
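The arithmetic behind that estimate, spelled out (all inputs are the rough guesses above):

```python
monthly_visitors = 100_000
subscribers = 10_000          # assume ~10% convert to a paid plan
price_per_month = 10          # USD

monthly_revenue = subscribers * price_per_month                 # $100,000 per month
annual_revenue = monthly_revenue * 12                           # $1.2 million per year
print(annual_revenue / 6_000_000, annual_revenue / 3_000_000)   # covers ~20-40% of a $3-6M gap
```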

Comment by Stephen McAleese (stephen-mcaleese) on GPT-4 Predictions · 2023-06-16T09:50:47.957Z · LW · GW

I used the estimate from a document named What's In My AI? which estimates that the GPT-2 training dataset contains 15B tokens.

A quick way to estimate the total number of training tokens is to multiply the training dataset size in bytes by the number of tokens per byte, which is typically about 0.25 according to the Pile paper. So 40 GB ≈ 40 billion bytes, and 40B x 0.25 = 10 billion tokens.
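Spelled out as a rough rule-of-thumb calculation (not an exact tokenizer count):

```python
dataset_gb = 40                       # WebText (GPT-2's training set) is roughly 40 GB
bytes_total = dataset_gb * 1e9        # ~40 billion bytes
tokens_per_byte = 0.25                # rule of thumb from the Pile paper
print(bytes_total * tokens_per_byte)  # ~10 billion tokens
```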

Comment by Stephen McAleese (stephen-mcaleese) on Why I'm Not (Yet) A Full-Time Technical Alignment Researcher · 2023-05-28T12:51:15.053Z · LW · GW

For context, I have a very similar background to you - I'm a software engineer with a computer science degree interested in working on AI alignment.

LTFF granted about $10 million last year. Even if all that money were spent on independent AI alignment researchers, if each researcher costs $100k per year, then there would only be enough money to fund about 100 researchers in the world per year so I don't see LTFF as a scalable solution.

Unlike software engineering,  AI alignment research tends to be neglected and underfunded because it's not an activity that can easily be made profitable. That's one reason why there are far more software engineers than AI alignment researchers.

Work that is unprofitable but beneficial such as basic science research has traditionally been done by university researchers who, to the best of my knowledge, are mainly funded by government grants.

I have also considered becoming independently wealthy to work on AI alignment in the past but that strategy seems too slow if AGI will be created relatively soon.

So my plan is to apply for jobs at organizations like Redwood Research or apply for funding from LTFF and if those plans fail, I will consider getting a PhD and getting funding from the government instead which seems more scalable.