Posts

Geoffrey Hinton on the Past, Present, and Future of AI 2024-10-12T16:41:56.796Z
Could We Automate AI Alignment Research? 2023-08-10T12:17:05.194Z
An Overview of the AI Safety Funding Situation 2023-07-12T14:54:36.732Z
Retrospective on ‘GPT-4 Predictions’ After the Release of GPT-4 2023-03-17T18:34:17.178Z
GPT-4 Predictions 2023-02-17T23:20:24.696Z
Stephen McAleese's Shortform 2023-01-08T21:46:25.888Z
AGI as a Black Swan Event 2022-12-04T23:00:53.802Z
Estimating the Current and Future Number of AI Safety Researchers 2022-09-28T21:11:33.703Z
How Do AI Timelines Affect Existential Risk? 2022-08-29T16:57:44.107Z
Summary of "AGI Ruin: A List of Lethalities" 2022-06-10T22:35:48.500Z

Comments

Comment by Stephen McAleese (stephen-mcaleese) on o1: A Technical Primer · 2024-12-17T18:09:55.040Z · LW · GW

Here is a recent blog post by Hugging Face explaining how to make an o1-like model using open weights models like Llama 3.1.

Comment by Stephen McAleese (stephen-mcaleese) on O O's Shortform · 2024-12-08T10:44:53.199Z · LW · GW

Why? o1 is much more capable than GPT-4o at math, programming, and science.

Comment by Stephen McAleese (stephen-mcaleese) on Stephen McAleese's Shortform · 2024-12-08T10:42:48.950Z · LW · GW

Here's an argument for why current alignment methods like RLHF are already much better than what evolution can do.

Evolution has to encode information about the human brain's reward function using just 1 GB of genetic information, which means it might be relying on a lot of simple heuristics that don't generalize well, like "sweet foods are good".

In contrast, RLHF reward models are built from LLMs with around 25B[1] parameters, which is ~100 GB of information. The capacity of these reward models to encode complex human values may therefore already be much larger than that of the human genome (~2 orders of magnitude), and this advantage will probably increase in the future as models get larger.
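
A rough back-of-the-envelope version of this comparison (the bytes-per-parameter and genome figures below are illustrative assumptions, not exact values):

```python
# Crude capacity comparison between the human genome and an RLHF reward model.
# Assumptions: ~3.1e9 base pairs at 2 bits each; 25B parameters at 4 bytes (fp32) each.

genome_gb = 3.1e9 * 2 / 8 / 1e9            # ~0.8 GB of genetic information
reward_model_gb = 25e9 * 4 / 1e9           # ~100 GB of parameters

print(f"Genome: ~{genome_gb:.1f} GB")
print(f"Reward model: ~{reward_model_gb:.0f} GB")
print(f"Ratio: ~{reward_model_gb / genome_gb:.0f}x")   # roughly two orders of magnitude
```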

  1. ^
Comment by Stephen McAleese (stephen-mcaleese) on johnswentworth's Shortform · 2024-12-06T21:33:14.706Z · LW · GW

One thing I've noticed is that current models like Claude 3.5 Sonnet can now generate non-trivial 100-line programs like small games that work in one shot and don't have any syntax or logical errors. I don't think that was possible with earlier models like GPT-3.5.

Comment by Stephen McAleese (stephen-mcaleese) on (The) Lightcone is nothing without its people: LW + Lighthaven's big fundraiser · 2024-11-30T11:33:07.120Z · LW · GW

I donated $100, roughly equivalent to my yearly spending on Twitter/X Premium, because I believe LessWrong offers similar value. I would encourage most readers to do the same.

Update: I've now donated $500 in total for philanthropic reasons.

Comment by Stephen McAleese (stephen-mcaleese) on You should consider applying to PhDs (soon!) · 2024-11-30T11:16:34.297Z · LW · GW

If you're interested in doing a PhD in AI in the UK, I recommend applying for the Centres for Doctoral Training (CDTs) in AI such as:

Note that these programs are competitive so the acceptance rate is ~10%.

Comment by Stephen McAleese (stephen-mcaleese) on Evaluating the historical value misspecification argument · 2024-11-13T10:02:28.193Z · LW · GW

I agree. I don't see a clear distinction between what's in the model's predictive model and what's in the model's preferences. Here is a line from the paper "Learning to summarize from human feedback":

"To train our reward models, we start from a supervised baseline, as described above, then add a randomly initialized linear head that outputs a scalar value. We train this model to predict which summary y ∈ {y0, y1} is better as judged by a human, given a post x."

Since the reward model is initialized using the pretrained language model, it should contain everything the pretrained language model knows.
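
A minimal sketch of that setup (the backbone interface and sizes are assumptions for illustration, not the paper's exact configuration): initialize the reward model from a pretrained LM, add a randomly initialized scalar head, and train on pairwise comparisons.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Pretrained LM backbone plus a randomly initialized scalar head."""
    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone                      # pretrained language model
        self.value_head = nn.Linear(hidden_size, 1)   # randomly initialized linear head

    def forward(self, input_ids):
        hidden = self.backbone(input_ids)             # assumed shape: (batch, seq, hidden)
        return self.value_head(hidden[:, -1]).squeeze(-1)  # scalar reward from last token

def preference_loss(reward_chosen, reward_rejected):
    # Maximize the probability that the human-preferred summary gets the higher reward.
    return -torch.log(torch.sigmoid(reward_chosen - reward_rejected)).mean()
```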

Comment by Stephen McAleese (stephen-mcaleese) on An Introduction to Representation Engineering - an activation-based paradigm for controlling LLMs · 2024-11-13T09:28:27.013Z · LW · GW

I strong upvoted as well. This post is thorough and unbiased and seems like one of the best resources for learning about representation engineering.

Comment by Stephen McAleese (stephen-mcaleese) on When is reward ever the optimization target? · 2024-10-19T16:06:04.117Z · LW · GW

I'll use the definition of optimization from Wikipedia: "Mathematical optimization is the selection of a best element, with regard to some criteria, from some set of available alternatives".

Best-of-n or rejection sampling is an alternative to RLHF which involves generating n responses from an LLM and returning the one with the highest reward model score. I think it's reasonable to describe this process as optimizing for reward because it's searching for LLM outputs that achieve the highest reward from the reward model.
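
A minimal sketch of best-of-n sampling (the `generate` and `reward_model` callables below are placeholders for whatever LLM and reward model you use):

```python
def best_of_n(prompt, generate, reward_model, n=16):
    """Generate n candidate responses and return the one with the highest reward model score."""
    candidates = [generate(prompt) for _ in range(n)]
    scores = [reward_model(prompt, c) for c in candidates]
    best_index = max(range(n), key=lambda i: scores[i])
    return candidates[best_index]
```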

I'd also argue that AlphaGo/AlphaZero is optimizing for reward. In the AlphaGo paper it says, "At each time step t of each simulation, an action a_t is selected from state s_t so as to maximize action value plus a bonus" and the formula is a_t = argmax_a (Q(s_t, a) + u(s_t, a)), where u(s_t, a) is an exploration bonus.

Action values Q are calculated as the mean value (estimated probability of winning) of all board states in the subtree below an action. The value of each possible future board state is calculated using a combination of a value function estimation for that state and the mean outcome of dozens of random rollouts until the end of the game (return +1 or -1 depending on who wins).

The value function predicts the return (expected sum of future reward) from a position, whereas the random rollouts calculate the actual average reward by simulating future moves until the end of the game, when the reward function returns +1 or -1.

So I think AlphaZero is optimizing for a combination of predicted reward (from the value function) and actual reward which is calculated using multiple rollouts until the end of the game.
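
As a rough illustration of that selection rule (a sketch of the PUCT-style bonus used in the AlphaGo family; the exact form of the exploration term differs slightly between AlphaGo, AlphaGo Zero, and AlphaZero):

```python
import math

def select_action(children, c_puct=1.0):
    """Pick the action maximizing Q(s, a) + u(s, a), where the exploration bonus u(s, a)
    grows with the prior P(s, a) and shrinks with the visit count N(s, a)."""
    total_visits = sum(child["N"] for child in children.values())
    def score(child):
        u = c_puct * child["P"] * math.sqrt(total_visits) / (1 + child["N"])
        return child["Q"] + u
    return max(children, key=lambda action: score(children[action]))
```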

Comment by Stephen McAleese (stephen-mcaleese) on Geoffrey Hinton on the Past, Present, and Future of AI · 2024-10-14T21:47:06.432Z · LW · GW

SummaryBot summary from the EA Forum:

Executive summary: Geoffrey Hinton, a pioneer in AI, discusses the history and current state of neural networks, and warns about potential existential risks from superintelligent AI while suggesting ways to mitigate these risks.

Key points:

  1. Neural networks, initially unpopular, became dominant in AI due to increased computational power and data availability.
  2. Hinton argues that large language models (LLMs) truly understand language, similar to how the human brain processes information.
  3. Digital neural networks have advantages over biological ones, including easier information sharing and potentially superior learning algorithms.
  4. Hinton believes there's a 50% chance AI will surpass human intelligence within 20 years, with a 10-20% risk of causing human extinction.
  5. To mitigate risks, Hinton suggests government-mandated AI safety research and international cooperation.
  6. Two possible future scenarios: AI takeover leading to human extinction, or humans successfully coexisting with superintelligent AI assistants.
Comment by Stephen McAleese (stephen-mcaleese) on Geoffrey Hinton on the Past, Present, and Future of AI · 2024-10-14T08:16:49.405Z · LW · GW

Maybe. The analogy he gives is that the AI could be like a very intelligent personal assistant to a relatively dumb CEO. The CEO is still in charge but it makes sense to delegate a lot of tasks to the more competent assistant.

The parent-and-child outcome seems a bit worse than that because a small child is usually completely dependent on their parent, and all their resources are controlled by the parent unless they have pocket money or something like that.

Comment by Stephen McAleese (stephen-mcaleese) on Geoffrey Hinton on the Past, Present, and Future of AI · 2024-10-13T08:11:36.757Z · LW · GW

It's an original LessWrong post by me. Though all the quotes and references are from external sources.

Comment by Stephen McAleese (stephen-mcaleese) on Open Thread Fall 2024 · 2024-10-12T18:48:31.598Z · LW · GW

There's a rule of thumb on the internet called the "1% rule": about 1% of users contribute content to a forum while the other 99% only read it.

Comment by Stephen McAleese (stephen-mcaleese) on How difficult is AI Alignment? · 2024-09-18T08:20:02.443Z · LW · GW

Thank you for the insightful comment.

On the graph of alignment difficulty and cost, I think the shape depends on the inherent increase in alignment cost and the degree of automation we can expect, which is similar to the idea of the offence-defence balance.

In the worst case, the cost of implementing alignment solutions increases exponentially with alignment difficulty and then maybe automation would lower it to a linear increase.

In the best case, automation covers all of the costs associated with increasing alignment difficulty, the graph is flat in terms of human effort, and more advanced alignment solutions aren't any harder to implement than earlier, simpler ones.

Comment by Stephen McAleese (stephen-mcaleese) on Stephen McAleese's Shortform · 2024-09-14T19:52:28.164Z · LW · GW

The rate of progress on the MATH dataset is incredible and faster than I expected.

The MATH dataset consists of competition math problems for high school students and was introduced in 2021. According to a blog post by Jacob Steinhardt (one of the dataset's authors), 2021 models such as GPT-3 solved ~7% of questions, a Berkeley PhD student solved ~75%, and an IMO gold medalist solved ~90%.

The blog post predicted that ML models would achieve ~50% accuracy on the MATH dataset on June 30, 2025 and ~80% accuracy by 2028.

But recently (September 2024), OpenAI released their new o1 model which achieved ~95% on the MATH dataset.

So it seems like we're getting 2028 performance on the MATH dataset already in 2024.

Quote from the blog post:

"If I imagine an ML system getting more than half of these questions right, I would be pretty impressed. If they got 80% right, I would be super-impressed. The forecasts themselves predict accelerating progress through 2025 (21% in 2023, then 31% in 2024 and 52% in 2025), so 80% by 2028 or so is consistent with the predicted trend. This still just seems wild to me and I'm really curious how the forecasters are reasoning about this."

Comment by Stephen McAleese (stephen-mcaleese) on How difficult is AI Alignment? · 2024-09-14T16:43:02.699Z · LW · GW

Thank you for writing this insightful and thorough post on different AI alignment difficulties and possible probability distributions over alignment difficulty levels.

The cost of advancing alignment research rises faster at higher difficulty levels: much more effort and investment is required to produce the same amount of progress towards adequacy at level 7 than at level 3. This cost increases for several reasons. Most obviously, more resources, time, and effort are required to develop and implement these more sophisticated alignment techniques. But there are other reasons, such as that higher level failures cannot yet be experimentally demonstrated, so developing mitigations for them has to rely on (possibly unrepresentative) toy models instead of reacting to the failures of current systems.

Note that although implementing better alignment solutions would probably be more costly, advancements in AI capabilities could flatten the cost curve by automating some of the work. For example, constitutional AI seems significantly more complex than regular RLHF, but it might not be much harder for organizations to implement due to partial automation (e.g. RLAIF). So even if future alignment techniques are much more complex than today's, they might not be significantly harder to implement (in terms of human effort) due to increased automation and AI involvement.

Comment by Stephen McAleese (stephen-mcaleese) on Solving adversarial attacks in computer vision as a baby version of general AI alignment · 2024-08-31T11:51:36.890Z · LW · GW

Nice paper! I found reading it quite insightful. Here are some key extracts from the paper:

Improving adversarial robustness by classifying several down-sampled noisy images at once:

"Drawing inspiration from biology [eye saccades], we use multiple versions of the same image at once, downsampled to lower resolutions and augmented with stochastic jitter and noise. We train a model to
classify this channel-wise stack of images simultaneously. We show that this by default yields gains in adversarial robustness without any explicit adversarial training."
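
A rough sketch of that input pipeline (the resolutions, jitter, and noise scale below are illustrative guesses rather than the paper's exact settings): several downsampled, jittered, noisy copies of an image are stacked along the channel dimension and classified together.

```python
import torch
import torch.nn.functional as F

def multi_resolution_stack(image, resolutions=(32, 16, 8), noise_std=0.1, max_jitter=2):
    """image: (C, H, W) tensor. Returns a (C * len(resolutions), H, W) stack of
    downsampled copies with random jitter and Gaussian noise, upsampled back to H x W."""
    _, height, width = image.shape
    copies = []
    for res in resolutions:
        dx, dy = torch.randint(-max_jitter, max_jitter + 1, (2,))
        jittered = torch.roll(image, shifts=(int(dx), int(dy)), dims=(1, 2))   # stochastic jitter
        small = F.interpolate(jittered.unsqueeze(0), size=(res, res),
                              mode="bilinear", align_corners=False)            # downsample
        restored = F.interpolate(small, size=(height, width),
                                 mode="bilinear", align_corners=False).squeeze(0)
        copies.append(restored + noise_std * torch.randn_like(restored))        # add noise
    return torch.cat(copies, dim=0)  # channel-wise stack fed to the classifier
```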

Improving adversarial robustness by using an ensemble of intermediate layer predictions:

"Using intermediate layer predictions. We show experimentally that a successful adversarial
attack on a classifier does not fully confuse its intermediate layer features (see Figure 5). An
image of a dog attacked to look like e.g. a car to the classifier still has predominantly dog-like
intermediate layer features. We harness this de-correlation as an active defense by CrossMax
ensembling the predictions of intermediate layers. This allows the network to dynamically
respond to the attack, forcing it to produce consistent attacks over all layers, leading to robustness
and interpretability."

Comment by Stephen McAleese (stephen-mcaleese) on Anna and Oliver discuss Children and X-Risk · 2024-07-25T19:44:02.633Z · LW · GW

I suspect the desire for kids/lineage is really basic for a lot of people (almost everyone?)

This seems like an important point. One of the arguments for the inner alignment problem is that evolution selected humans for inclusive genetic fitness (IGF), but humans were instead motivated by other goals (e.g. seeking sex) that were strongly correlated with IGF in the ancestral environment.

Then when humans' environment changed (e.g. the invention of birth control), the correlation between these proxy goals and IGF broke down resulting in low fitness and inner misalignment.

However, this statement seems to suggest that modern humans really have internalized IGF as one of their primary objectives and that they're inner-aligned with evolution's outer objective.

Comment by Stephen McAleese (stephen-mcaleese) on Leon Lang's Shortform · 2024-07-03T22:15:53.163Z · LW · GW

I think the Zotero PDF reader has a lot of similar features that make the experience of reading papers much better:

  • It has a back button so that when you click on a reference link that takes you to the references section, you can easily click the button to go back to the text.
  • There is a highlight feature so that you can highlight parts of the text which is convenient when you want to come back and skim the paper later.
  • There is a "sticky note" feature allowing you to leave a note in part of the paper to explain something.
Comment by Stephen McAleese (stephen-mcaleese) on Boycott OpenAI · 2024-06-19T22:24:50.808Z · LW · GW

I was thinking of doing this, but the ChatGPT web app has many features that are only available there and add a lot of value, such as Code Interpreter, PDF uploads, DALL-E, and custom GPTs, so I still use ChatGPT Plus.

Comment by Stephen McAleese (stephen-mcaleese) on We might be dropping the ball on Autonomous Replication and Adaptation. · 2024-06-01T07:21:34.314Z · LW · GW

Thank you for the blog post. I thought it was very informative regarding the risk of autonomous replication in AIs.

It seems like the Centre for AI Security is a new organization.

I've seen the announcement post on its website. Maybe it would be a good idea to cross-post it to LessWrong as well.

Comment by Stephen McAleese (stephen-mcaleese) on MIRI 2024 Communications Strategy · 2024-05-30T19:16:37.083Z · LW · GW

Is MIRI still doing technical alignment research as well?

Comment by Stephen McAleese (stephen-mcaleese) on Talent Needs of Technical AI Safety Teams · 2024-05-27T21:15:43.150Z · LW · GW

This is a brilliant post, thanks. I appreciate the breakdown of different types of contributors and how orgs have expressed the need for some types of contributors over others.

Comment by Stephen McAleese (stephen-mcaleese) on An Overview of the AI Safety Funding Situation · 2024-05-26T09:52:33.818Z · LW · GW

Thanks for the table, it provides a good summary of the post's findings. It might also be worthwhile to add it to the EA Forum post as well.

I think the table should include the $10 million in OpenAI Superalignment fast grants as well.

Comment by Stephen McAleese (stephen-mcaleese) on How do you feel about LessWrong these days? [Open feedback thread] · 2024-05-26T09:47:03.379Z · LW · GW

I think there are some great points in this comment, but I think it's overly negative about the LessWrong community. Sure, maybe there is a vocal and influential minority of individuals who are not receptive to or appreciative of your work and related work. But I think a better measure of the overall community's culture than opinions or personal interactions is upvotes and downvotes, which are much more frequent and cheap actions and therefore more representative. For example, your posts such as Reward is not the optimization target have received hundreds of upvotes, so apparently they are positively received.

LessWrong these days is huge, with probably over 100,000 monthly readers, so I think it's challenging to summarize its culture in any particular way (e.g. probably most users on LessWrong live outside the Bay Area and maybe even outside the US). I personally find that LessWrong as a whole is fairly meritocratic and not that dogmatic, and that a wide variety of views are supported provided that they are sufficiently well-argued.

In addition to LessWrong, I use some other related sites such as Twitter, Reddit, and Hacker News and although there may be problems with the discourse on LessWrong, I think it's generally significantly worse on these other sites. Even today, I'm sure you can find people saying things on Twitter about how AIs can't have goals or that wanting paperclips is stupid. These kinds of comments wouldn't be tolerated on LessWrong because they're ignorant and a waste of time. Human nature can be prone to ignorance, rigidness of opinions and so on but I think the LessWrong walled garden has been able to counteract these negative tendencies better than most other sites.

Comment by Stephen McAleese (stephen-mcaleese) on Alexander Gietelink Oldenziel's Shortform · 2024-05-15T18:45:22.499Z · LW · GW

State-of-the-art models such as Gemini aren't LLMs anymore. They are natively multimodal or omni-modal transformer models that can process text, images, speech and video. These models seem to me like a huge jump in capabilities over text-only LLMs like GPT-3.

Comment by Stephen McAleese (stephen-mcaleese) on Catastrophic Goodhart in RL with KL penalty · 2024-05-15T13:48:04.512Z · LW · GW
  • Regularize by a function other than KL divergence. For heavy-tailed error distributions, KL divergence doesn’t work, but capping the maximum odds ratio for any action (similar to quantilizers) still results in positive utility.

A recent paper from UC Berkeley titled Preventing Reward Hacking with Occupancy Measure Regularization proposes replacing KL divergence regularization with occupancy measure (OM) regularization. OM regularization involves regularizing based on the state or state-action distribution rather than the action distribution:

"Our insight is that when reward hacking, the agent visits drastically different states from those reached by the safe policy, causing large deviations in state occupancy measure (OM). Thus, we propose regularizing based on the OM divergence between policies instead of AD [action distribution] divergence to prevent reward hacking"

The idea is that regularizing to minimize changes in the action distribution isn't always safe because small changes in the action distribution can cause large changes in the states visited by the agent:

Suppose we have access to a safe policy that drives slowly and avoids falling off the cliff. However, the car is optimizing a proxy reward function that prioritizes quickly reaching the destination, but not necessarily staying on the road. If we try to regularize the car’s action distributions to the safe policy, we will need to apply heavy regularization, since only slightly increasing the probability of some unsafe action (e.g., making a sharp right turn) can lead to disaster.

...

Our proposal follows naturally from this observation: to avoid reward hacking, regularize based on divergence from the safe policy’s occupancy measure, rather than action distribution.  A policy’s occupancy measure (OM) is the distribution of states or state-action pairs seen by a policy when it interacts with its environment.
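
As a purely illustrative contrast between the two kinds of penalty (this is not the paper's implementation, which I'd expect to estimate OM divergence more cleverly than with raw visit counts):

```python
import numpy as np
from collections import Counter

def action_kl_penalty(policy_probs, safe_probs):
    """Mean KL(policy || safe) over per-state action distributions (dicts: state -> np.array)."""
    kls = [np.sum(p * np.log(p / safe_probs[s])) for s, p in policy_probs.items()]
    return float(np.mean(kls))

def occupancy_measure_penalty(policy_states, safe_states):
    """Total variation distance between empirical state-visit distributions
    (each argument is the list of states visited during rollouts)."""
    p_counts, q_counts = Counter(policy_states), Counter(safe_states)
    states = set(p_counts) | set(q_counts)
    p = np.array([p_counts[s] / len(policy_states) for s in states])
    q = np.array([q_counts[s] / len(safe_states) for s in states])
    return 0.5 * float(np.abs(p - q).sum())
```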

Comment by Stephen McAleese (stephen-mcaleese) on Alexander Gietelink Oldenziel's Shortform · 2024-05-14T10:26:32.495Z · LW · GW

I just asked GPT-4 a GSM8K problem and I agree with your point. I think what's happening is that GPT-4 has been fine-tuned to respond with chain-of-thought reasoning by default, so it's no longer necessary to explicitly ask it to reason step-by-step. Though if you ask it to "respond with just a single number" to eliminate the chain-of-thought reasoning, its problem-solving ability is much worse.

Comment by Stephen McAleese (stephen-mcaleese) on Alexander Gietelink Oldenziel's Shortform · 2024-05-14T08:29:46.613Z · LW · GW

Chain-of-thought prompting makes models much more capable. In the original paper "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models", PaLM 540B with standard prompting only solves 18% of problems but 57% of problems with chain-of-thought prompting.
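
For reference, the difference is just in the prompt: standard few-shot prompting shows bare question-answer exemplars, while chain-of-thought prompting includes intermediate reasoning in the exemplars. The example below is illustrative and not one of the paper's actual prompts.

```python
standard_prompt = """Q: A farmer has 3 pens with 8 sheep each. He sells 5 sheep. How many sheep are left?
A: The answer is 19.

Q: A bus has 12 passengers. At the first stop, 5 get off and 8 get on. How many passengers are on the bus?
A:"""

cot_prompt = """Q: A farmer has 3 pens with 8 sheep each. He sells 5 sheep. How many sheep are left?
A: The farmer starts with 3 * 8 = 24 sheep. After selling 5, he has 24 - 5 = 19 sheep. The answer is 19.

Q: A bus has 12 passengers. At the first stop, 5 get off and 8 get on. How many passengers are on the bus?
A:"""
```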

I expect the use of agent features such as reflection will lead to similar large increases in capabilities as well in the near future.

Comment by Stephen McAleese (stephen-mcaleese) on We are headed into an extreme compute overhang · 2024-04-27T09:25:19.358Z · LW · GW

Currently, groups of LLM agents can collaborate using frameworks such as ChatDev, which simulates a virtual software company using LLM agents with different roles. Though I think human organizations are still more effective for now.  For example, corporations such as Microsoft have over 200,000 employees and can work on multi-year projects. But it's conceivable that in the future there could be virtual companies composed of millions of AIs that can coordinate effectively and can work continuously at superhuman speed for long periods of time.

Comment by Stephen McAleese (stephen-mcaleese) on Estimating the Current and Future Number of AI Safety Researchers · 2024-04-26T09:11:58.500Z · LW · GW

I think I might create a new post using information from this post which covers the new AI alignment landscape.

Comment by Stephen McAleese (stephen-mcaleese) on More people getting into AI safety should do a PhD · 2024-04-14T11:42:52.969Z · LW · GW

I think this section of the post is slightly overstating the opportunity cost of doing a PhD. PhD students typically spend most of their time on research, so ideally they should be doing AI safety research during the PhD (e.g. like Stephen Casper). If the PhD is in an unrelated field or for the sake of upskilling, then there is a more significant opportunity cost relative to working directly for an AI safety organization.

Comment by stephen-mcaleese on [deleted post] 2024-04-13T09:50:29.593Z

Thank you for explaining PPO. In the context of AI alignment, it may be worth understanding in detail because it's the core algorithm at the heart of RLHF. I wonder if any of the specific implementation details of PPO or how it's different from other RL algorithms have implications for AI alignment. To learn more about PPO and RLHF, I recommend reading this paper: Secrets of RLHF in Large Language Models Part I: PPO.

Comment by Stephen McAleese (stephen-mcaleese) on LLMs for Alignment Research: a safety priority? · 2024-04-05T19:08:58.192Z · LW · GW

From reading the codebase, it seems to be a LangChain chatbot powered by the default LangChain OpenAI model which is gpt-3.5-turbo-instruct. The announcement blog post also says it's based on gpt-3.5-turbo.

Comment by Stephen McAleese (stephen-mcaleese) on LLMs for Alignment Research: a safety priority? · 2024-04-05T08:43:00.503Z · LW · GW

LLMs aren't that useful for alignment experts because it's a highly specialized field and there isn't much relevant training data. The AI Safety Chatbot partially solves this problem using retrieval-augmented generation (RAG) on a database of articles from https://aisafety.info. There also seem to be plans to fine-tune it on a dataset of alignment articles.
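
A minimal sketch of the retrieval-augmented generation step (the `embed` and `llm` callables are placeholders; the actual chatbot's pipeline over the aisafety.info articles is more involved):

```python
import numpy as np

def retrieve(query_embedding, article_embeddings, articles, k=3):
    """Return the k articles whose embeddings are most similar (cosine) to the query embedding."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    scores = [cosine(query_embedding, e) for e in article_embeddings]
    top = np.argsort(scores)[-k:][::-1]
    return [articles[i] for i in top]

def answer(question, embed, llm, articles, article_embeddings):
    """Embed the question, retrieve relevant articles, and include them in the LLM prompt."""
    context = "\n\n".join(retrieve(embed(question), article_embeddings, articles))
    prompt = f"Answer the question using the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
    return llm(prompt)
```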

Comment by Stephen McAleese (stephen-mcaleese) on Reward is not the optimization target · 2024-04-04T22:30:41.927Z · LW · GW

OP says that this post is focused on RL policy gradient algorithms (e.g. PPO) where the RL signal is used by gradient descent to update the policy.

But what about Q-learning, which is another popular RL algorithm? My understanding of Q-learning is that the Q-network takes an observation as input, calculates the value (expected return) of each possible action in the current state, and then the agent chooses the action with the highest value.
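
A rough sketch of the contrast (illustrative only, not tied to any particular implementation): in Q-learning the network predicts per-action values and the greedy policy just takes the argmax, whereas a policy gradient method like REINFORCE samples from the policy and nudges log-probabilities in proportion to the return.

```python
import torch

def q_learning_action(q_network, observation):
    """Greedy Q-learning action selection: pick the action with the highest predicted value."""
    q_values = q_network(observation)        # shape: (num_actions,)
    return int(torch.argmax(q_values))

def policy_gradient_loss(policy_network, observation, action, ret):
    """REINFORCE-style loss: increase the log-probability of the taken action in proportion to the return."""
    logits = policy_network(observation)     # shape: (num_actions,)
    log_probs = torch.log_softmax(logits, dim=-1)
    return -log_probs[action] * ret
```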

Does this mean that reward is not the optimization target for policy gradient algorithms but is for Q-learning algorithms?

Comment by Stephen McAleese (stephen-mcaleese) on Modern Transformers are AGI, and Human-Level · 2024-03-31T11:16:04.758Z · LW · GW

I agree. GPT-4 is an AGI for the kinds of tasks I care about such as programming and writing. ChatGPT4 in its current form (with the ability to write and execute code) seems to be at the expert human level in many technical and quantitative subjects such as statistics and programming.

For example, last year I was amazed when I gave ChatGPT4 one of my statistics past exam papers and it got all the questions right except for one which involved interpreting an image of a linear regression graph. The questions typically involve understanding the question, thinking of an appropriate statistical method, and doing calculations to find the right answer. Here's an example question:

Times (in minutes) for a sample of 8 players are presented in Table 1 below. Using an appropriate test at the 5% significance level, investigate whether there is evidence of a decrease in the players’ mean 5k time after the six weeks of training. State clearly your assumptions and conclusions, and report a p-value for your test statistic.

The solution to this question is a paired sample t-test.
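
For instance, a paired t-test in Python on hypothetical before/after 5k times (the numbers below are made up, not from the actual exam):

```python
from scipy import stats

# Hypothetical 5k times (minutes) for 8 players before and after six weeks of training.
before = [24.1, 25.3, 22.8, 26.0, 23.5, 24.9, 25.7, 23.1]
after  = [23.4, 24.8, 22.9, 25.1, 22.8, 24.0, 25.0, 22.6]

# One-sided paired t-test for a decrease in mean time (assumes the differences are roughly normal).
t_stat, p_value = stats.ttest_rel(before, after, alternative="greater")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # reject H0 at the 5% level if p < 0.05
```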

Sure, GPT-4 has probably seen similar questions before, but so have students, since they can practice past papers.

This year, one of my professors designed his optimization assignment to be ChatGPT-proof but I found that it could still solve five out of six questions successfully. The questions involved converting natural language descriptions of optimization problems into mathematical formulations and solving them with a program.

One of the few times I've seen GPT-4 genuinely struggle to do a task is when I asked it to solve a variant of the Zebra Puzzle which is a challenging logical reasoning puzzle that involves updating a table based on limited information and using logical reasoning and a process of elimination to find the correct answer.

Comment by Stephen McAleese (stephen-mcaleese) on Can we get an AI to "do our alignment homework for us"? · 2024-03-01T23:29:18.859Z · LW · GW

I wrote a blog post on whether AI alignment can be automated last year. The key takeaways:

  • There's a chicken-and-egg problem where you need the automated alignment researcher to create the alignment solution but the alignment solution is needed before you can safely create the automated alignment researcher. The solution to this dilemma is an iterative bootstrapping process where the AI's capabilities and alignment iteratively improve each other (a more aligned AI can be made more capable and a more capable AI can create a more aligned AI and so on).
  • Creating the automated alignment researcher only makes sense if it is less capable and general than a full-blown AGI. Otherwise, aligning it is just as hard as aligning AGI.

There's no clear answer to this question because it depends on your definition of "AI alignment" work. Some AI alignment work is already automated today such as generating datasets for evals, RL from AI feedback, and simple coding work. On the other hand, there are probably some AI alignment tasks that are AGI-complete such as deep, cross-domain, and highly creative alignment work.

The idea of the bootstrapping strategy is that as the automated alignment researcher is made more capable, it improves its own alignment strategies, which enables further capability and alignment gains, and so on. So hopefully there is a virtuous feedback loop over time where more and more alignment tasks are automated.

However, this strategy relies on a robust feedback loop which could break down if the AI is deceptive, incorrigible, or undergoes recursive self-improvement and I think these risks increase with higher levels of capability.

I can't find the source but I remember reading somewhere on the MIRI website that MIRI aims to do work that can't easily be automated so Eliezer's pessimism makes sense in light of that information.

Further reading:

Comment by Stephen McAleese (stephen-mcaleese) on Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI · 2024-01-29T22:42:26.800Z · LW · GW

Strong upvote. I think this is an excellent, carefully written, and timely post. Explaining issues that may arise from current alignment methods is urgent and important. It provides a good explanation of the unidentifiability or inner alignment problem that could arise from advanced AI systems trained with current behavioral safety methods. It also highlights the difficulty of making AIs that can automate alignment research, which is part of OpenAI's current plan. I also liked the in-depth description of what advanced science AIs would be capable of, as well as the difficulty of keeping humans in the loop.

Comment by Stephen McAleese (stephen-mcaleese) on Refusal mechanisms: initial experiments with Llama-2-7b-chat · 2024-01-03T17:49:55.009Z · LW · GW

Nice post! The part I found most striking was how you were able to use the mean difference between outputs on harmful and harmless prompts to steer the model into refusing or not. I also like the refusal metric which is simple to calculate but still very informative.

Comment by stephen-mcaleese on [deleted post] 2023-12-22T10:33:36.292Z

TL;DR: Private AI companies such as Anthropic which have revenue-generating products and also invest heavily in AI safety seem like the best type of organization for doing AI safety research today. This is not the best option in an ideal world and maybe not in the future but right now I think it is.

I appreciate the idealism and I'm sure there is some possible universe where shutting down these labs would make sense but I'm quite unsure about whether doing so would actually be net-beneficial in our world and I think there's a good chance it would be net-negative in reality.

The most glaring constraint is finances. AI safety is funding-constrained so this is worth mentioning. Companies like DeepMind and OpenAI spend hundreds of millions of dollars per year on staff and compute and I doubt that would be possible in a non-profit. Most of the non-profits working on AI safety (e.g. Redwood Research) are small with just a handful of people. OpenAI changed their company from a non-profit to a capped for-profit because they realized that being a non-profit would have been insufficient for scaling their company and spending. OpenAI now generates $1 billion in revenue and I think it's pretty implausible that a non-profit could generate that amount of income.

The other alternative apart from for-profit companies and philanthropic donations is government funding. It is true that governments fund a lot of science. For example, the US government funds 40% of basic science research. And a lot of successful big science projects such as CERN and the ITER fusion project seem to be mostly government-funded. However, I would expect a lot of government-funded academic AI safety grants to be wasted by professors skilled at putting "AI safety" in their grant applications so that they can fund whatever they were going to work on anyway. Also, the fact that the US government has secured voluntary commitments from AI labs to build AI safely gives me the impression that governments are either unwilling to work on AI safety themselves or incapable of doing so, and would instead prefer to delegate it to private companies. On the other hand, the UK has a new AI safety institute and a language model task force.

Another key point is research quality. In my opinion, the best AI safety research is done by the big labs. For example, Anthropic created constitutional AI and they also seem to be a leader in interpretability research. I think empirical AI safety work and AI capabilities work involve very similar skills (coding etc.) and therefore it's not surprising that leading AI labs also do the best empirical AI safety work. There are several other reasons for explaining why big AI labs do the best empirical AI safety work. One is talent. Top labs have the money to pay high salaries which attracts top talent. Work in big labs also seems more collaborative than in academia which seems important for large projects. Many top projects have dozens of authors (e.g. the Llama 2 paper). Finally, there is compute. Right now, only big labs have the infrastructure necessary to do experiments on leading models. Doing experiments such as fine-tuning large models requires a lot of money and hardware. For example, this paper by DeepMind on reducing sycophancy apparently involved fine-tuning the 540B PaLM model which is probably not possible for most independent and academic researchers right now and consequently, they usually have to work with smaller models such as Llama-2-7b. However, the UK is investing in some new public AI supercomputers which hopefully will level the playing field somewhat. If you think theoretical work (e.g. agent foundations) is more important than empirical work then big labs have less of an advantage. Though DeepMind is doing some of that too.

Comment by Stephen McAleese (stephen-mcaleese) on Dialogue on the Claim: "OpenAI's Firing of Sam Altman (And Shortly-Subsequent Events) On Net Reduced Existential Risk From AGI" · 2023-11-21T22:58:56.311Z · LW · GW

GPT-4 is the model that has been trained with the most training compute, which suggests that compute is the most important factor for capabilities. If that weren't true, we would see some other company training models with more compute but worse performance, which doesn't seem to be happening.

Comment by Stephen McAleese (stephen-mcaleese) on The other side of the tidal wave · 2023-11-03T22:31:15.088Z · LW · GW

No offense but I sense status quo bias in this post.

If you replace "AI" with "industrial revolution" I don't think the meaning of the text changes much and I expect most people would rather live today than in the Middle Ages.

One thing that might be concerning is that older generations (us in the future) might not have the ability to adapt to a drastically different world in the same way that some old people today struggle to use the internet.

I personally don't expect to be overly nostalgic in the future because I'm not that impressed by the current state of the world: factory farming, the hedonic treadmill, physical and mental illness, wage slavery, aging, and ignorance are all problems that I hope are solved in the future.

Comment by Stephen McAleese (stephen-mcaleese) on My thoughts on the social response to AI risk · 2023-11-02T22:56:04.560Z · LW · GW

Although AI progress is occurring gradually right now where regulation can keep up, I do think a hard takeoff is still a possibility.

My understanding is that fast recursive self-improvement occurs once there is a closed loop of fully autonomous self-improving AI. AI is not capable enough for that yet and most of the important aspects of AI research are still done by humans but it could become a possibility in the future once AI agents are advanced and reliable enough.

In the future before an intelligence explosion, there could be a lot of regulation and awareness of AI relative to today. But if there's a fast takeoff, regulation would be unable to keep up with AI progress.

Comment by Stephen McAleese (stephen-mcaleese) on Intelligence Enhancement (Monthly Thread) 13 Oct 2023 · 2023-10-13T21:17:14.540Z · LW · GW

Recently I learned that the negative effect of sleep deprivation on cognitive performance seems to accumulate over several days. Five days of insufficient sleep can lower cognitive performance by up to 15 IQ points according to this source.

Comment by Stephen McAleese (stephen-mcaleese) on What's your standard for good work performance? · 2023-09-27T20:05:21.713Z · LW · GW

I personally use Toggl to track how much time I spend working per day. I usually aim for at least four hours of focused work per day.

Comment by Stephen McAleese (stephen-mcaleese) on There should be more AI safety orgs · 2023-09-22T12:12:34.949Z · LW · GW

Thanks for the post! I think it does a good job of describing key challenges in AI field-building and funding.

The talent gap section describes a lack of positions in industry organizations and independent research groups such as SERI MATS. However, there doesn't seem to be much content on the state of academic AI safety research groups. So I'd like to emphasize the current and potential importance of academia for doing AI safety research and absorbing talent. The 80,000 Hours AI risk page says that there are several academic groups working on AI safety including the Algorithmic Alignment Group at MIT, CHAI in Berkeley, the NYU Alignment Research Group, and David Krueger's group in Cambridge.

The AI field as a whole is already much larger than the AI safety field, so I think analyzing the AI field is useful from a field-building perspective. For example, about 60,000 researchers attended AI conferences worldwide in 2022. There's an excellent report on the state of AI research called Measuring Trends in Artificial Intelligence. The report says that most AI publications (about 75%) come from the 'education' sector, which is probably mostly universities; the rest are published by non-profits, industry, and governments. Surprisingly, the top 9 institutions by annual AI publication count are all Chinese universities, with MIT in 10th place. Though the US and industry are still far ahead in 'significant' or state-of-the-art ML systems such as PaLM and GPT-4.

What about the demographics of AI conference attendees? At NeurIPS 2021, the top institutions by publication count were Google, Stanford, MIT, CMU, UC Berkeley, and Microsoft which shows that both industry and academia play a large role in publishing papers at AI conferences.

Another way to get an idea of where people work in the AI field is to find out where AI PhD students go after graduating in the US. The number of AI PhD students going to industry jobs has increased over the past several years and 65% of PhD students now go into industry but 28% still go into academic jobs.

Only a few academic groups seem to be working on AI safety and many of the groups working on it are at highly selective universities but AI safety could become more popular in academia in the near future. And if the breakdown of contributions and demographics of AI safety will be like AI in general, then we should expect academia to play a major role in AI safety in the future. Long-term AI safety may actually be more academic than AI since universities are the largest contributor to basic research whereas industry is the largest contributor to applied research.

So in addition to founding an industry org or facilitating independent research, another path to field-building is to increase the representation of AI safety in academia by founding a new research group though this path may only be tractable for professors.

Comment by Stephen McAleese (stephen-mcaleese) on AI romantic partners will harm society if they go unregulated · 2023-09-06T12:58:05.654Z · LW · GW

Thanks for the post. It's great that people are discussing some of the less-frequently discussed potential impacts of AI.

I think a good example to bring up here is video games which seem to have similar risks. 

When you think about it, video games seem just as compelling as AI romantic partners. Many video games such as Call of Duty, Civilization, or League of Legends involve achieving virtual goals, leveling up, and improving skills in a way that's often more fulfilling than real life. Realistic 3D video games have been widespread since the 2000s but I don't think they have negatively impacted society all that much. Though some articles claim that video games are having a significant negative effect on young men.

Personally, I spent quite a lot of time playing video games during my childhood and teenage years, but I mostly stopped playing them once I went to college. But why replace an easy and fun way to achieve things with reality, which is usually less rewarding and more frustrating? My answer is that achievements in reality are usually much more real, persistent, and valuable than achievements in video games. You can achieve a lot in video games, but it's unlikely that you'll achieve goals that raise your status in the eyes of as many people, over as long a period of time, as you can in real life.

A relevant quote from the article I linked above:

"After a while I realized that becoming master of a fake world was not worth the dozens of hours a month it was costing me, and with profound regret I stashed my floppy disk of “Civilization” in a box and pushed it deep into my closet. I hope I never get addicted to anything like “Civilization” again."

Similarly, in the near term at least, AI romantic partners could be competitive with real relationships, but I doubt it will be possible to have AI relationships that are as fulfilling and realistic as a marriage that lasts several decades.

And as with video games, status will probably favour real relationships, causing people to value them because they offer more status than virtual ones. One possible reason is that status depends on scarcity. Just as being a real billionaire offers much more status than being a virtual one, having a real high-quality romantic partner will probably yield much more status than a virtual one, and as a result people will be motivated to have real partners.

Comment by Stephen McAleese (stephen-mcaleese) on Could We Automate AI Alignment Research? · 2023-08-28T21:58:14.049Z · LW · GW

Some related posts on automating alignment research I discovered recently:

Comment by Stephen McAleese (stephen-mcaleese) on Could We Automate AI Alignment Research? · 2023-08-28T21:42:27.552Z · LW · GW

I agree that the difficulty of the alignment problem can be thought of as a diagonal line on the 2D chart above as you described.

This model may make having two axes instead of one unnecessary. If capabilities and alignment scale together predictably, then high alignment difficulty is associated with high capabilities, and therefore the capabilities axis could be unnecessary.

But I think there's value in having two axes. Another way to think about your AI alignment difficulty scale is like a vertical line in the 2D chart: for a given level of AI capability (e.g. pivotal AGI), there is uncertainty about how hard it would be to align such an AGI because the gradient of the diagonal line intersecting the vertical line is uncertain.

Instead of a single diagonal line, I now think the 2D model describes alignment difficulty in terms of the gradient of the line. An optimistic scenario is one where AI capabilities are scaled and few additional alignment problems arise or existing alignment problems do not become more severe because more capable AIs naturally follow human instructions and learn complex values. A highly optimistic possibility is that increased capabilities and alignment are almost perfectly correlated and arbitrarily capable AIs are no more difficult to align than current systems. Easy worlds correspond to lines in the 2D chart with low gradients and low-gradient lines intersect the vertical line corresponding to the 1D scale at a low point.

A pessimistic scenario can be represented in the chart as a steep line where alignment problems rapidly crop up as capabilities are increased. For example, in such hard worlds, increased capabilities could make deception and self-preservation much more likely to arise in AIs. Problems like goal misgeneralization might persist or worsen even in highly capable systems. Therefore, in hard worlds, AI alignment difficulty increases rapidly with capabilities, and increased capabilities do not have helpful side effects such as the formation of natural abstractions that could curtail the increasing difficulty of the AI alignment problem. In hard worlds, since AI capability gains cause a rapid increase in alignment difficulty, the only way to ensure that alignment research keeps up with the rapidly increasing difficulty of the alignment problem is to limit progress in AI capabilities.