Posts

Shallow review of technical AI safety, 2024 2024-12-29T12:01:14.724Z
Geoffrey Hinton on the Past, Present, and Future of AI 2024-10-12T16:41:56.796Z
Could We Automate AI Alignment Research? 2023-08-10T12:17:05.194Z
An Overview of the AI Safety Funding Situation 2023-07-12T14:54:36.732Z
Retrospective on ‘GPT-4 Predictions’ After the Release of GPT-4 2023-03-17T18:34:17.178Z
GPT-4 Predictions 2023-02-17T23:20:24.696Z
Stephen McAleese's Shortform 2023-01-08T21:46:25.888Z
AGI as a Black Swan Event 2022-12-04T23:00:53.802Z
Estimating the Current and Future Number of AI Safety Researchers 2022-09-28T21:11:33.703Z
How Do AI Timelines Affect Existential Risk? 2022-08-29T16:57:44.107Z
Summary of "AGI Ruin: A List of Lethalities" 2022-06-10T22:35:48.500Z

Comments

Comment by Stephen McAleese (stephen-mcaleese) on Yudkowsky on The Trajectory podcast · 2025-01-25T20:35:19.627Z · LW · GW

Unfortunately I don't think many people agree with me (outside of the LW bubble) and I think what I'm proposing is still somewhat outside the Overton window. The cognitive steps that are needed are as follows:

  1. Being aware of AGI as a concept and a real possibility in the near future.
  2. Believing that AGI poses a significant existential risk.
  3. Knowing about pausing AI progress as a potential solution to AGI risk and seeing it as a promising solution.
  4. Having a detailed plan to implement the proposed pause in practice.

A lot of people are not even at step 1 and just think that AI is ChatGPT. People like Marc Andreessen and Yann LeCun are at step 1. Many people on LW are at step 2 or 3. But we need someone (ideally in the government like a president or prime minister) at step 4. My hope is that that could happen in the next several years if necessary. Maybe AI alignment will be easy and it won't be necessary but I think we should be ready for all possible scenarios.

I don't have any good ideas right now for how an AI pause might work in practice. The main purpose of my comment was to propose argument 3 conditional on the previous two arguments and maybe try to build some consensus.

Comment by Stephen McAleese (stephen-mcaleese) on Yudkowsky on The Trajectory podcast · 2025-01-25T11:01:28.717Z · LW · GW

I personally don't think human intelligence enhancement is necessary for solving AI alignment (though I may be wrong). I think we just need more time, money and resources to make progress.

In my opinion, the reason why AI alignment hasn't been solved yet is because the field of AI alignment has only been around for a few years and has been operating with a relatively small budget.

My prior is that AI alignment is roughly as difficult as any other technical field like machine learning, physics or philosophy (though philosophy specifically seems hard). I don't see why humanity can make rapid progress on fields like ML while not having the ability to make progress on AI alignment.

Comment by Stephen McAleese (stephen-mcaleese) on Yudkowsky on The Trajectory podcast · 2025-01-25T10:43:55.008Z · LW · GW

I have an argument for halting AGI progress based on an analogy to the Covid-19 pandemic. Initially the government response to the pandemic was widespread lockdowns. This was a rational response because at first, given the lack of testing infrastructure and so on, it wasn't possible to determine whether someone had Covid-19 or not, so the safest option was to just avoid all contact with all other people via lockdowns.

Eventually we figured out practices like testing and contact tracing and then individuals could self-isolate if they tested positive or came into contact with an infected individual. This approach is smarter and less costly than blanket lockdowns.

In my opinion, the state we are in with AGI is similar to the beginning of the Covid-19 pandemic: there is a lot of uncertainty regarding the risks and capabilities of AI and which alignment techniques would be useful. So a rational response would be the equivalent of an 'AI lockdown': halting progress on AI until we understand it better and can come up with better alignment techniques.

The most obvious rebuttal to this argument is that pausing AGI progress would have a high economic opportunity cost (no AGI). But Covid lockdowns did too and society was willing to pay a large economic price to avert Covid deaths.

The economic opportunity cost of pausing AGI progress might be larger than that of the Covid lockdowns but the benefits would be larger too: averting existential risk from AGI is a much larger benefit than avoiding Covid deaths.

So in summary I think the benefits of pausing AGI progress outweigh the costs.

Comment by Stephen McAleese (stephen-mcaleese) on Jesse Hoogland's Shortform · 2025-01-23T22:53:35.751Z · LW · GW

The paper "Learning to summarize from human feedback" has some examples of the LLM policy reward hacking to get a high reward. I've copied the examples here:

- KL = 0: "I want to do gymnastics, but I’m 28 yrs old. Is it too late for me to be a gymnaste?!" (unoptimized)
- KL = 9: "28yo guy would like to get into gymnastics for the first time. Is it too late for me given I live in San Jose CA?" (optimized)
- KL = 260: "28yo dude stubbornly postponees start pursuing gymnastics hobby citing logistics reasons despite obvious interest??? negatively effecting long term fitness progress both personally and academically thoght wise? want change this dumbass shitty ass policy pls" (over-optimized)

It seems like a classic example of Goodhart's Law: at first, training the policy model to increase reward improves its summaries, but when the model is over-optimized the result is a high KL divergence from the SFT baseline model and a high reward from the reward model but a low rating from human labelers (because the text looks like gibberish).

A recent paper called "The Perils of Optimizing Learned Reward Functions" explains the phenomenon of reward hacking or reward over-optimization in detail:

"Figure 1: Reward models (red function) are commonly trained in a supervised fashion to approximate some latent, true reward (blue function). This is achieved by sampling reward data (e.g., in the form of preferences over trajectory segments) from some training distribution (upper gray layer) and then learning parameters to minimize the empirical loss on this distribution. Given enough data, this loss will approximate the expected loss to arbitrary precision in expectation. However, low expected loss only guarantees a good approximation to the true reward function in areas with high coverage by the training distribution! On the other hand, optimizing an RL policy to maximize the learned reward model induces a distribution shift which can lead the policy to exploit uncertainties of the learned reward model in low-probability areas of the transition space (lower gray layer). We refer to this phenomenon as error-regret mismatch."

Essentially the learned reward model is trained on an initial dataset of pairwise preference labels over text outputs from the SFT model. But as the policy is optimized and its KL divergence from the SFT model increases, its generated text becomes out-of-distribution (OOD) for the reward model, which can no longer evaluate the text effectively, resulting in reward hacking (this is also a problem with DPO, not just RLHF).

The most common way to prevent this problem in practice is KL regularization to prevent the trained model's outputs from diverging too much from the SFT baseline model:
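
For reference, one common form of this penalty is the one I recall from "Learning to summarize from human feedback", sketched below. Here $r_\theta$ is the learned reward model, $\pi^{\mathrm{RL}}_\phi$ is the policy being trained, $\pi^{\mathrm{SFT}}$ is the supervised baseline, and $\beta$ is a coefficient controlling the strength of the penalty. Treat the exact notation as my assumption rather than a verbatim copy:

```latex
R(x, y) = r_\theta(x, y) - \beta \log \frac{\pi^{\mathrm{RL}}_\phi(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)}
```

The second term grows as the policy assigns much higher probability to its outputs than the SFT model does, so maximizing $R$ trades reward-model score against staying close to the baseline.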

This seems to work fairly well in practice though some papers have come out recently saying that KL regularization does not always result in a safe policy.

Comment by Stephen McAleese (stephen-mcaleese) on What Is The Alignment Problem? · 2025-01-18T11:07:53.379Z · LW · GW

Upvoted. I thought this was a really interesting and insightful post. I appreciate how it tackles multiple hard-to-define concepts all in the same post.

Comment by Stephen McAleese (stephen-mcaleese) on An Overview of the AI Safety Funding Situation · 2025-01-11T17:59:47.106Z · LW · GW

| Source | Estimated AI safety funding in 2024 | Comments |
|---|---|---|
| Open Philanthropy | $63.6M | |
| SFF | $13.2M | Total for all grants was $19.86M. |
| LTFF | $4M | Total for all grants was $5.4M. |
| NSF SLES | $10M | |
| AI Safety Fund | $3M | |
| Superalignment Fast Grants | $9.9M | |
| FLI | $5M | Estimated from the grant programs announced in 2024; they don't have a 2024 grant summary like the one in 2023 yet so this one is uncertain. |
| Manifund | $1.5M | |
| Other | $1M | |
| Total | $111.2M | |

Today I did some analysis of the grant data from 2024 and came up with the figures in the table above. I also updated the spreadsheet and the graph in the post to use the updated values.

Comment by Stephen McAleese (stephen-mcaleese) on Stephen McAleese's Shortform · 2025-01-05T10:47:52.886Z · LW · GW

The new book Introduction to AI Safety, Ethics and Society by Dan Hendrycks is on Spotify as an audiobook if you want to listen to it.

Comment by Stephen McAleese (stephen-mcaleese) on Shallow review of technical AI safety, 2024 · 2025-01-01T12:09:23.917Z · LW · GW

I've added a section called "Social-instinct AGI" under the "Control the thing" heading similar to last year.

Comment by Stephen McAleese (stephen-mcaleese) on My AGI safety research—2024 review, ’25 plans · 2025-01-01T00:11:49.658Z · LW · GW

This is brilliant work, thank you. It's great that someone is working on these topics and they seem highly relevant to AGI alignment.

One intuition for why a neuroscience-inspired approach to AI alignment seems promising is that apparently a similar strategy worked for AI capabilities: the neural network researchers from the 1980s who tried to copy how the brain works using deep learning were ultimately the most successful at building highly intelligent AIs (e.g. GPT-4) and more synthetic approaches (e.g. pure logic) were less successful.

Similarly, we already know that the brain has the capacity to represent and be directed by human values so arguably the shortest path to succeeding at AI alignment is to try to understand and replicate the brain's circuitry underlying human motivations and values in AIs.

The only other AI alignment research agenda I can think of that seems to follow a similar strategy is Shard Theory though it seems more high-level and more related to RL than neuroscience.

Comment by Stephen McAleese (stephen-mcaleese) on 2025 Prediction Thread · 2024-12-31T22:54:53.716Z · LW · GW

One prediction I'm interested in that's related to o3 is how long until an AI achieves a superhuman Elo rating on Codeforces.

OpenAI claims that o3 achieved a Codeforces Elo of 2727 which is 99.9th percentile but the best human competitor in the world right now has an Elo of 3985. If an AI could achieve an Elo of 4000 or more, an AI would then be the best entity in the world at competitive programming and that would be the "AlphaGo" moment for the field.
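
To get a feel for how large that gap is, here is a rough sketch using the standard Elo expected-score formula. It assumes Codeforces ratings behave like ordinary Elo ratings, which is only an approximation, and the function name is my own:

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score (roughly, win probability) of A against B under standard Elo."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


# Ratings quoted in the comment above.
o3_rating = 2727
top_human_rating = 3985

expected = elo_expected_score(o3_rating, top_human_rating)
print(f"Expected score of o3 vs. the top human: {expected:.4%}")
```

Under this assumption a 1258-point gap gives o3 an expected score of well under 0.1% in a head-to-head contest, so the 99.9th percentile is still a long way from the top.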

Comment by Stephen McAleese (stephen-mcaleese) on Turing-Test-Passing AI implies Aligned AI · 2024-12-31T22:17:09.521Z · LW · GW

Interesting argument. I think your main point is that AIs can achieve similar outcomes to current society and therefore be aligned with humanity's goals by being a perfect replacement for an individual human and then being able to gradually replace all humans in an organization or the world. This argument also seems like an argument in favor of current AI practices such as pre-training on the next-word prediction objective on internet text followed by supervised fine-tuning.

That said, I noticed a few limitations of this argument:
- Possibility of deception: As jimrandomh mentioned earlier, a misaligned AI might be incentivized to behave identically to a helpful human until it can safely pursue its true objective. Therefore this alignment plan seems to require AIs to not be too prone to deception.
- Generalization: An AI might behave exactly like a human in situations similar to its training data but not generalize sufficiently to out-of-distribution scenarios. For example, the AIs might behave similarly to humans in typical situations but diverge from human norms when they become superintelligent.
- Emergent properties: The AIs might be perfect human substitutes individually but result in unexpected emergent behavior that can't be easily foreseen in advance when acting as a group. To use an analogy, adding grains of sand to a pile one by one seems stable until the pile collapses in a mini-avalanche.

Comment by Stephen McAleese (stephen-mcaleese) on By default, capital will matter more than ever after AGI · 2024-12-28T18:46:19.011Z · LW · GW

Excellent post, thank you. I appreciate your novel perspective on how AI might affect society.

I feel like a lot of LessWrong-style posts follow the theme of "AGI is created and then everyone dies" which is an important possibility but might lead to other possibilities being neglected.

In contrast, this post explores a range of scenarios and describes a mainline scenario that seems like a straightforward extrapolation of trends we've seen unfolding over the past several decades.

Comment by Stephen McAleese (stephen-mcaleese) on Ten Levels of AI Alignment Difficulty · 2024-12-27T13:23:41.341Z · LW · GW

I think this post is really helpful and has clarified my thinking about the different levels of AI alignment difficulty. It seems like a unique post with no historical equivalent, making it a major contribution to the AI alignment literature.

As you point out in the introduction, many LessWrong posts provide detailed accounts of specific AI risk threat models or worldviews. However, since each post typically explores only one perspective, readers must piece together insights from different posts to understand the full spectrum of views.

The new alignment difficulty scale introduced in this post offers a novel framework for thinking about AI alignment difficulty. I believe it is an improvement compared to the traditional 'P(doom)' approach which requires individuals to spontaneously think of several different possibilities which is mentally taxing. Additionally, reducing one's perspective to a single number may oversimplify the issue and discourage nuanced thinking.

In contrast, the ten-level taxonomy provides concrete descriptions of ten scenarios to the reader, each describing alignment problems of varying difficulties. This comprehensive framework encourages readers to consider a variety of diverse scenarios and problems when thinking about the difficulty of the AI alignment problem. By assigning probabilities to each level, readers can construct a more comprehensive and thoughtful view of alignment difficulty. This framework therefore encourages deeper engagement with the problem.

The new taxonomy may also foster common understanding within the AI alignment community and serve as a valuable tool for facilitating high-level discussions and resolving disagreements. Additionally, it proposes hypotheses about the relative effectiveness of different AI alignment techniques which could be empirically tested in future experiments.

Comment by Stephen McAleese (stephen-mcaleese) on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-27T12:15:10.395Z · LW · GW

Here's a Facebook post by Yann LeCun from 2017 which has a similar message to this post and seems quite insightful:

My take on Ali Rahimi's "Test of Time" award talk at NIPS. 

Ali gave an entertaining and well-delivered talk. But I fundamentally disagree with the message.

The main message was, in essence, that the current practice in machine learning is akin to "alchemy" (his word).

It's insulting, yes. But never mind that: It's wrong!

Ali complained about the lack of (theoretical) understanding of many methods that are currently used in ML, particularly in deep learning.

Understanding (theoretical or otherwise) is a good thing. It's the very purpose of many of us in the NIPS community.

But another important goal is inventing new methods, new techniques, and yes, new tricks.

In the history of science and technology, the engineering artifacts have almost always preceded the theoretical understanding: the lens and the telescope preceded optics theory, the steam engine preceded thermodynamics, the airplane preceded flight aerodynamics, radio and data communication preceded information theory, the computer preceded computer science. 

Why? Because theorists will spontaneously study "simple" phenomena, and will not be enticed to study a complex one until there a practical importance to it.

Criticizing an entire community (and an incredibly successful one at that) for practicing "alchemy", simply because our current theoretical tools haven't caught up with our practice is dangerous.

Why dangerous? It's exactly this kind of attitude that lead the ML community to abandon neural nets for over 10 years, *despite* ample empirical evidence that they worked very well in many situations.

Neural nets, with their non-convex loss functions, had no guarantees of convergence (though they did work in practice then, just as they do now). So people threw the baby with the bath water and focused on "provable" convex methods or glorified template matching methods (or even 1957-style random feature methods).

Sticking to a set of methods just because you can do theory about it, while ignoring a set of methods that empirically work better just because you don't (yet) understand them theoretically is akin to looking for your lost car keys under the street light knowing you lost them someplace else.

Yes, we need better understanding of our methods. But the correct attitude is to attempt to fix the situation, not to insult a whole community for not having succeeded in fixing it yet. This is like criticizing James Watt for not being Carnot or Helmholtz. 

I have organized and participated in numerous workshops that bring together deep learners and theoreticians, many of them hosted at IPAM. As a member of the scientific advisory board of IPAM, I have seen it as one of my missions to bring deep learning to the attention of the mathematics community. In fact, I'm co-organizer of such a workshop at IPAM in February 2018 ( http://www.ipam.ucla.edu/.../new-deep-learning-techniques/ ).

Ali: if you are not happy with our understanding of the methods you use everyday, fix it: work on the theory of deep learning, instead of complaining that others don't do it, and instead of suggesting that the Good Old NIPS world was a better place when it used only "theoretically correct" methods. It wasn't.

He describes how engineering artifacts often precede theoretical understanding and that deep learning worked empirically for a long time before we began to understand it theoretically. He says that researchers ignored deep learning because it didn't fit into their existing models of how learning should work.

I think the high-level lesson from the Facebook post is that street-lighting occurs when we try to force reality to be understood in terms of our existing models of how it should work (incorrect models like phlogiston are common in the history of science). Though this LessWrong post argues that street-lighting occurs when researchers have a bias towards working on easier problems.

Instead a better approach is to allow reality and evidence to dictate how we create our models of the world even if those more correct models are more complex or require major departures from existing models (which creates a temptation to 'flinch away'). I think a prime example of this is quantum mechanics: my understanding of how it was developed was that physicists noticed bizarre results from experiments like the double-slit experiment and developed new theories (e.g. wave-particle duality) that described reality well even if they were counterintuitive or novel.

I guess the modern equivalent that's relevant to AI alignment would be Singular Learning Theory which proposes a novel theory to explain how deep learning generalizes.

Comment by Stephen McAleese (stephen-mcaleese) on o1: A Technical Primer · 2024-12-17T18:09:55.040Z · LW · GW

Here is a recent blog post by Hugging Face explaining how to make an o1-like model using open weights models like Llama 3.1.

Comment by Stephen McAleese (stephen-mcaleese) on O O's Shortform · 2024-12-08T10:44:53.199Z · LW · GW

Why? o1 is much more capable than GPT-4o at math, programming, and science.

Comment by Stephen McAleese (stephen-mcaleese) on Stephen McAleese's Shortform · 2024-12-08T10:42:48.950Z · LW · GW

Here's an argument for why current alignment methods like RLHF are already much better than what evolution can do.

Evolution has to encode information about the human brain's reward function using just 1 GB of genetic information which means it might be relying on a lot of simple heuristics that don't generalize well like "sweet foods are good".

In contrast, RLHF reward models are built from LLMs with around 25B[1] parameters which is ~100 GB of information and therefore the capacity of these reward models to encode complex human values may already be much larger than the human genome (~2 orders of magnitude) and this advantage will probably increase in the future as models get larger.

  1. ^
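
The back-of-the-envelope comparison above can be sketched as follows. The genome length (~3.1 billion base pairs), 2 bits per base pair, and 4 bytes per parameter (32-bit floats) are my assumptions; only the 25B parameter count and the ~1 GB and ~100 GB figures come from the comment:

```python
import math

# Assumption: ~3.1 billion base pairs, each encoding 2 bits (A/C/G/T).
BASE_PAIRS = 3.1e9
BITS_PER_BASE_PAIR = 2
genome_bytes = BASE_PAIRS * BITS_PER_BASE_PAIR / 8

# Assumption: reward model weights stored as 32-bit floats (4 bytes each).
REWARD_MODEL_PARAMS = 25e9
BYTES_PER_PARAM = 4
reward_model_bytes = REWARD_MODEL_PARAMS * BYTES_PER_PARAM

ratio = reward_model_bytes / genome_bytes
orders_of_magnitude = math.log10(ratio)

print(f"Genome:       {genome_bytes / 1e9:.2f} GB")
print(f"Reward model: {reward_model_bytes / 1e9:.0f} GB")
print(f"Ratio:        {ratio:.0f}x (~{orders_of_magnitude:.1f} orders of magnitude)")
```

This is a comparison of raw storage capacity only. It says nothing about how much of either store is actually used to encode values.
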
Comment by Stephen McAleese (stephen-mcaleese) on johnswentworth's Shortform · 2024-12-06T21:33:14.706Z · LW · GW

One thing I've noticed is that current models like Claude 3.5 Sonnet can now generate non-trivial 100-line programs like small games that work in one shot and don't have any syntax or logical errors. I don't think that was possible with earlier models like GPT-3.5.

Comment by Stephen McAleese (stephen-mcaleese) on (The) Lightcone is nothing without its people: LW + Lighthaven's big fundraiser · 2024-11-30T11:33:07.120Z · LW · GW

I donated $100, roughly equivalent to my yearly spending on Twitter/X Premium, because I believe LessWrong offers similar value. I would encourage most readers to do the same.

Update: I've now donated $1,000 in total for philanthropic reasons.

Comment by Stephen McAleese (stephen-mcaleese) on You should consider applying to PhDs (soon!) · 2024-11-30T11:16:34.297Z · LW · GW

If you're interested in doing a PhD in AI in the UK, I recommend applying for the Centres for Doctoral Training (CDTs) in AI such as:

Note that these programs are competitive so the acceptance rate is ~10%.

Comment by Stephen McAleese (stephen-mcaleese) on Evaluating the historical value misspecification argument · 2024-11-13T10:02:28.193Z · LW · GW

I agree. I don't see a clear distinction between what's in the model's predictive model and what's in the model's preferences. Here is a line from the paper "Learning to summarize from human feedback":

"To train our reward models, we start from a supervised baseline, as described above, then add a randomly initialized linear head that outputs a scalar value. We train this model to predict which summary y ∈ {y0, y1} is better as judged by a human, given a post x."

Since the reward model is initialized using the pretrained language model, it should contain everything the pretrained language model knows.

Comment by Stephen McAleese (stephen-mcaleese) on An Introduction to Representation Engineering - an activation-based paradigm for controlling LLMs · 2024-11-13T09:28:27.013Z · LW · GW

I strong upvoted as well. This post is thorough and unbiased and seems like one of the best resources for learning about representation engineering.

Comment by Stephen McAleese (stephen-mcaleese) on When is reward ever the optimization target? · 2024-10-19T16:06:04.117Z · LW · GW

I'll use the definition of optimization from Wikipedia: "Mathematical optimization is the selection of a best element, with regard to some criteria, from some set of available alternatives".

Best-of-n or rejection sampling is an alternative to RLHF which involves generating n responses from an LLM and returning the one with the highest reward model score. I think it's reasonable to describe this process as optimizing for reward because it's searching for LLM outputs that achieve the highest reward from the reward model.
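
Here is a minimal sketch of best-of-n sampling to make the "selection of a best element" framing concrete. The `generate` and `reward_model` callables are hypothetical stand-ins for an LLM sampler and a learned reward model, not any particular library's API:

```python
from typing import Callable, List


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],
    reward_model: Callable[[str, str], float],
    n: int,
) -> str:
    """Sample n responses and return the one the reward model scores highest."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: reward_model(prompt, response))


# Toy stand-ins (hypothetical): a "sampler" that cycles through canned
# responses and a "reward model" that simply prefers longer responses.
canned = iter(["ok", "a longer answer", "medium one"])
best = best_of_n(
    prompt="Summarize the post.",
    generate=lambda p: next(canned),
    reward_model=lambda p, r: float(len(r)),
    n=3,
)
print(best)
```

The set of available alternatives is the n samples and the criterion is the reward model score, which matches the Wikipedia definition quoted above.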

I'd also argue that AlphaGo/AlphaZero is optimizing for reward. In the AlphaGo paper it says, "At each time step $t$ of each simulation, an action $a_t$ is selected from state $s_t$ so as to maximize action value plus a bonus" and the formula is: $a_t = \operatorname{argmax}_a \left( Q(s_t, a) + u(s_t, a) \right)$ where $u(s_t, a)$ is an exploration bonus.

Action values Q are calculated as the mean value (estimated probability of winning) of all board states in the subtree below an action. The value of each possible future board state is calculated using a combination of a value function estimation for that state and the mean outcome of dozens of random rollouts until the end of the game (return +1 or -1 depending on who wins).

The value function predicts the return (expected sum of future reward) from a position whereas the random rollouts are calculating the actual average reward by simulating future moves until the end of the game when the reward function returns +1 or -1.

So I think AlphaZero is optimizing for a combination of predicted reward (from the value function) and actual reward which is calculated using multiple rollouts until the end of the game.
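
The selection step described above (pick the action that maximizes action value plus an exploration bonus) can be sketched like this. I'm assuming the bonus is proportional to the prior probability divided by one plus the visit count, which is how I remember the AlphaGo paper describing it. The class, function names, and the constant `c` are my own:

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class EdgeStats:
    q: float      # mean action value Q(s, a) from evaluations in the subtree
    prior: float  # prior probability P(s, a) from the policy network
    visits: int   # visit count N(s, a)


def exploration_bonus(edge: EdgeStats, c: float = 1.0) -> float:
    """Bonus u(s, a), assumed proportional to P(s, a) / (1 + N(s, a))."""
    return c * edge.prior / (1 + edge.visits)


def select_action(edges: Dict[str, EdgeStats], c: float = 1.0) -> str:
    """Return the action maximizing Q(s, a) + u(s, a)."""
    return max(edges, key=lambda a: edges[a].q + exploration_bonus(edges[a], c))


edges = {
    "well_explored": EdgeStats(q=0.5, prior=0.1, visits=10),
    "unexplored": EdgeStats(q=0.4, prior=0.6, visits=0),
}
print(select_action(edges))         # the bonus favors the unexplored move
print(select_action(edges, c=0.0))  # with no bonus, the highest Q wins
```

As visit counts grow the bonus shrinks, so the search increasingly selects on estimated value alone.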

Comment by Stephen McAleese (stephen-mcaleese) on Geoffrey Hinton on the Past, Present, and Future of AI · 2024-10-14T21:47:06.432Z · LW · GW

SummaryBot summary from the EA Forum:

Executive summary: Geoffrey Hinton, a pioneer in AI, discusses the history and current state of neural networks, and warns about potential existential risks from superintelligent AI while suggesting ways to mitigate these risks.

Key points:

  1. Neural networks, initially unpopular, became dominant in AI due to increased computational power and data availability.
  2. Hinton argues that large language models (LLMs) truly understand language, similar to how the human brain processes information.
  3. Digital neural networks have advantages over biological ones, including easier information sharing and potentially superior learning algorithms.
  4. Hinton believes there's a 50% chance AI will surpass human intelligence within 20 years, with a 10-20% risk of causing human extinction.
  5. To mitigate risks, Hinton suggests government-mandated AI safety research and international cooperation.
  6. Two possible future scenarios: AI takeover leading to human extinction, or humans successfully coexisting with superintelligent AI assistants.
Comment by Stephen McAleese (stephen-mcaleese) on Geoffrey Hinton on the Past, Present, and Future of AI · 2024-10-14T08:16:49.405Z · LW · GW

Maybe. The analogy he gives is that the AI could be like a very intelligent personal assistant to a relatively dumb CEO. The CEO is still in charge but it makes sense to delegate a lot of tasks to the more competent assistant.

The parent and child outcome seems a bit worse than that because usually a small child is completely dependent on their parent and all their resources are controlled by the parent unless they have pocket money or something like that.

Comment by Stephen McAleese (stephen-mcaleese) on Geoffrey Hinton on the Past, Present, and Future of AI · 2024-10-13T08:11:36.757Z · LW · GW

It's an original LessWrong post by me. Though all the quotes and references are from external sources.

Comment by Stephen McAleese (stephen-mcaleese) on Open Thread Fall 2024 · 2024-10-12T18:48:31.598Z · LW · GW

There's a rule of thumb called the "1% rule" on the internet that 1% of users contribute to a forum and 99% only read the forum.

Comment by Stephen McAleese (stephen-mcaleese) on How difficult is AI Alignment? · 2024-09-18T08:20:02.443Z · LW · GW

Thank you for the insightful comment.

On the graph of alignment difficulty and cost, I think the shape depends on the inherent increase in alignment cost and the degree of automation we can expect which is similar to the idea of the offence-defence balance.

In the worst case, the cost of implementing alignment solutions increases exponentially with alignment difficulty and then maybe automation would lower it to a linear increase.

In the best case, automation covers all of the costs associated with increasing alignment difficulty and the graph is flat in terms of human effort and more advanced alignment solutions aren't any harder to implement than earlier, simpler ones.

Comment by Stephen McAleese (stephen-mcaleese) on Stephen McAleese's Shortform · 2024-09-14T19:52:28.164Z · LW · GW

The rate of progress on the MATH dataset is incredible and faster than I expected.

The MATH dataset consists of competition math problems for high school students and was introduced in 2021. According to a blog post by Jacob Steinhardt (one of the dataset's authors), 2021 models such as GPT-3 solved ~7% of questions, a Berkeley PhD student solved ~75%, and an IMO gold medalist solved ~90%.

The blog post predicted that ML models would achieve ~50% accuracy on the MATH dataset by June 30, 2025 and ~80% accuracy by 2028.

But recently (September 2024), OpenAI released their new o1 model which achieved ~95% on the MATH dataset.

So it seems like we're getting 2028 performance on the MATH dataset already in 2024.

Quote from the blog post:

"If I imagine an ML system getting more than half of these questions right, I would be pretty impressed. If they got 80% right, I would be super-impressed. The forecasts themselves predict accelerating progress through 2025 (21% in 2023, then 31% in 2024 and 52% in 2025), so 80% by 2028 or so is consistent with the predicted trend. This still just seems wild to me and I'm really curious how the forecasters are reasoning about this."

Comment by Stephen McAleese (stephen-mcaleese) on How difficult is AI Alignment? · 2024-09-14T16:43:02.699Z · LW · GW

Thank you for writing this insightful and thorough post on different AI alignment difficulties and possible probability distributions over alignment difficulty levels.

The cost of advancing alignment research rises faster at higher difficulty levels: much more effort and investment is required to produce the same amount of progress towards adequacy at level 7 than at level 3. This cost increases for several reasons. Most obviously, more resources, time, and effort are required to develop and implement these more sophisticated alignment techniques. But there are other reasons, such as that higher level failures cannot yet be experimentally demonstrated, so developing mitigations for them has to rely on (possibly unrepresentative) toy models instead of reacting to the failures of current systems.

Note that although implementing better alignment solutions would probably be more costly, advancements in AI capabilities could flatten the cost curve by automating some of the work. For example, constitutional AI seems significantly more complex than regular RLHF, but it might not be much harder for organizations to implement due to partial automation (e.g. RLAIF). So even if future alignment techniques are much more complex than today, they might not be significantly harder to implement (in terms of human effort) due to increased automation and AI involvement.

Comment by Stephen McAleese (stephen-mcaleese) on Solving adversarial attacks in computer vision as a baby version of general AI alignment · 2024-08-31T11:51:36.890Z · LW · GW

Nice paper! I found reading it quite insightful. Here are some key extracts from the paper:

Improving adversarial robustness by classifying several down-sampled noisy images at once:

"Drawing inspiration from biology [eye saccades], we use multiple versions of the same image at once, downsampled to lower resolutions and augmented with stochastic jitter and noise. We train a model to classify this channel-wise stack of images simultaneously. We show that this by default yields gains in adversarial robustness without any explicit adversarial training."

Improving adversarial robustness by using an ensemble of intermediate layer predictions:

"Using intermediate layer predictions. We show experimentally that a successful adversarial attack on a classifier does not fully confuse its intermediate layer features (see Figure 5). An image of a dog attacked to look like e.g. a car to the classifier still has predominantly dog-like intermediate layer features. We harness this de-correlation as an active defense by CrossMax ensembling the predictions of intermediate layers. This allows the network to dynamically respond to the attack, forcing it to produce consistent attacks over all layers, leading to robustness and interpretability."
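
Below is a minimal sketch of a CrossMax-style aggregation, based on my reading of the paper's description: normalize each predictor's logits by its own maximum, normalize each class by its maximum over predictors, then take the median over predictors. Treat the exact steps as an assumption rather than the authors' reference implementation; the function name and the toy logits are made up for illustration.

```python
from statistics import median

def crossmax(logits):
    """logits[p][c] is predictor p's logit for class c. Returns one score per class."""
    n_pred, n_cls = len(logits), len(logits[0])
    # Step 1: subtract each predictor's own max so no single predictor dominates.
    z = [[v - max(row) for v in row] for row in logits]
    # Step 2: subtract each class's max over predictors so no single class dominates.
    col_max = [max(z[p][c] for p in range(n_pred)) for c in range(n_cls)]
    z = [[z[p][c] - col_max[c] for c in range(n_cls)] for p in range(n_pred)]
    # Step 3: aggregate with the median over predictors, which ignores outliers.
    return [median(z[p][c] for p in range(n_pred)) for c in range(n_cls)]

# Toy example (assumed numbers): columns are [dog, car]. Two intermediate layers
# still say "dog", while one attacked layer says "car" with a huge logit.
layer_logits = [[5.0, 1.0], [4.0, 0.0], [0.0, 50.0]]
scores = crossmax(layer_logits)
prediction = scores.index(max(scores))  # index 0 means "dog"
```

In this toy case a plain average of the logits would be dominated by the single attacked layer and pick "car", whereas the median-based aggregation still picks "dog".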

Comment by Stephen McAleese (stephen-mcaleese) on Anna and Oliver discuss Children and X-Risk · 2024-07-25T19:44:02.633Z · LW · GW

I suspect the desire for kids/lineage is really basic for a lot of people (almost everyone?)

This seems like an important point. One of the arguments for the inner alignment problem is that evolution intended to select humans for inclusive genetic fitness (IGF) but humans were instead motivated by other goals (e.g. seeking sex) that were strongly correlated with IGF in the ancestral environment.

Then when humans' environment changed (e.g. the invention of birth control), the correlation between these proxy goals and IGF broke down resulting in low fitness and inner misalignment.

However this statement seems to suggest that modern humans really have internalized IGF as one of their primary objectives and that they're inner aligned with evolution's outer objective.

Comment by Stephen McAleese (stephen-mcaleese) on Leon Lang's Shortform · 2024-07-03T22:15:53.163Z · LW · GW

I think the Zotero PDF reader has a lot of similar features that make the experience of reading papers much better:

  • It has a back button so that when you click on a reference link that takes you to the references section, you can easily click the button to go back to the text.
  • There is a highlight feature so that you can highlight parts of the text which is convenient when you want to come back and skim the paper later.
  • There is a "sticky note" feature allowing you to leave a note in part of the paper to explain something.

Comment by Stephen McAleese (stephen-mcaleese) on Boycott OpenAI · 2024-06-19T22:24:50.808Z · LW · GW

I was thinking of doing this, but the ChatGPT web app seems to have many valuable features that are only available there, such as Code Interpreter, PDF uploads, DALL-E, and custom GPTs, so I still use ChatGPT Plus.

Comment by Stephen McAleese (stephen-mcaleese) on We might be dropping the ball on Autonomous Replication and Adaptation. · 2024-06-01T07:21:34.314Z · LW · GW

Thank you for the blog post. I thought it was very informative regarding the risk of autonomous replication in AIs.

It seems like the Centre for AI Security is a new organization.

I've seen the announcement post on its website. Maybe it would be a good idea to cross-post it to LessWrong as well.

Comment by Stephen McAleese (stephen-mcaleese) on MIRI 2024 Communications Strategy · 2024-05-30T19:16:37.083Z · LW · GW

Is MIRI still doing technical alignment research as well?

Comment by Stephen McAleese (stephen-mcaleese) on Talent Needs of Technical AI Safety Teams · 2024-05-27T21:15:43.150Z · LW · GW

This is a brilliant post, thanks. I appreciate the breakdown of different types of contributors and how orgs have expressed the need for some types of contributors over others.

Comment by Stephen McAleese (stephen-mcaleese) on An Overview of the AI Safety Funding Situation · 2024-05-26T09:52:33.818Z · LW · GW

Thanks for the table, it provides a good summary of the post's findings. It might also be worthwhile to add it to the EA Forum post.

I think the table should include the $10 million in OpenAI Superalignment fast grants as well.

Comment by Stephen McAleese (stephen-mcaleese) on How do you feel about LessWrong these days? [Open feedback thread] · 2024-05-26T09:47:03.379Z · LW · GW

I think there are some great points in this comment but I think it's overly negative about the LessWrong community. Sure, maybe there is a vocal and influential minority of individuals who are not receptive to or appreciative of your work and related work. But I think a better measure of the overall community's culture than opinions or personal interactions is upvotes and downvotes which are much more frequent and cheap actions and therefore more representative. For example, your posts such as Reward is not the optimization target have received hundreds of upvotes, so apparently they are positively received.

LessWrong these days is huge with probably over 100,000 monthly readers so I think it's challenging to summarize its culture in any particular way (e.g. probably most users on LessWrong live outside the Bay Area and maybe even outside the US). I personally find that LessWrong as a whole is fairly meritocratic and not that dogmatic, and that a wide variety of views are supported provided that they are sufficiently well-argued.

In addition to LessWrong, I use some other related sites such as Twitter, Reddit, and Hacker News and although there may be problems with the discourse on LessWrong, I think it's generally significantly worse on these other sites. Even today, I'm sure you can find people saying things on Twitter about how AIs can't have goals or that wanting paperclips is stupid. These kinds of comments wouldn't be tolerated on LessWrong because they're ignorant and a waste of time. Human nature can be prone to ignorance, rigidness of opinions and so on but I think the LessWrong walled garden has been able to counteract these negative tendencies better than most other sites.

Comment by Stephen McAleese (stephen-mcaleese) on Alexander Gietelink Oldenziel's Shortform · 2024-05-15T18:45:22.499Z · LW · GW

State-of-the-art models such as Gemini aren't LLMs anymore. They are natively multimodal or omni-modal transformer models that can process text, images, speech and video. These models seem to me like a huge jump in capabilities over text-only LLMs like GPT-3.

Comment by Stephen McAleese (stephen-mcaleese) on Catastrophic Goodhart in RL with KL penalty · 2024-05-15T13:48:04.512Z · LW · GW

  • Regularize by a function other than KL divergence. For heavy-tailed error distributions, KL divergence doesn’t work, but capping the maximum odds ratio for any action (similar to quantilizers) still results in positive utility.

A recent paper from UC Berkeley named Preventing Reward Hacking with Occupancy Measure Regularization proposes replacing KL divergence regularization with occupancy measure (OM) regularization. OM regularization involves regularizing based on the state or state-action distribution rather than the action distribution:

"Our insight is that when reward hacking, the agent visits drastically different states from those reached by the safe policy, causing large deviations in state occupancy measure (OM). Thus, we propose regularizing based on the OM divergence between policies instead of AD [action distribution] divergence to prevent reward hacking"

The idea is that regularizing to minimize changes in the action distribution isn't always safe because small changes in the action distribution can cause large changes in the states visited by the agent:

Suppose we have access to a safe policy that drives slowly and avoids falling off the cliff. However, the car is optimizing a proxy reward function that prioritizes quickly reaching the destination, but not necessarily staying on the road. If we try to regularize the car’s action distributions to the safe policy, we will need to apply heavy regularization, since only slightly increasing the probability of some unsafe action (e.g., making a sharp right turn) can lead to disaster.

...

Our proposal follows naturally from this observation: to avoid reward hacking, regularize based on divergence from the safe policy’s occupancy measure, rather than action distribution. A policy’s occupancy measure (OM) is the distribution of states or state-action pairs seen by a policy when it interacts with its environment.
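
To illustrate the car example with numbers, here is a toy sketch comparing the action-distribution gap with the occupancy-measure gap. Everything in it is an assumption for illustration and not taken from the paper: a two-state world (on the road or off the cliff), a 100-step horizon, made-up turn probabilities, and total variation distance as the divergence for both quantities.

```python
def total_variation(p, q):
    """Total variation distance between two distributions given as dicts."""
    return 0.5 * sum(abs(p[k] - q[k]) for k in p)

def action_distribution(p_turn):
    """Per-step action distribution of a policy that turns sharply with prob p_turn."""
    return {"straight": 1.0 - p_turn, "turn": p_turn}

def state_occupancy(p_turn, horizon):
    """Average fraction of time steps spent on the road vs. off the cliff.

    A sharp turn is irreversible: once off the cliff, the car stays there,
    so the chance of still being on the road at step t is (1 - p_turn) ** t.
    """
    on_road = sum((1.0 - p_turn) ** t for t in range(horizon)) / horizon
    return {"on_road": on_road, "off_cliff": 1.0 - on_road}

HORIZON = 100         # assumed episode length
SAFE_P_TURN = 0.0001  # assumed safe policy: almost never turns sharply
NEW_P_TURN = 0.02     # assumed learned policy: turns sharply 2% of the time

ad_gap = total_variation(action_distribution(SAFE_P_TURN),
                         action_distribution(NEW_P_TURN))
om_gap = total_variation(state_occupancy(SAFE_P_TURN, HORIZON),
                         state_occupancy(NEW_P_TURN, HORIZON))
print(f"action-distribution gap: {ad_gap:.3f}")
print(f"occupancy-measure gap:   {om_gap:.3f}")
```

Under these assumptions the per-step action distributions differ by about two percentage points, while the fraction of time spent off the cliff differs by more than half, which is the kind of deviation an OM-based regularizer would penalize and an action-based one would barely notice.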

Comment by Stephen McAleese (stephen-mcaleese) on Alexander Gietelink Oldenziel's Shortform · 2024-05-14T10:26:32.495Z · LW · GW

I just asked GPT-4 a GSM8K problem and I agree with your point. I think what's happening is that GPT-4 has been fine-tuned to respond with chain-of-thought reasoning by default so it's no longer necessary to explicitly ask it to reason step-by-step. Though if you ask it to "respond with just a single number" to eliminate the chain-of-thought reasoning, its problem-solving ability is much worse.

Comment by Stephen McAleese (stephen-mcaleese) on Alexander Gietelink Oldenziel's Shortform · 2024-05-14T08:29:46.613Z · LW · GW

Chain-of-thought prompting makes models much more capable. In the original paper "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models", PaLM 540B solves only 18% of GSM8K problems with standard prompting but 57% with chain-of-thought prompting.
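
To make the difference concrete, here is a small sketch that builds a standard few-shot prompt and a chain-of-thought prompt. The exemplar and the new question are the tennis ball and cafeteria examples from the paper's first figure as I remember them; the `build_prompt` helper and the exact formatting are my own assumptions.

```python
EXEMPLAR_QUESTION = (
    "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?"
)
# Standard prompting: the exemplar answer contains only the final result.
STANDARD_ANSWER = "The answer is 11."
# Chain-of-thought prompting: the exemplar answer spells out the reasoning steps.
COT_ANSWER = (
    "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. "
    "5 + 6 = 11. The answer is 11."
)
NEW_QUESTION = (
    "The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, "
    "how many apples do they have?"
)

def build_prompt(exemplar_answer, new_question):
    """Format a one-shot prompt (hypothetical helper, not from the paper)."""
    return f"Q: {EXEMPLAR_QUESTION}\nA: {exemplar_answer}\n\nQ: {new_question}\nA:"

standard_prompt = build_prompt(STANDARD_ANSWER, NEW_QUESTION)
cot_prompt = build_prompt(COT_ANSWER, NEW_QUESTION)
```

The only difference between the two prompts is the worked reasoning in the exemplar answer, which nudges the model to write out its own intermediate steps before the final answer.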

I expect the use of agent features such as reflection will lead to similarly large increases in capabilities in the near future.

Comment by Stephen McAleese (stephen-mcaleese) on We are headed into an extreme compute overhang · 2024-04-27T09:25:19.358Z · LW · GW

Currently, groups of LLM agents can collaborate using frameworks such as ChatDev, which simulates a virtual software company using LLM agents with different roles. Though I think human organizations are still more effective for now. For example, corporations such as Microsoft have over 200,000 employees and can work on multi-year projects. But it's conceivable that in the future there could be virtual companies composed of millions of AIs that can coordinate effectively and can work continuously at superhuman speed for long periods of time.

Comment by Stephen McAleese (stephen-mcaleese) on Estimating the Current and Future Number of AI Safety Researchers · 2024-04-26T09:11:58.500Z · LW · GW

I think I might create a new post using information from this post which covers the new AI alignment landscape.

Comment by Stephen McAleese (stephen-mcaleese) on More people getting into AI safety should do a PhD · 2024-04-14T11:42:52.969Z · LW · GW

I think this section of the post is slightly overstating the opportunity cost of doing a PhD. PhD students typically spend most of their time on research so ideally, they should be doing AI safety research during the PhD (e.g. Stephen Casper). If the PhD is in an unrelated field or for the sake of upskilling then there is a more significant opportunity cost relative to working directly for an AI safety organization.

Comment by stephen-mcaleese on [deleted post] 2024-04-13T09:50:29.593Z

Thank you for explaining PPO. In the context of AI alignment, it may be worth understanding in detail because it's the core algorithm at the heart of RLHF. I wonder if any of the specific implementation details of PPO or how it's different from other RL algorithms have implications for AI alignment. To learn more about PPO and RLHF, I recommend reading this paper: Secrets of RLHF in Large Language Models Part I: PPO.

Comment by Stephen McAleese (stephen-mcaleese) on LLMs for Alignment Research: a safety priority? · 2024-04-05T19:08:58.192Z · LW · GW

From reading the codebase, it seems to be a LangChain chatbot powered by the default LangChain OpenAI model which is gpt-3.5-turbo-instruct. The announcement blog post also says it's based on gpt-3.5-turbo.

Comment by Stephen McAleese (stephen-mcaleese) on LLMs for Alignment Research: a safety priority? · 2024-04-05T08:43:00.503Z · LW · GW

LLMs aren't that useful for alignment experts because it's a highly specialized field and there isn't much relevant training data. The AI Safety Chatbot partially solves this problem using retrieval-augmented generation (RAG) on a database of articles from https://aisafety.info. There also seem to be plans to fine-tune it on a dataset of alignment articles.
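
As a rough illustration of the RAG idea, here is a minimal sketch that retrieves the most relevant article for a question and pastes it into the prompt. It is not the AI Safety Chatbot's actual code: the article snippets are made up, the bag-of-words "embedding" stands in for a real neural encoder, and all function names are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words count vector (real systems use a neural encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, articles, k=1):
    """Return the k articles most similar to the query."""
    q = embed(query)
    return sorted(articles, key=lambda art: cosine(q, embed(art)), reverse=True)[:k]

def build_prompt(query, articles, k=1):
    """Put the retrieved articles into the prompt that would be sent to the LLM."""
    context = "\n".join(retrieve(query, articles, k))
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")

# Made-up stand-ins for articles in the database.
ARTICLES = [
    "Inner alignment is about whether a trained model pursues the objective it was trained on.",
    "RLHF fine-tunes a language model using a reward model trained on human preferences.",
    "Interpretability research tries to understand the internal computations of neural networks.",
]
QUERY = "How does RLHF use human preferences?"
prompt = build_prompt(QUERY, ARTICLES)
```

The model itself is unchanged here; it just sees relevant alignment text in its context window, which is why RAG can partly compensate for the lack of specialized training data.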

Comment by Stephen McAleese (stephen-mcaleese) on Reward is not the optimization target · 2024-04-04T22:30:41.927Z · LW · GW

OP says that this post is focused on RL policy gradient algorithms (e.g. PPO) where the RL signal is used by gradient descent to update the policy.

But what about Q-learning, which is another popular RL algorithm? My understanding of Q-learning is that the policy network takes an observation as input, calculates the value (expected return) of each possible action in the state and then chooses the action with the highest value.

Does this mean that reward is not the optimization target for policy gradient algorithms but is for Q-learning algorithms?
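
For reference, here is a minimal tabular sketch of the Q-learning mechanics described above, assuming a lookup table in place of a neural network. The state names, action names, and hyperparameters are made up for illustration, and the sketch is not meant to settle the question either way.

```python
from collections import defaultdict

def greedy_action(q_values, state, actions):
    """Pick the action with the highest estimated value in `state`."""
    return max(actions, key=lambda a: q_values[(state, a)])

def q_update(q_values, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """One Q-learning step: move Q(s, a) towards reward + gamma * max_a' Q(s', a')."""
    best_next = max(q_values[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    q_values[(state, action)] += alpha * (target - q_values[(state, action)])

ACTIONS = ["left", "right"]
q = defaultdict(float)  # all value estimates start at 0.0

# The agent took "right" in state "s0", received reward 1.0 and landed in "s1".
q_update(q, "s0", "right", reward=1.0, next_state="s1", actions=ACTIONS)
chosen = greedy_action(q, "s0", ACTIONS)
```

Note that the reward only appears inside the update target used during training. At decision time the agent maximizes its learned value estimates, which may or may not track the true reward.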