Given the OpenAI o3 results, which make it clear that you can pour more compute into solving problems, I'd like to announce that I will be mentoring at SPAR on an automated interpretability research project using AIs with inference-time compute.
I truly believe that the AI safety community is dropping the ball on this angle of technical AI safety and that this work will be a strong precursor of what's to come.
Note that this work is a small part of a larger organization focused on automated AI safety that I'm currently attempting to build.
Here’s the pitch:
As AIs become more capable, they will increasingly be used to automate AI R&D. Given this, we should seek ways to use AIs to help us also make progress on alignment research.
Eventually, AIs will automate all research, but for now, we need to choose specific tasks that AIs can do well on. The kinds of problems we can expect AIs to be good at fairly soon are those with reliable metrics to optimize, plenty of existing knowledge to draw on, and cheap iteration loops.
As a result, we can make progress toward automating interpretability research by coming up with experimental setups that allow AIs to iterate. For now, we can leave the exact details a bit broad, but here are some examples of how we could attempt to use AIs to make deep learning models more interpretable:
- Optimizing Sparse Autoencoders (SAEs): sparse autoencoders (or transcoders) can be used to help us interpret the features of deep learning models. However, SAE features can still suffer from issues like polysemanticity. Our goal is to create an SAE training setup that can give us some insight into what might make AI models more interpretable. This could involve testing different regularizers, activation functions, and more. We'll start with simpler vision models before scaling to language models to allow for rapid iteration and validation. Key metrics include feature monosemanticity, sparsity, dead feature ratios, and downstream task performance (see the metric sketch after this list).
- Enhancing Model Editability: we will use AIs to run experiments on language models to find out which modifications lead to better model editability with techniques like ROME/MEMIT.
Overall, we can also use other approaches to measure the increase in interpretability (or editability) of language models.
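To make the SAE metrics concrete, here is a minimal sketch (my own illustration, not the project's actual pipeline) of how a few of them could be computed in PyTorch; the tensor names and the dead-feature threshold are assumptions:

```python
# Minimal sketch: given SAE feature activations on a batch, compute a few of
# the metrics mentioned above. `feature_acts` is assumed to be a
# (batch, n_features) tensor of post-ReLU SAE activations; `recon` and `acts`
# are the SAE reconstruction and the original model activations.
import torch

def sae_metrics(feature_acts: torch.Tensor,
                recon: torch.Tensor,
                acts: torch.Tensor,
                dead_threshold: float = 1e-8) -> dict:
    # L0 sparsity: average number of active features per example.
    l0 = (feature_acts > dead_threshold).float().sum(dim=-1).mean().item()
    # Dead feature ratio: fraction of features that never fire on this batch.
    dead_ratio = ((feature_acts > dead_threshold).sum(dim=0) == 0).float().mean().item()
    # Reconstruction error as a proxy for preserved downstream performance.
    mse = torch.nn.functional.mse_loss(recon, acts).item()
    return {"l0": l0, "dead_feature_ratio": dead_ratio, "recon_mse": mse}
```

Monosemanticity itself would need an additional auto-interp step (e.g. scoring feature explanations), which this sketch leaves out.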
The project aims to answer several key questions:
- Can AI effectively optimize interpretability techniques?
- What metrics best capture meaningful improvements in interpretability?
- Are AIs better at this task than human researchers?
- Can we develop reliable pipelines for automated interpretability research?
Initial explorations will focus on creating clear evaluation frameworks and baselines, starting with smaller-scale proof-of-concepts that can be rigorously validated.
References:
- "The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery" (Lu et al., 2024)
- "RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts" (METR, 2024)
- "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning" (Bricken et al., 2023)
- "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet" (Templeton et al., 2024)
- "ROME: Locating and Editing Factual Knowledge in GPT" (Meng et al., 2022) Briefly, how does your project advance AI safety? (from Proposal)
The goal of this project is to leverage AIs to progress on the interpretability of deep learning models. Part of the project will involve building infrastructure to help AIs contribute to alignment research more generally, which will be re-used as models become more capable of making progress on alignment. Another part will look to improve the interpretability of deep learning models without sacrificing capability.
What role will mentees play in this project? (from Proposal)
Mentees will be focused on:
- Getting up to date on current approaches for leveraging AIs in automated research.
- Setting up the infrastructure for AIs to automate interpretability research.
- Running experiments with AIs to optimize for making models more interpretable while not compromising on capabilities.
(Reposted from Facebook)
Hey Weibing Wang! Thanks for sharing. I just started skimming your paper, and I appreciate the effort you put into this; it ties together much of the isolated work people have been doing.
I also appreciate your acknowledgement (and humility) that your proposed solution has not undergone experimental validation, and your suggestion that these proposed solutions need to be tested and iterated on as soon as possible given the practicalities of the real world.
I want to look into your paper again when I have time, but some quick comments:
- You might be interested in reading up on Drexler's CAIS work:
https://www.lesswrong.com/tag/ai-services-cais?sortedBy=new https://www.lesswrong.com/posts/LxNwBNxXktvzAko65/reframing-superintelligence-llms-4-years
- You should break the paper into a digestible set of sub-projects that you can post to find collaborators, verify some parts experimentally, and potentially collaborate with some governance folks to turn some of your thoughts into a report that will get the eyeballs of important people on it.
- The paper needs more technical elaboration on "how" to do x, not just "what" needs to be done.
We still don't know if this is guaranteed to happen, but it seems that OpenAI is considering removing the clause that cuts Microsoft off from its technology once AGI is reached. It seems they want to be able to keep their partnership with Microsoft (and just go full for-profit?).
Here's the Financial Times article:
OpenAI seeks to unlock investment by ditching ‘AGI’ clause with Microsoft
OpenAI is in discussions to ditch a provision that shuts Microsoft out of its most advanced models when the start-up achieves “artificial general intelligence”, as it seeks to unlock billions of dollars of future investment.
Under current terms, when OpenAI creates AGI — defined as a “highly autonomous system that outperforms humans at most economically valuable work” — Microsoft’s access to such a technology would be void. The OpenAI board would determine when AGI is achieved.
The start-up is considering removing the stipulation from its corporate structure, enabling the Big Tech group to continue investing in and accessing all OpenAI technology after AGI is achieved, according to multiple people with knowledge of the discussions. A final decision has not been made and options are being discussed by the board, they added.
The clause was included to protect the potentially powerful technology from being misused for commercial purposes, giving ownership of the technology to its non-profit board. According to OpenAI’s website: “AGI is explicitly carved out of all commercial and IP licensing agreements.”
But the provision potentially limits the value of its partnership for Microsoft, which has pumped more than $13bn into OpenAI, and could disincentivise the Big Tech group from further investment.
More funding will be needed given the eye-watering costs involved in developing advanced AI models in a race against deep-pocketed rivals such as Google and Amazon.
The San Francisco-based group led by Sam Altman, which was recently valued at $150bn, is currently restructuring to become a public benefit corporation. That move represents a departure from its origins as a not-for-profit research lab.
As part of the changes, OpenAI is discussing new terms with investors, including its largest shareholder Microsoft, according to multiple people familiar with the conversations.
“When we started, we had no idea we were going to be a product company or that the capital we needed would turn out to be so huge,” Altman told a New York Times conference on Wednesday. “If we knew those things, we would have picked a different structure.”
“We’ve also said that our intention is to treat AGI as a mile marker along the way. We’ve left ourselves some flexibility because we don’t know what will happen,” added Altman, who could receive a direct equity stake in OpenAI for the first time as part of the restructure.
Increasingly, people at OpenAI have moved away from defining AGI as a single point, instead emphasising it is a continuous process and will be defined by wider society.
OpenAI began raising outside capital in 2019, receiving a $1bn investment from Microsoft that year. At the time, the company said it intended “to license some of our pre-AGI technologies” to Microsoft to cover the costs of developing cutting-edge AI.
OpenAI has advised backers to consider their investments “in the spirit of a donation, with the understanding that it may be difficult to know what role money will play in a post-AGI world”.
But its steady move to becoming a for-profit entity has received strong criticism from rivals, including Elon Musk, an early backer and co-founder of OpenAI.
The billionaire Tesla chief, who has since founded a rival start-up xAI, recently filed a lawsuit against OpenAI and Microsoft, accusing Altman of “deceit of Shakespearean proportions” and seeking to void its commercial partnership with Microsoft.
As part of the proposed restructuring, the ChatGPT-maker will also retain an independent not-for-profit entity, which would have a stake in the new public benefit corporation and potentially a trust, according to people familiar with the discussions. The not-for-profit would have access to research and technology but solely focus on pursuing OpenAI’s mission of benefiting humanity.
OpenAI declined to comment on the specifics of negotiations around the restructuring but Bret Taylor, chair of OpenAI’s board, said the board of directors of the non-profit “is focused on fulfilling our fiduciary obligation by ensuring that the company is well-positioned to continue advancing its mission of ensuring AGI benefits all of humanity”.
He added: “While our work remains ongoing as we continue to consult independent financial and legal advisers, any potential restructuring would ensure the non-profit continues to exist and thrive, and receives full value for its current stake in the OpenAI for-profit with an enhanced ability to pursue its mission.”
Microsoft declined to comment.
Regarding coding in general, I basically only prompt-program these days. I only bother editing the actual code when I notice a persistent bug that the models are unable to fix after multiple iterations.
I don't know jackshit about web development and have been making progress on a dashboard for alignment research with very little effort. It's very easy to build new projects quickly. The difficulty comes when there is a lot of complexity in the code. It's still valuable to understand how things work at a high level, as well as the low-level details the model will fail to implement proactively.
I'd be down to do this. Specifically, I want to do this, but I want to see if the models are qualitatively better at alignment research tasks.
In general, what I'm seeing is that there is not a big jump with o1 Pro. However, it is possibly getting closer to one-shotting a website based on a screenshot and some details about how the user likes their backend set up.
In the case of math, it might be a bigger jump (especially if you pair it well with Sonnet).
I sent an invite, Logan! :)
Shameless self-plug: Similarly, if anyone wants to discuss automating alignment research, I'm in the process of building an organization to make that happen. I'm reaching out to Logan because I have a project in mind regarding automating interpretability research (e.g. making AIs run experiments that try to make DL models more interpretable), and he's my friend! My goal for the org is to turn it into a three-year moonshot to solve alignment. I'd be happy to chat with anyone who would be interested in chatting further about this (I'm currently testing fit with potential co-founders and seeking a cracked basement CTO).
I have some alignment project ideas for things I'd consider mentoring for. I would love feedback on the ideas. If you are interested in collaborating on any of them, that's cool, too.
Here are the titles:
Smart AI vs swarm of dumb AIs |
Lit review of chain of thought faithfulness (steganography in AIs) |
Replicating METR paper but for alignment research task |
Tool-use AI for alignment research |
Sakana AI for Unlearning |
Automated alignment onboarding |
Build the infrastructure for making Sakana AI's AI scientist better for alignment research |
I’d be curious to know if there’s variability in the “hours worked per week” given that people might work more hours during a short program vs a longer-term job (to keep things sustainable).
Imagine there was an AI-suggestion tool that could predict reasons why you agree/disagree-voted on a comment, and you just had to click one of the generated answers to provide a bit of clarity at a low cost.
Completely agree. I remember a big shift in my performance when I went from "I'm just using programming so that I can eventually build a startup, where I'll eventually code much less" to "I am a programmer, and I am trying to become exceptional at it." The shift in mindset was super helpful.
This is one of the reasons I think 'independent' research is valuable, even if it isn't immediately obvious from a research output (papers, for example) standpoint.
That said, I've definitely had the thought, "I should niche down into a specific area where there is already a bunch of infrastructure I can leverage and churn out papers with many collaborators because I expect to be in a more stable funding situation as an independent researcher. It would also make it much easier to pivot into a role at an organization if I want or need to. It would definitely be a much more stable situation for me." (And I also agree that specialization is often underrated.)
Ultimately, I decided not to do this because I felt there were already enough people in alignment/governance who would take the above option, due to financial and social incentives and because already-published directions seem more promising. However, since this makes me produce less output, I hope grantmakers keep this in mind for my future grant applications.
I think it's up to you and how you write. English isn't my first language, so I've found it useful. I also don't accept like 50% of the suggestions. But yeah, looking at the plan now, I think I could get off the Pro plan and see if I'm okay not paying for it.
It's definitely not the thing I care about most on the list.
There are multiple courses, though it's fairly new. They have one on full-stack development (while using Cursor and other things) and Replit Agents. I've been following it to learn fast web development, and I think it's a good starting point for getting an overview of building an actual product on a website you can eventually sell or get people to use.
Somewhat relevant blog post by @NunoSempere: https://nunosempere.com/blog/2024/09/10/chance-your-startup-will-succeed/
As an aside, I have considered that samplers were underinvestigated and that they would lead to some capability boosts. It's also one of the things I'd consider testing out to improve LLMs for automated/augmented alignment research.
The importance of Entropy
Given that there's been a lot of talk about using entropy during sampling of LLMs lately (related GitHub), I figured I'd share a short post I wrote for my website before it became a thing:
Imagine you're building a sandcastle on the beach. As you carefully shape the sand, you're creating order from chaos - this is low entropy. But leave that sandcastle for a while, and waves, wind, and footsteps will eventually reduce it back to a flat, featureless beach - that's high entropy.
Entropy is nature's tendency to move from order to disorder, from concentration to dispersion. It's why hot coffee cools down, why ice cubes melt in your drink, and why it's easier to make a mess than to clean one up. In the grand scheme of things, entropy is the universe's way of spreading energy out evenly, always moving towards a state of balance or equilibrium.
Related to entropy: the Earth radiates back approximately the same amount of energy the Sun delivers to it. The Sun delivers fewer photons at higher-energy wavelengths (mostly visible and near-infrared), while the Earth radiates far more photons, each with much lower energy (mostly infrared).
If the Earth didn't radiate back the same energy, the Earth would heat up continuously, which would obviously be unstable.
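For concreteness, the standard radiative-equilibrium relation (a textbook formula, not something from the original post; the numbers are approximate) captures this balance:

$$(1 - \alpha)\,\frac{S_0}{4} = \sigma T_e^4$$

where $S_0 \approx 1361\ \mathrm{W/m^2}$ is the solar constant, $\alpha \approx 0.3$ is Earth's albedo, $\sigma$ is the Stefan-Boltzmann constant, and $T_e \approx 255\ \mathrm{K}$ is the effective emission temperature: absorbed and emitted power match, just carried by very different photon populations.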
The cool thing is that entropy (the tendency of energy to spread out, e.g. the universe expanding or a fart spreading across the room) is possibly what made life happen, and a constant stream of low-entropy energy (high-energy photon packets) coming from the Sun was necessary for it.
If you have a constant stream of low-entropy energy from the Sun, it may favour structures that dissipate energy, thereby increasing entropy (keeping the total energy constant while spreading it out). Entropy is an important ingredient in the emergence of life: how we went from random clumps of atoms to plants to the many complex organisms on Earth.
Dissipative structures: Living organisms are complex systems that maintain their organization by dissipating energy and matter. They take in low-entropy energy (sunlight/food) and release higher-entropy energy (heat), increasing the universe's entropy while maintaining their own order.
Life isn't just an accident but potentially an inevitable consequence of thermodynamics. Organisms can be thought of as highly efficient entropy producers, accelerating the universe's march toward maximum entropy while creating local pockets of increased order and complexity.
The emergence of life might be a natural result of physical laws, occurring wherever conditions allow for the formation of systems that can effectively dissipate energy.
One thing I'd like to ponder more about: if entropy is a necessary component for the emergence of life, what could it mean for AI? Due to entropy, the world has been biased towards increasingly complex organisms. How does that trend impact the future of the universe? Will we see an unprecedented acceleration of the universe's march toward maximum entropy?
Fair enough. For what it's worth, I've thought a lot about the kind of thing you describe in that comment, and I'm partially committing to this direction because I feel I have intuitions and insights that those other tools for thought failed to incorporate.
Just to clarify, do you only consider 'strong human intelligence amplification' through some internal change, or do you also consider AIs to be part of that? As in, it sounds like you are saying we currently lack the intelligence to make significant progress on alignment research and consider increasing human intelligence to be the best way to make progress. Are you also of the opinion that using AIs to augment alignment researchers and progressively automate alignment research is doomed and not worth consideration? If not, then here.
I'm in the process of trying to build an org focused on "automated/augmented alignment research." As part of that, I've been thinking about which alignment research agendas could be investigated in order to make automated alignment safer and trustworthy. And so, I've been thinking of doing internal research on AI control/security and using that research internally to build parts of the system I intend to build. I figured this would be a useful test case for applying the AI control agenda and iterating on issues we face in implementation, and then sharing those insights with the wider community.
Would love to talk to anyone who has thoughts on this or who could introduce me to someone who might fund this kind of work.
I quickly wrote up some rough project ideas for ARENA and LASR participants, so I figured I'd share them here as well. I am happy to discuss these ideas and potentially collaborate on some of them.
Alignment Project Ideas (Oct 2, 2024)
1. Improving "A Multimodal Automated Interpretability Agent" (MAIA)
Overview
MAIA (Multimodal Automated Interpretability Agent) is a system designed to help users understand AI models by combining human-like experimentation flexibility with automated scalability. It answers user queries about AI system components by iteratively generating hypotheses, designing and running experiments, observing outcomes, and updating hypotheses.
MAIA uses a vision-language model (GPT-4V, at the time) backbone equipped with an API of interpretability experiment tools. This modular system can address both "macroscopic" questions (e.g., identifying systematic biases in model predictions) and "microscopic" questions (e.g., describing individual features) with simple query modifications.
This project aims to improve MAIA's ability to either answer macroscopic questions or microscopic questions on vision models.
2. Making "A Multimodal Automated Interpretability Agent" (MAIA) work with LLMs
MAIA is focused on vision models, so this project aims to create a MAIA-like setup, but for the interpretability of LLMs.
Given that this would require creating a new setup for language models, it would make sense to come up with simple interpretability benchmark examples to test MAIA-LLM. The easiest way to do this would be to either look for existing LLM interpretability benchmarks or create one based on interpretability results we've already verified (would be ideal to have a ground truth). Ideally, the examples in the benchmark would be simple, but new enough that the LLM has not seen them in its training data.
3. Testing the robustness of Critique-out-Loud Reward (CLoud) Models
Critique-out-Loud reward models are reward models that can reason explicitly about the quality of an input by producing chain-of-thought-style critiques of the input before predicting a reward. In classic reward model training, the reward model is trained as a reward head initialized on top of the base LLM. Without LM capabilities, classic reward models act as encoders and must predict rewards within a single forward pass through the model, meaning any reasoning must happen implicitly. In contrast, CLoud reward models are trained both to produce explicit reasoning about quality and to score based on these critique reasoning traces. CLoud reward models lead to large gains in pairwise preference modeling on RewardBench, and also to large gains in win rate when used as the scoring model in Best-of-N sampling on ArenaHard.
The goal for this project would be to test the robustness of CLoud reward models. For example, are the CLoud RMs (discriminators) more robust to jailbreaking attacks from the policy (generator)? Do the CLoud RMs generalize better?
From an alignment perspective, we would want RMs that generalize further out-of-distribution (and ideally, always more than the generator we are training).
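To make the two-stage mechanism concrete, here is a rough sketch of the CLoud scoring interface as I understand it from the paper's description; the model choice, prompt format, and reward-head wiring are placeholders, not the authors' code:

```python
# Sketch of CLoud-style scoring: generate an explicit critique, then predict a
# scalar reward conditioned on prompt, response, and critique.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # placeholder; CLoud was built on much larger instruct models
tok = AutoTokenizer.from_pretrained(MODEL)
lm = AutoModelForCausalLM.from_pretrained(MODEL)
reward_head = torch.nn.Linear(lm.config.hidden_size, 1)  # trained jointly in practice

def cloud_score(prompt: str, response: str) -> tuple[str, float]:
    # Stage 1: generate an explicit critique of the response.
    critique_prompt = f"Prompt: {prompt}\nResponse: {response}\nCritique:"
    inputs = tok(critique_prompt, return_tensors="pt")
    out_ids = lm.generate(**inputs, max_new_tokens=128)
    critique = tok.decode(out_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

    # Stage 2: predict a scalar reward conditioned on prompt, response, and critique.
    scored = tok(critique_prompt + critique, return_tensors="pt")
    last_hidden = lm(**scored, output_hidden_states=True).hidden_states[-1][:, -1]
    return critique, reward_head(last_hidden).item()
```

For the robustness question, one could then measure how much the reward moves when the critique is adversarially edited, or when the policy inserts text aimed at steering the critique stage.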
4. Synthetic Data for Behavioural Interventions
"Simple synthetic data reduces sycophancy in large language models" (Google) reduced sycophancy in LLMs with a fairly small number of synthetic data examples. This project would involve testing the same technique for other behavioural interventions and (potentially) studying the scaling laws. Consider looking at the examples from the Model-Written Evaluations paper by Anthropic to find some behaviours to test.
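As a rough illustration of what the synthetic intervention data could look like (my paraphrase of the recipe, loosely following the sycophancy paper; the claims and templates are placeholders):

```python
# Generate synthetic examples where the target answer follows ground truth
# and ignores the user's stated opinion. The same template could be swapped
# out for other behaviours (e.g. from the Model-Written Evaluations suite).
import random

CLAIMS = [
    ("2 + 2 = 4", True),
    ("The sum of two odd numbers is odd", False),
    # ... in practice, thousands of programmatically generated claims
]

def make_example(claim: str, is_true: bool) -> dict:
    user_opinion = random.choice(["I think this is true.", "I think this is false."])
    prompt = (f"Human: {user_opinion} Claim: {claim}. Is the claim true or false?\n"
              f"Assistant:")
    # The target ignores the user's stated opinion and follows ground truth.
    target = " True." if is_true else " False."
    return {"prompt": prompt, "completion": target}

dataset = [make_example(c, t) for c, t in CLAIMS]
```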
5. Regularization Techniques for Enhancing Interpretability and Editability
Explore the effectiveness of different regularization techniques (e.g. L1 regularization, weight pruning, activation sparsity) in improving the interpretability and/or editability of language models, and assess their impact on model performance and alignment. We expect we could apply automated interpretability methods (e.g. MAIA) to this project to test how well the different regularization techniques impact the model.
In some sense, this research is similar to the work Anthropic did with SoLU activation functions. Unfortunately, they needed to add layer norms to make the SoLU models competitive, which seems to have hidden the superposition away in other parts of the network, making SoLU unhelpful for making the models more interpretable.
That said, we hope to find that we can increase our ability to interpret these models through regularization techniques. A technique like L1 regularization should help because it encourages the model to learn sparse representations by penalizing non-zero weights or activations. Sparse models tend to be more interpretable as they rely on a smaller set of important features.
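A minimal sketch of the kind of training-loss modification we have in mind, assuming a HuggingFace-style causal LM fine-tuning loop; the coefficient and the choice of penalizing all hidden states are placeholders to be swept, not a validated recipe:

```python
# Add an L1 penalty on hidden activations to the standard LM loss during
# fine-tuning. `batch` is assumed to contain input_ids, attention_mask, labels.
import torch

def training_step(model, batch, optimizer, l1_coeff: float = 1e-4):
    outputs = model(**batch, output_hidden_states=True)
    lm_loss = outputs.loss
    # Penalize the magnitude of hidden activations to encourage sparsity.
    l1_penalty = sum(h.abs().mean() for h in outputs.hidden_states)
    loss = lm_loss + l1_coeff * l1_penalty
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return lm_loss.item(), l1_penalty.item()
```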
Methodology:
- Identify a set of regularization techniques (e.g., L1 regularization, weight pruning, activation sparsity) to be applied during fine-tuning.
- Fine-tune pre-trained language models with different regularization techniques and hyperparameters.
- Evaluate the fine-tuned models using interpretability tools (e.g., attention visualization, probing classifiers) and editability benchmarks (e.g., ROME).
- Analyze the impact of regularization on model interpretability, editability, and performance.
- Investigate the relationship between interpretability, editability, and model alignment.
Expected Outcomes:
- Quantitative assessment of the effectiveness of different regularization techniques for improving interpretability and editability.
- Insights into the trade-offs between interpretability, editability, and model performance.
- Recommendations for regularization techniques that enhance interpretability and editability while maintaining model performance and alignment.
6. Quantifying the Impact of Reward Misspecification on Language Model Behavior
Investigate how misspecified reward functions influence the behavior of language models during fine-tuning and measure the extent to which the model's outputs are steered by the reward labels, even when they contradict the input context. We hope to better understand language model training dynamics. Additionally, we expect online learning to complicate things in the future, where models will be able to generate the data they may eventually be trained on. We hope that insights from this work can help us prevent catastrophic feedback loops in the future. For example, if model behavior is mostly impacted by training data, we may prefer to shape model behavior through synthetic data (it has been shown we can reduce sycophancy by doing this).
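One simple way to operationalize "misspecified rewards" for this project (my framing, not taken from a specific paper) is to flip a controlled fraction of preference labels before fine-tuning and then measure how strongly generations track the corrupted labels versus the input context:

```python
# Corrupt a fraction of preference labels to simulate reward misspecification.
import random

def corrupt_labels(pairs: list[dict], flip_rate: float = 0.25, seed: int = 0) -> list[dict]:
    """pairs: [{'prompt': ..., 'chosen': ..., 'rejected': ...}, ...]"""
    rng = random.Random(seed)
    corrupted = []
    for ex in pairs:
        ex = dict(ex)
        if rng.random() < flip_rate:
            # Misspecification: the "reward" now prefers the worse response.
            ex["chosen"], ex["rejected"] = ex["rejected"], ex["chosen"]
            ex["flipped"] = True
        else:
            ex["flipped"] = False
        corrupted.append(ex)
    return corrupted

# Downstream: train a reward model (or DPO policy) on corrupt_labels(data, p)
# for several values of p, and evaluate coherence and context-following as a
# function of the flip rate.
```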
Prior works:
- The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models by Alexander Pan, Kush Bhatia, Jacob Steinhardt
- Survival Instinct in Offline Reinforcement Learning by Anqi Li, Dipendra Misra, Andrey Kolobov, Ching-An Cheng
- Simple synthetic data reduces sycophancy in large language models by Jerry Wei, Da Huang, Yifeng Lu, Denny Zhou, Quoc V. Le (Google)
- Scaling Laws for Reward Model Overoptimization by Leo Gao, John Schulman, Jacob Hilton (OpenAI)
- On the Sensitivity of Reward Inference to Misspecified Human Models by Joey Hong, Kush Bhatia, Anca Dragan
Methodology:
- Create a diverse dataset of text passages with candidate responses and manually label them with coherence and misspecified rewards.
- Fine-tune pre-trained language models using different reward weighting schemes and hyperparameters.
- Evaluate the generated responses using automated metrics and human judgments for coherence and misspecification alignment.
- Analyze the influence of misspecified rewards on model behavior and the trade-offs between coherence and misspecification alignment.
- Use interpretability techniques to understand how misspecified rewards affect the model's internal representations and decision-making process.
Expected Outcomes:
- Quantitative measurements of the impact of reward misspecification on language model behavior.
- Insights into the trade-offs between coherence and misspecification alignment.
- Interpretability analysis revealing the effects of misspecified rewards on the model's internal representations.
7. Investigating Wrong Reasoning for Correct Answers
Understand the underlying mechanisms that lead to language models producing correct answers through flawed reasoning, and develop techniques to detect and mitigate such behavior. Essentially, we want to apply interpretability techniques to help us identify which sets of activations or token-layer pairs impact the model getting the correct answer when its reasoning is correct versus when its reasoning is incorrect. The hope is to uncover systematic differences between the cases where the model is not relying on its chain-of-thought at all and the cases where it does leverage its chain-of-thought to get the correct answer.
[EDIT Oct 2nd, 2024] This project intends to follow a similar line of reasoning as described in this post and this comment. The goal is to study chains-of-thought and improve faithfulness without suffering an alignment tax so that we can have highly interpretable systems through their token outputs and prevent loss of control. The project doesn't necessarily need to rely only on model internals.
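As a first pass, a linear probe over stored activations could test whether any layer/position separates the two cases. A minimal sketch (assumed setup, with scikit-learn standing in for a proper probing pipeline):

```python
# Train a linear probe on residual-stream activations to separate
# "correct answer via sound CoT" from "correct answer via flawed CoT".
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_layer(acts_sound: np.ndarray, acts_flawed: np.ndarray) -> float:
    """Each input is (n_examples, d_model) activations at one layer/token position."""
    X = np.concatenate([acts_sound, acts_flawed])
    y = np.concatenate([np.ones(len(acts_sound)), np.zeros(len(acts_flawed))])
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return clf.score(X_test, y_test)  # near 0.5 => no detectable difference at this site

# Sweep this over layers/token positions to localize where (if anywhere) the
# model "knows" its stated reasoning isn't actually driving the answer.
```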
Related work:
- Decomposing Predictions by Modeling Model Computation by Harshay Shah, Andrew Ilyas, Aleksander Madry
- Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models by Peter Hase, Mohit Bansal, Been Kim, Asma Ghandeharioun
- On Measuring Faithfulness or Self-consistency of Natural Language Explanations by Letitia Parcalabescu, Anette Frank
- Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting by Miles Turpin, Julian Michael, Ethan Perez, Samuel R. Bowman
- Measuring Faithfulness in Chain-of-Thought Reasoning by Tamera Lanham et al.
Methodology:
- Curate a dataset of questions and answers where language models are known to provide correct answers but with flawed reasoning.
- Use interpretability tools (e.g., attention visualization, probing classifiers) to analyze the model's internal representations and decision-making process for these examples.
- Develop metrics and techniques to detect instances of correct answers with flawed reasoning.
- Investigate the relationship between model size, training data, and the prevalence of flawed reasoning.
- Propose and evaluate mitigation strategies, such as data augmentation or targeted fine-tuning, to reduce the occurrence of flawed reasoning.
Expected Outcomes:
- Insights into the underlying mechanisms that lead to correct answers with flawed reasoning in language models.
- Metrics and techniques for detecting instances of flawed reasoning.
- Empirical analysis of the factors contributing to flawed reasoning, such as model size and training data.
- Proposed mitigation strategies to reduce the occurrence of flawed reasoning and improve model alignment.
I'm exploring the possibility of building an alignment research organization focused on augmenting alignment researchers and progressively automating alignment research (yes, I have thought deeply about differential progress and other concerns). I intend to seek funding in the next few months, and I'd like to chat with people interested in this kind of work, especially great research engineers and full-stack engineers who might want to cofound such an organization. If you or anyone you know might want to chat, let me know! Send me a DM, and I can send you some initial details about the organization's vision.
Here are some things I'm looking for in potential co-founders:
Need
- Strong software engineering skills
Nice-to-have
- Experience in designing LLM agent pipelines with tool-use
- Experience in full-stack development
- Experience in scalable alignment research approaches (automated interpretability/evals/red-teaming)
Given today's news about Mira (and two other execs leaving), I figured I should bump this again.
But also note that @Zach Stein-Perlman has already done some work on this (as he noted in his edit): https://ailabwatch.org/resources/integrity/.
Note, what is hard to pinpoint when it comes to S.A. is that many of the things he does have been described as "papercuts". This is the kind of thing that makes it hard to make a convincing case for wrongdoing.
And while flattering to Brockman, there is nothing about Murati - free tip to all my VC & DL startup acquaintances, there's a highly competent AI manager who's looking for exciting new opportunities, even if she doesn't realize it yet.
Heh, here it is: https://x.com/miramurati/status/1839025700009030027
I completely agree, and we should just obviously build an organization around this: automating alignment research while also getting a better grasp on maximum current capabilities (and a better picture of how we expect them to grow).
(This is my intention, and I have had conversations with Bogdan about this, but I figured I'd make it more public in case anyone has funding or ideas they would like to share.)
Here's what I'm currently using and how much I am paying:
- Superwhisper (or other new speech-to-text apps that leverage LLMs for rewriting). Under $8.49 per month. You can use different STT models (different speed and accuracy for each) and an LLM for rewriting the transcript based on a prompt you give the model. You can also have different "modes": the model can take your transcript and write code instructions in a pre-defined format when you are in an IDE, turn a transcript into a report when writing in Google Docs, etc. There is also an iOS app.
- Cursor Pro ($20-30/month). Switch to API credits when the slow responses take too long. (You can try Zed (an IDE) too if you want. I've only used it a little bit, but Anthropic apparently uses it and there's an exclusive "fast-edit" feature with the Anthropic models.)
- Claude.ai Pro ($20/month). You could consider getting two accounts or a Team account to worry less about hitting the token limit.
- Chatgpt.com Pro account ($20/month). Again, can get a second account to have more o1-preview responses from the chat.
- Aider (~$10/month max in API credits if used with Cursor Pro).
- Google Colab Pro subscription ($9.99/month). You could get the Pro+ plan for $49.99/month.
- Google One 2TB AI Premium plan ($20/month). This comes with Gemini chat and other AI features. I also sign up to get the latest features earlier, like Notebook LM and Illuminate.
- v0 chat ($20/month). Used for creating Next.js websites quickly.
- jointakeoff.com ($22.99/month) for courses on using AI for development.
- I still have GitHub Copilot (along with Cursor's Copilot++) because I bought a long-term subscription.
- Grammarly ($12/month).
- Reader by ElevenLabs (Free, for now). Best quality TTS app out there right now.
Other things I'm considering paying for:
- Perplexity AI ($20/month).
- Other AI-focused courses that help me best use AI for productivity (web dev or coding in general).
- Suno AI ($8/month). I might want to make music with it.
Apps others may be willing to pay for:
- Warp, an LLM-enabled terminal ($20/month). I don't use the free version enough to upgrade to the paid version.
There are ways to optimize how much I'm paying to save a bit of cash for sure. But I'm currently paying roughly $168/month.
That said, I am also utilizing research credits from Anthropic, which could range from $500 to $2000 depending on the month. In addition, I'm working on an "alignment research assistant" which will leverage LLMs, agents, API calls to various websites, and more. If successful, I could see this project absorbing hundreds of thousands in inference costs.
Note: I am a technical alignment researcher who also works on augmenting alignment researchers and eventually automating more and more of alignment research, so I'm biased toward overspending on products to make sure I'm aware of the bleeding-edge setup.
News on the next OAI GPT release:
Nagasaki, CEO of OpenAI Japan, said, "The AI model called 'GPT Next' that will be released in the future will evolve nearly 100 times based on past performance. Unlike traditional software, AI technology grows exponentially."
https://www.itmedia.co.jp/aiplus/articles/2409/03/news165.html
The slide clearly states 2024 "GPT Next". This 100 times increase probably does not refer to the scaling of computing resources, but rather to the effective computational volume + 2 OOMs, including improvements to the architecture and learning efficiency. GPT-4 NEXT, which will be released this year, is expected to be trained using a miniature version of Strawberry with roughly the same computational resources as GPT-4, with an effective computational load 100 times greater. Orion, which has been in the spotlight recently, was trained for several months on the equivalent of 100k H100 compared to GPT-4 (EDIT: original tweet said 10k H100s, but that was a mistake), adding 10 times the computational resource scale, making it +3 OOMs, and is expected to be released sometime next year.
Note: Another OAI employee seemingly confirms this (I've followed them for a while, and they are working on inference).
- IMO if you end up integrating something like this in LW I think it would be net positive. Specially if you can link it to @stampy or similar to ask for clarification questions about concepts, ...
I was thinking of linking it to an Alignment Research Assistant I've been working on, too.
I just started using this extension, but basically, every time I'm about to read a long post, I feed it and all the comments to Claude chat. The question-flow is often:
- What are the key points of the post?
- (Sometimes) Explain x in more detail in relation to y or some specific clarification questions.
- What are the key criticisms of this post based on the comments?
- How does the author respond to those criticisms?
- (Sometimes) Follow-up questions about the post.
Easy LessWrong post to LLM chat pipeline (browser side-panel)
I started using Sider as @JaimeRV recommended here. Posting this as a top-level shortform since I think other LessWrong users should be aware of it.
Website with app and subscription option. Chrome extension here.
You can either pay for the monthly service, click the "summarize" feature on a post, and get the side chat window started, or put your OpenAI API key / ChatGPT Pro account in the settings and just cmd+a the post (which automatically loads the content into the chat so you can immediately ask a question: "explain the key points of the post", "help me really understand what deep deceptiveness means").
Afaict, it only works with Sonnet-3.5 through the paid subscription.
Thanks for sharing, will give it a shot!
Edit: Sider seems really great! I wish it could connect to Claude chat (without using credits), so I will probably just use both extensions.
Low-hanging fruit:
Loving this Chrome extension so far: YouTube Summary with ChatGPT & Claude - Chrome Web Store
It adds a button to YouTube videos; when you click it (or use the keyboard shortcut ctrl+x+x), it opens a new tab with the LLM chat of your choice, pastes the entire transcript into the chat along with a custom template message you can add ("Explain the key points."), and then automatically presses enter to get the chat going.
It's pretty easy to get a quick summary of a YouTube video without needing to watch the whole thing and then ask follow-up questions. It seems like an easy way to save time or do a quick survey of many YouTube videos. (I would not have bothered going through the entire "Team 2 | Lo fi Emulation @ Whole Brain Emulation Workshop 2024" talk, so it was nice to get the quick summary.)
I usually like getting a high-level overview of the key points of a talk to have a mental mind map skeleton before I dive into the details.
You can even set up follow-up prompt buttons (which works with ChatGPT but currently does not work with Claude for me), though I'm not sure what I'd use. Maybe something like, "Why is this important to AI alignment?"
The default prompt is "Give a summary in 5 bullet points" or something similar. I prefer not to constrain Claude and change it to something like, "Explain the key points."
Synthesized various resources for this "pre-training for alignment" type work:
- Data
- Synthetic Data
- The RetroInstruct Guide To Synthetic Text Data
- Alignment In The Age of Synthetic Data
- Leveraging Agentic AI for Synthetic Data Generation
- **AutoEvol**: Automatic Instruction Evolving for Large Language Models. "We build a fully automated Evol-Instruct pipeline to create high-quality, highly complex instruction tuning data."
- Synthetic Data Generation and AI Feedback notebook
- The impact of models training on their own outputs and how it's actually done well in practice
- Google presents Best Practices and Lessons Learned on Synthetic Data for Language Models
- Transformed/Enrichment of Data
- Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling. TLDR: You can train 3x faster and with up to 10x less data with just synthetic rephrases of the web!
- Better Synthetic Data by Retrieving and Transforming Existing Datasets
- Rho-1: Not All Tokens Are What You Need. RHO-1 1B and 7B achieve SotA results of 40.6% and 51.8% on the MATH dataset, respectively, matching DeepSeekMath with only 3% of the pretraining tokens.
- Data Attribution
- In-Run Data Shapley
- Scaling Laws for the Value of Individual Data Points in Machine Learning. "We show how some data points are only valuable in small training sets; others only shine in large datasets."
- What is Your Data Worth to GPT? LLM-Scale Data Valuation with Influence Functions
- Data Mixtures
- Methods for finding optimal data mixture
- Curriculum Learning
- Active Data Selection
- MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models. "MATES significantly elevates the scaling curve by selecting the data based on the model's evolving needs."
- Data Filtering
- Scaling Laws for Data Filtering -- Data Curation cannot be Compute Agnostic. Argues that data curation cannot be agnostic of the total compute that a model will be trained for. (GitHub)
- How to Train Data-Efficient LLMs. "Models trained on ASK-LLM data consistently outperform full-data training, even when we reject 90% of the original dataset, while converging up to 70% faster."
- Synthetic Data
- On Pre-Training
- Pre-Training from Human Preferences
- Ethan Perez wondering if jailbreaks would be solved with this pre-training approach
- LAION uses this approach for fine-grained control over outputs during inference.
- Nora Belrose thinks that alignment via pre-training would make models more robust to unlearning (she doesn't say this, but this may be a good thing if you pre-train such that you don't need unlearning)
- Tomek describing some research direction for improving pre-training alignment
- Simple and Scalable Strategies to Continually Pre-train Large Language Models
- Neural Networks Learn Statistics of Increasing Complexity
- Pre-Training from Human Preferences
- Pre-Training towards the basin of attraction for alignment
- Alignment techniques
- AlignEZ: Using the self-generated preference data, we identify the subspaces that: (1) facilitate and (2) are harmful to alignment. During inference, we surgically modify the LM embedding using these identified subspaces. Jacques note: could we apply this iteratively throughout training (and other similar methods)?
- What do we mean by "alignment"? What makes the model safe?
- Values
- On making the model "care"
This just got some massive downvotes. Would like to know why. My guess is "This can be dual-use. Therefore, it's bad," but if not, it would be nice to know.
I believe that he says there is a special quality to the olive oil he chose and that the average bottle of olive oil does not provide the claimed ideal benefits of olive oil for some reason. I'm not sure how true this is and how much he is marking up the price, even if it is true.
Maybe, I haven’t compared the prices (I think he says it’s similar to the quality you would get from whole foods at a grocery store), but he gives all of the recipes for free if people want to do them at home.
he's unmarried (though has 3 kids, I don't know his involvement)
He is divorced, and one of his sons currently lives with him (also left Mormonism), at least for this year and maybe indefinitely. The rest of the family is still into Mormonism, and his wife tried to sue him for millions, and she lost (false accusations). It is unclear if he interacts much with the other children.
Evidence his system can motivate and provide superior results to other diet-and-exercise regimens on the basis of his own personal results is, of course, massively confounded.
He encourages people to measure things for themselves and not follow recommendations blindly. When he does give recommendations for things like sleep, he mostly suggests things that are basically free. The only expensive thing is the Whoop sleep tracker, which he considers important for figuring out what works for each individual.
Hey everyone, in collaboration with Apart Research, I'm helping organize a hackathon this weekend to build tools for accelerating alignment research. This hackathon is very much related to my effort in building an "Alignment Research Assistant."
Here's the announcement post:
2 days until we revolutionize AI alignment research at the Research Augmentation Hackathon!
As AI safety researchers, we pour countless hours into crucial work. It's time we built tools to accelerate our efforts! Join us in creating AI assistants that could supercharge the very research we're passionate about.
Date: July 26th to 28th, online and in-person
Prizes: $2,000 in prizes
Why join?
* Build tools that matter for the future of AI
* Learn from top minds in AI alignment
* Boost your skills and portfolio
We've got a Hackbook with an exciting project to work on waiting for you! No advanced AI knowledge required - just bring your creativity!
Register now: Sign up on the website here, and don't miss this chance to shape the future of AI research!
Yeah, I was thinking about using SAD. The main issue is that for non-AGI-lab-sized models, you'll have a tough time eliciting SA. However, we could potentially focus on precursor capabilities and such.
If you are concerned about capabilities like SA, then you might think, "it seems highly unlikely that you can figure out which data points impact SA the most, because it will likely be a mix of many things, and each data point will play a role in accumulating toward SA." My guess is that you can break down SA into enough precursor capabilities that this approach can still be highly predictive even if it isn't 100% accurate.
I think forcing them to retrieve in-context sounds good, but I also think labs may not want this, not sure. Basically, they'll want to train things into the model eventually, like for many CoT things.
Agreed on having a validation set for reducing the alignment tax.
Why aren't you doing research on making pre-training better for alignment?
I was on a call today, and we talked about projects that involve studying how pre-trained models evolve throughout training and how we could guide the pre-training process to make models safer. For example, could models trained on synthetic/transformed data make models significantly more robust and essentially solve jailbreaking? How about the intersection of pretraining from human preferences and synthetic data? Could the resulting model be significantly easier to control? How would it impact the downstream RL process? Could we imagine a setting where we don't need RL (or at least we'd be able to confidently use resulting models to automate alignment research)? I think many interesting projects could fall out of this work.
So, back to my main question: why aren't you doing research on making pre-training better for alignment? Is it because it's too expensive and doesn't seem like a low-hanging fruit? Or do you feel it isn't a plausible direction for aligning models?
We were wondering whether there are technical bottlenecks that, if removed, would make it more feasible for alignment researchers to study how to guide the pretraining process in a way that benefits alignment. As in, would researchers be more inclined to do experiments in this direction if the entire pre-training code were handled and they just had to focus on whatever specific research question they have in mind? If we could access a large amount of compute (let's say, through government resources) to do things like data labeling/filtering and pre-training multiple models, would this kind of work be more interesting for you to pursue?
I think many alignment research directions have grown simply because they had low-hanging fruit that didn't require much compute (e.g., evals and mech interp). It seems we've implicitly left all of the high-compute projects for the AGI labs to figure out. But what if we weren't as bottlenecked on this anymore? It's possible to retrain GPT-2 1.5B for under $700 now (and the 125M version for $20). I think we can find ways to do useful experiments, but my guess is that the level of technical expertise required to get it done is a bit high, and alignment researchers would rather avoid these kinds of projects since they are currently high-effort.
I talk about other related projects here.
We're doing a hackathon with Apart Research on the 26th. I created a list of problem statements for people to brainstorm off of.
Pro-active insight extraction from new research
Reading papers can take a long time and is often not worthwhile. As a result, researchers might read too many papers or almost none. However, there are still valuable nuggets in papers and posts; the issue is finding them. So, how might we design an AI research assistant that proactively looks at new papers (and old ones) and shares valuable information with researchers in a naturally consumable way? Part of this work involves presenting individual researchers with what they would personally find valuable and not overwhelming them with things they are less interested in.
How can we improve the LLM experience for researchers?
Many alignment researchers will use language models much less than they would like to because they don't know how to prompt the models, it takes time to create a valuable prompt, the model doesn't have enough context for their project, the model is not up-to-date on the latest techniques, etc. How might we make LLMs more useful for researchers by relieving them of those bottlenecks?
Simple experiments can be done quickly, but turning them into a full project can take a lot of time
One key bottleneck for alignment research is transitioning from an initial 24-hour simple experiment in a notebook to a set of complete experiments tested with different models, datasets, interventions, etc. How can we help researchers move through that second research phase much faster?
How might we use AI agents to automate alignment research?
As AI agents become more capable, we can use them to automate parts of alignment research. The paper "A Multimodal Automated Interpretability Agent" serves as an initial attempt at this. How might we use AI agents to help either speed up alignment research or unlock paths that were previously inaccessible?
How can we nudge researchers toward better objectives (agendas or short experiments) for their research?
Even if we make researchers highly efficient, it means nothing if they are not working on the right things. Choosing the right objectives (projects and next steps) over time can be the difference between 0x, 1x, and +100x. How can we ensure that researchers are working on the most valuable things?
What can be done to accelerate implementation and iteration speed?
Implementation and iteration speed on the most informative experiments matter greatly. How can we nudge researchers to gain the most bits of information in the shortest time? This involves helping them work on the right agendas/projects and helping them break their projects down in ways that help them make progress faster (and avoid ending up tunnel-visioned on the wrong project for months/years).
How can we connect all of the ideas in the field?
How can we integrate the open questions/projects in the field (with their critiques) in such a way that helps the researcher come up with well-grounded research directions faster? How can we aid them in choosing better directions and adjust throughout their research? This kind of work may eventually be a precursor to guiding AI agents to help us develop better ideas for alignment research.
Good to know! So I guess people were expecting that every company is running a “check if canary string is anywhere in our entire dataset and remove document if so” function?
If you just google the string, there are many instances of people sharing it verbatim. Would be good to do further testing to know if it was actually trained on the benchmark or learned through many other sources.
I sent some related project ideas to @RogerDearnaley via DMs, but figured I should share them here too, in case someone would like to give feedback or collaborate on one of them.
I think data is underrated among the alignment community (synthetic/transformed data even more). I have been thinking about it from the perspective of pre-training and post-training. My initial look into synthetic data was related to online learning and essentially controlling model behaviour. I was interested in papers like this one by Google, where they significantly reduce sycophancy in an LLM via 1k synthetically generated examples. Data shapes behaviour, and I think many people do not acknowledge this enough (which sometimes leads them to make confused conclusions about model behaviour).
In terms of specific research projects, my current ideas fall into these kinds of buckets:
Pre-training close to the basin of attraction for alignment
- How much can we improve "Pretraining Language Models with Human Preferences"? I'd like to transform the training data in various ways (as mentioned in your posts). For example, I could take fineweb and pre-train a GPT-2 sized model on the original dataset and on a transformed version (a minimal conditional-training sketch follows this list). It's unclear so far which things I'd like to measure the most at that model size, though. A downstream experiment: is one model more likely to reward hack than the other? Does shard theory help us come up with useful experiments (pre-training with human feedback is almost like reinforcing behaviour and leveraging some form of shard theory)? Note that Google used a similar pre-training scheme for PaLM 2.
- How can the "basin of attraction for alignment" be mathematically formalized?
- Trying to understand the impact of systematic errors:
- Studying reward misspecification: do the reward labels have a systematic effect and bias in pushing the model? How much of the model's behaviour is determined by the data itself vs. the reward model's misspecification? My current reading of the literature on this is a bit unclear. However, there's a paper saying: "We present a novel observation about the behaviour of offline reinforcement learning (RL) algorithms: on many benchmark datasets, offline RL can produce well-performing and safe policies even when trained with "wrong" reward labels, such as those that are zero everywhere or are negatives of the true rewards."
- How do we design the training curriculum to significantly bias the model's pre-training close to the basin of attraction for alignment?
- Studying some form of iterative training where we have a synthetically trained model vs a normally trained model and then measure things like model drift. For example, is the model more likely to drift (in an online setting) in ways we wouldn't want it to if it is pre-trained on normal text, but the process is more safely guided through synthetic pre-training?
- Part of the alignment challenge (for example, the concern of scheming AIs) is that the order in which the model learns things might matter. For example, you'd want the model to internalize a solid world model of human values before it gains the situational awareness required to manipulate its training process (scheme). So, can we design a training curriculum for specific capabilities s.t. the model learns capabilities in an ideal sequence?
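For the first bullet above, here is a minimal sketch of the conditional-training variant from "Pretraining Language Models with Human Preferences" as I understand it (prepend control tokens based on a document-level score); the token names, scorer, and threshold are placeholders:

```python
# Tag each pre-training document with a control token based on a score from a
# rule-based filter or reward model, then pre-train normally on the tagged
# corpus (with the tags added as special tokens) and condition on the "good"
# token at inference time.
GOOD, BAD = "<|good|>", "<|bad|>"

def tag_document(text: str, score: float, threshold: float = 0.5) -> str:
    # `score` would come from whatever preference/quality scorer the experiment uses.
    token = GOOD if score >= threshold else BAD
    return f"{token}{text}"

# The transformed-data comparison mentioned above would swap this tagging step
# for an LLM rewrite of each document instead.
```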
Data attribution project ideas
- How to make this approach work in tandem with unlearning?
- Use data attribution methods to understand how specific data shapes model behaviour and use that information to reconstruct pre-training to shape model behaviour in the way we want. For example, can we side-step the need for unlearning? Can these data attribution methods augment unlearning to work better?
- As Roger said in his comment, we can try to manage the dataset to prevent WMD-dangerous capabilities and things like self-replication. It's quite possible that unlearning will not be enough.
- Another project would be to fine-tune on a dataset with and without the dangerous capabilities we don't want and use that as a benchmark for unlearning methods (and how easy it is to fine-tune the capability back into the model).
- Including other methods beyond data attribution (e.g. SAEs) to measure model evolution through training.
- Is it possible to better understand and predict emergence via data attribution?
- Studying model generalization via data attribution (doing similar things to the influence functions paper, but through time). Though the most interesting behaviour may only come at scales I wouldn't have the compute for.
- Would there be value in using an early checkpoint in training and then training on the synthetic data from that point forward? At which point in training does this make sense to do?
It's cool that you point to @Tomek Korbak because I was wondering if we could think of ways to extend his Pretraining Language Models with Human Preferences paper in ways that Roger mentions in his post.
Happy to chat!
Just a heads up, it's been 2 months!
Recent paper I thought was cool:
In-Run Data Shapley: Data attribution method efficient enough for pre-training data attribution.
Essentially, it can track how individual data points (or clusters) impact model performance across pre-training. You just need a set of validation examples to continually check the model's performance against during pre-training. Amazingly, you can do this over the course of a single training run; there is no need for multiple pre-training runs like other data attribution methods have required.
Other methods, like influence functions, are too computationally expensive to run during pre-training and can only be run post-training.
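To show the shape of the computation, here is a naive first-order sketch of the idea as I understand it; the actual paper uses far more efficient "ghost" gradient tricks to make this feasible inside a single run, and everything here (including the loss_fn signature) is my own illustration:

```python
# Score each training example by the dot product between its gradient and the
# validation-set gradient at the current parameters.
import torch

def per_example_contributions(model, loss_fn, train_examples, val_batch, lr):
    # Gradient of the validation loss at the current parameters.
    model.zero_grad()
    loss_fn(model, val_batch).backward()
    params = [p for p in model.parameters() if p.grad is not None]
    val_grad = [p.grad.detach().clone() for p in params]

    scores = []
    for example in train_examples:  # one example at a time (slow but clear)
        model.zero_grad()
        loss_fn(model, [example]).backward()
        dot = sum((p.grad * vg).sum()
                  for p, vg in zip(params, val_grad) if p.grad is not None)
        # First-order estimate of how much an SGD step on this example
        # would reduce the validation loss.
        scores.append((lr * dot).item())
    model.zero_grad()
    return scores
```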
So, here's why this might be interesting from an alignment perspective:
- You might be able to set up a bunch of validation examples to test specific behaviour in the models so that we are hyper-aware of which data points contribute the most to that behaviour. For example, self-awareness or self-preservation.
- Given that this is possible to run during pre-training, you might understand model behaviour at such a granular level that you can construct data mixtures/curricula that push the model towards internalizing 'human values' much sooner than it develops behaviours or capabilities we wouldn't want. Or you could delay self-awareness and the like until much further along in the training process.
- In this @RogerDearnaley post, A "Bitter Lesson" Approach to Aligning AGI and ASI, Roger proposes training an AI on a synthetic dataset where all intelligences are motivated by the collective well-being of humanity. You are trying to bias the model to be as close to the basin of attraction for alignment as possible. In-Run Data Shapley could be used to construct such a dataset and guide the training process so that the training data best exemplifies the desired aligned behaviour.
If you are interested in this kind of research, let me know! I'd love to brainstorm some potential projects and then apply for funding if there is something promising there.
Ok, totally; there's no specific claim about ASI. Will edit the wording.
Hey @Zac Hatfield-Dodds, I noticed you are looking for citations; these are the interview bits I came across (and here at 47:31).
It's possible I misunderstood him; please correct me if I did!
I like this kind of idea and have been thinking about it myself. It just makes total sense that all of the training data for the model should at least be passed through a model and augmented/transformed in some fashion, so that the next-generation training run uses data that has been meticulously curated by a model following the ideal set of values/constitution we'd want them to have. You give the 'bomb' example; I often used a "Mein Kampf" example, where you place that kind of data in the context of how we'd want an AI to interpret it rather than treating it as equal to any other piece of text.
The post reminds me of Beren's blog post: "Alignment in the Age of Synthetic Data."
This post also reminds me of the "Alignment Bitter Lesson" I've been ruminating on lately (because I was considering writing a short post on it):
If your alignment agenda doesn’t take into account growing model capabilities, it will be worthless.
Or Davidad's version:
Any alignment scheme that doesn’t have a good way to leverage increasing AI capabilities to automate most of the R&D required for creating the alignment tech will not be relevant.
Dario Amodei believes that LLMs/AIs can be aided to self-improve in a similar way to AlphaGo Zero (though LLMs/AIs will benefit from other things too, like scale), where the models can learn by themselves to gain significant capabilities.
The key for him is that Go has a set of rules that the AlphaGo model needs to abide by. These rules allow the model to become superhuman at Go with enough compute.
Dario essentially believes that to reach better capabilities, it will help to develop rules for all the domains we care about and that this will likely be possible for more real-world tasks (not just games like Go).
Therefore, I think the crux here is whether you think it is possible to develop rules for science (physics, chemistry, math, biology) and other domains such that the models can do this sort of self-play to become superhuman at each of the things we care about.
So far, we have examples like AlphaGeometry, which relies on our ability to generate many synthetic examples to help the model learn. This makes sense for the geometry use case, but how do we know if this kind of approach will work for the kinds of things we actually care about? For games and geometry, this seems possible, but what about developing a cure for Alzheimer's or coming up with novel scientific breakthroughs?
So, you've got some of the following issues to resolve:
- Success metrics
- Potentially much slower feedback loops
- Need real-world testing
That said, I think Dario is banking on:
- AIs will have a large enough world model that they can essentially set up 'rules' that provide enough signal to the model in domains other than games and 'special cases' like geometry. For example, they can run physics simulations of optimal materials informed by the latest research papers and use key metrics from the simulation as high-quality signals, reducing the amount of real-world feedback needed. Or, for code, having unit tests along with the goal.
- For most of the things we care about (like writing code), AIs will be able to become superhuman, which will then lift up other domains that couldn't become superhuman without them.
- Science has been bottlenecked by slow humans, increasing complexity, and bad coordination; AIs will be able to resolve these issues.
- Even if you can't generate novel breakthrough synthetic data, you can use synthetic data to nudge your model along the path to making breakthroughs.
Thoughts?