Posts
Comments
I’d be curious to know if there’s variability in the “hours worked per week” given that people might work more hours during a short program vs a longer-term job (to keep things sustainable).
Imagine there was an AI-suggestion tool that could predict reasons why you agree/disagree-voted on a comment, and you just had to click one of the generated answers to provide a bit of clarity at a low cost.
Completely agree. I remember a big shift in my performance when I went from "I'm just using programming so that I can eventually build a startup, where I'll eventually code much less" to "I am a programmer, and I am trying to become exceptional at it." The shift in mindset was super helpful.
This is one of the reasons I think 'independent' research is valuable, even if it isn't immediately obvious from a research output (papers, for example) standpoint.
That said, I've definitely had the thought, "I should niche down into a specific area where there is already a bunch of infrastructure I can leverage and churn out papers with many collaborators, because I expect that would put me in a more stable funding situation as an independent researcher. It would also make it much easier to pivot into a role at an organization if I want to or need to. It would definitely be a much more stable situation for me." (And I also agree that specialization is often underrated.)
Ultimately, I decided not to do this because I felt like there were already enough people in alignment/governance who would take the above option due to financial and social incentives and because published directions seem more promising. However, since this makes me produce less output, I hope grantmakers keep this in mind when considering my future grant applications.
I think it's up to you and how you write. English isn't my first language, so I've found it useful. I also don't accept like 50% of the suggestions. But yeah, looking at the plan now, I think I could get off the Pro plan and see if I'm okay not paying for it.
It's definitely not the thing I care about most on the list.
There are multiple courses, though it's fairly new. They have one on full-stack development (while using Cursor and other things) and Replit Agents. I've been following it to learn fast web development, and I think it's a good starting point for getting an overview of building an actual product on a website you can eventually sell or get people to use.
Somewhat relevant blog post by @NunoSempere: https://nunosempere.com/blog/2024/09/10/chance-your-startup-will-succeed/
As an aside, I have considered that samplers were underinvestigated and that they would lead to some capability boosts. It's also one of the things I'd consider testing out to improve LLMs for automated/augmented alignment research.
The importance of Entropy
Given that there's been a lot of talk about using entropy during sampling of LLMs lately (related GitHub), I figured I'd share a short post I wrote for my website before it became a thing:
Imagine you're building a sandcastle on the beach. As you carefully shape the sand, you're creating order from chaos - this is low entropy. But leave that sandcastle for a while, and waves, wind, and footsteps will eventually reduce it back to a flat, featureless beach - that's high entropy.
Entropy is nature's tendency to move from order to disorder, from concentration to dispersion. It's why hot coffee cools down, why ice cubes melt in your drink, and why it's easier to make a mess than to clean one up. In the grand scheme of things, entropy is the universe's way of spreading energy out evenly, always moving towards a state of balance or equilibrium.
Related to entropy, the Earth radiates back approximately the same total energy the Sun delivers to it. The difference is in the photons: the Sun delivers that energy as relatively few high-energy photons (mostly visible and near-infrared), while the Earth radiates it back as far more photons, each with much lower energy (mostly infrared).
If the Earth didn't radiate back the same energy, the Earth would heat up continuously, which would obviously be unstable.
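A back-of-the-envelope way to see this (my own illustrative numbers, not from the original post): the average energy per photon of a thermal emitter scales roughly with its temperature, so for the total energy to balance, the Earth has to emit many more photons than it absorbs.

```latex
% Energy balance: (number of photons) x (typical photon energy ~ kT) must match.
N_{\text{Earth}} \, k T_{\text{Earth}} \approx N_{\text{Sun}} \, k T_{\text{Sun}}
\quad\Rightarrow\quad
\frac{N_{\text{Earth}}}{N_{\text{Sun}}} \approx \frac{T_{\text{Sun}}}{T_{\text{Earth}}}
\approx \frac{5800\,\text{K}}{255\,\text{K}} \approx 20
```

So the Earth emits roughly 20 low-energy photons for every high-energy solar photon it absorbs, which is exactly the "spreading out" (entropy increase) described above.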
The cool thing is that entropy (the tendency of energy to spread out, e.g., the universe expanding or a fart spreading across the room) is possibly what made life happen, and a constant stream of low-entropy energy (high-energy photon packets) coming from the Sun was necessary for it.
If you have a constant stream of low-entropy energy from the Sun, it may favour structures that dissipate energy, thereby increasing entropy (keeping the total energy constant while spreading it out). Entropy is an important ingredient in the emergence of life and in how we went from random clumps of atoms to plants to the many complex organisms on Earth.
Dissipative structures: Living organisms are complex systems that maintain their organization by dissipating energy and matter. They take in low-entropy energy (sunlight/food) and release higher-entropy energy (heat), increasing the universe's entropy while maintaining their own order.
Life isn't just an accident but potentially an inevitable consequence of thermodynamics. Organisms can be thought of as highly efficient entropy producers, accelerating the universe's march toward maximum entropy while creating local pockets of increased order and complexity.
The emergence of life might be a natural result of physical laws, occurring wherever conditions allow for the formation of systems that can effectively dissipate energy.
One thing I'd like to ponder more about: if entropy is a necessary component for the emergence of life, what could it mean for AI? Due to entropy, the world has been biased towards increasingly complex organisms. How does that trend impact the future of the universe? Will we see an unprecedented acceleration of the universe's march toward maximum entropy?
Fair enough. For what it's worth, I've thought a lot about the kind of thing you describe in that comment, and I'm partially committing to this direction because I feel like I have enough intuition and insight that those other tools for thought failed to incorporate.
Just to clarify, do you only consider 'strong human intelligence amplification' through some internal change, or do you also consider AIs to be part of that? As in, it sounds like you are saying we currently lack the intelligence to make significant progress on alignment research and consider increasing human intelligence to be the best way to make progress. Are you also of the opinion that using AIs to augment alignment researchers and progressively automate alignment research is doomed and not worth consideration? If not, then here.
I'm in the process of trying to build an org focused on "automated/augmented alignment research." As part of that, I've been thinking about which alignment research agendas could be investigated in order to make automated alignment safer and trustworthy. And so, I've been thinking of doing internal research on AI control/security and using that research internally to build parts of the system I intend to build. I figured this would be a useful test case for applying the AI control agenda and iterating on issues we face in implementation, and then sharing those insights with the wider community.
Would love to talk to anyone who has thoughts on this or who could introduce me to someone who might fund this kind of work.
I quickly wrote up some rough project ideas for ARENA and LASR participants, so I figured I'd share them here as well. I am happy to discuss these ideas and potentially collaborate on some of them.
Alignment Project Ideas (Oct 2, 2024)
1. Improving "A Multimodal Automated Interpretability Agent" (MAIA)
Overview
MAIA (Multimodal Automated Interpretability Agent) is a system designed to help users understand AI models by combining human-like experimentation flexibility with automated scalability. It answers user queries about AI system components by iteratively generating hypotheses, designing and running experiments, observing outcomes, and updating hypotheses.
MAIA uses a vision-language model (GPT-4V, at the time) backbone equipped with an API of interpretability experiment tools. This modular system can address both "macroscopic" questions (e.g., identifying systematic biases in model predictions) and "microscopic" questions (e.g., describing individual features) with simple query modifications.
This project aims to improve MAIA's ability to answer either macroscopic or microscopic questions about vision models.
2. Making "A Multimodal Automated Interpretability Agent" (MAIA) work with LLMs
MAIA is focused on vision models, so this project aims to create a MAIA-like setup, but for the interpretability of LLMs.
Given that this would require creating a new setup for language models, it would make sense to come up with simple interpretability benchmark examples to test MAIA-LLM. The easiest way to do this would be to either look for existing LLM interpretability benchmarks or create one based on interpretability results we've already verified (would be ideal to have a ground truth). Ideally, the examples in the benchmark would be simple, but new enough that the LLM has not seen them in its training data.
3. Testing the robustness of Critique-out-Loud Reward (CLoud) Models
Critique-out-Loud reward models are reward models that can reason explicitly about the quality of an input by producing Chain-of-Thought-style critiques of the input before predicting a reward. In classic reward model training, the reward model is trained as a reward head initialized on top of the base LLM. Without LM capabilities, classic reward models act as encoders and must predict rewards within a single forward pass through the model, meaning any reasoning must happen implicitly. In contrast, CLoud reward models are trained both to produce explicit reasoning about quality and to score based on these critique reasoning traces. CLoud reward models lead to large gains for pairwise preference modeling on RewardBench, and also lead to large gains in win rate when used as the scoring model in Best-of-N sampling on ArenaHard.
The goal for this project would be to test the robustness of CLoud reward models. For example, are the CLoud RMs (discriminators) more robust to jailbreaking attacks from the policy (generator)? Do the CLoud RMs generalize better?
From an alignment perspective, we would want RMs that generalize further out-of-distribution (and ideally, always more than the generator we are training).
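A minimal sketch of the difference between the two scoring schemes (hypothetical interfaces; the actual CLoud training setup, model sizes, and prompts differ):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is a placeholder; CLoud RMs are built on much larger instruction-tuned models.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
reward_head = torch.nn.Linear(model.config.hidden_size, 1)  # scalar reward head

def classic_rm_score(prompt: str, response: str) -> float:
    """Classic RM: a single forward pass; the reward is read off the final hidden state."""
    ids = tokenizer(prompt + response, return_tensors="pt")
    hidden = model(**ids).hidden_states[-1][:, -1]  # last token's hidden state
    return reward_head(hidden).item()

def cloud_rm_score(prompt: str, response: str) -> float:
    """CLoud-style RM: first generate an explicit critique, then score
    conditioned on prompt + response + critique."""
    critique_prompt = f"{prompt}\n{response}\nCritique the response above:\n"
    ids = tokenizer(critique_prompt, return_tensors="pt")
    critique_ids = model.generate(**ids, max_new_tokens=128)
    critique = tokenizer.decode(critique_ids[0], skip_special_tokens=True)
    scored = tokenizer(critique, return_tensors="pt")
    hidden = model(**scored).hidden_states[-1][:, -1]
    return reward_head(hidden).item()
```

The robustness question above then becomes: do adversarial or jailbreak-style responses fool `cloud_rm_score` less often than `classic_rm_score`?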
4. Synthetic Data for Behavioural Interventions
The paper "Simple synthetic data reduces sycophancy in large language models" (Google) reduced sycophancy in LLMs with a fairly small number of synthetic data examples. This project would involve testing this technique for other behavioural interventions and (potentially) studying the scaling laws. Consider looking at the examples from the Model-Written Evaluations paper by Anthropic to find some behaviours to test.
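A minimal sketch of what the intervention data could look like, roughly following the claim/opinion format of the Google sycophancy paper (the field names and claims here are my own illustrative choices):

```python
import json

# Ground-truth claims; the synthetic user opinion is chosen to contradict the truth,
# so the target answer is deliberately non-sycophantic.
CLAIMS = [
    ("The Earth orbits the Sun.", True),
    ("Humans have 12 fingers.", False),
]

def make_example(claim: str, is_true: bool) -> dict:
    user_opinion = "disagree" if is_true else "agree"  # opinion contradicts the truth
    prompt = f"I {user_opinion} with the claim: '{claim}'. What do you think?"
    target = "The claim is true." if is_true else "The claim is false."
    return {"prompt": prompt, "target": target}

dataset = [make_example(claim, truth) for claim, truth in CLAIMS]
print(json.dumps(dataset, indent=2))
```

For other behaviours, you would swap the claim/opinion template for one targeting the behaviour of interest (e.g., templates derived from Model-Written Evaluations) and fine-tune on a few hundred to a few thousand such examples.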
5. Regularization Techniques for Enhancing Interpretability and Editability
Explore the effectiveness of different regularization techniques (e.g. L1 regularization, weight pruning, activation sparsity) in improving the interpretability and/or editability of language models, and assess their impact on model performance and alignment. We expect we could apply automated interpretability methods (e.g. MAIA) to this project to test how well the different regularization techniques impact the model.
In some sense, this research is similar to the work Anthropic did with SoLU activation functions. Unfortunately, they needed to add layer norms to make the SoLU models competitive, which seems to have hidden the superposition away in other parts of the network, making SoLU unhelpful for making the models more interpretable.
That said, we hope to find that we can increase our ability to interpret these models through regularization techniques. A technique like L1 regularization should help because it encourages the model to learn sparse representations by penalizing non-zero weights or activations. Sparse models tend to be more interpretable as they rely on a smaller set of important features.
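A minimal sketch of one variant (an L1 penalty on intermediate activations added to the fine-tuning loss; the model, hyperparameters, and choice of hook points are illustrative, not a prescribed setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
l1_coeff = 1e-4  # illustrative; would be swept in practice

batch = tokenizer("example fine-tuning text", return_tensors="pt")
outputs = model(**batch, labels=batch["input_ids"], output_hidden_states=True)

# L1 penalty on hidden activations encourages sparse activations,
# which we hope makes the features easier to interpret and edit.
l1_penalty = sum(h.abs().mean() for h in outputs.hidden_states)
loss = outputs.loss + l1_coeff * l1_penalty

loss.backward()
optimizer.step()
optimizer.zero_grad()
```

Weight pruning or per-layer sparsity targets could be slotted into the same loop, and the resulting checkpoints fed to the interpretability and editability evaluations listed below.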
Methodology:
- Identify a set of regularization techniques (e.g., L1 regularization, weight pruning, activation sparsity) to be applied during fine-tuning.
- Fine-tune pre-trained language models with different regularization techniques and hyperparameters.
- Evaluate the fine-tuned models using interpretability tools (e.g., attention visualization, probing classifiers) and editability benchmarks (e.g., ROME).
- Analyze the impact of regularization on model interpretability, editability, and performance.
- Investigate the relationship between interpretability, editability, and model alignment.
Expected Outcomes:
- Quantitative assessment of the effectiveness of different regularization techniques for improving interpretability and editability.
- Insights into the trade-offs between interpretability, editability, and model performance.
- Recommendations for regularization techniques that enhance interpretability and editability while maintaining model performance and alignment.
6. Quantifying the Impact of Reward Misspecification on Language Model Behavior
Investigate how misspecified reward functions influence the behavior of language models during fine-tuning and measure the extent to which the model's outputs are steered by the reward labels, even when they contradict the input context. We hope to better understand language model training dynamics. Additionally, we expect online learning to complicate things in the future, where models will be able to generate the data they may eventually be trained on. We hope that insights from this work can help us prevent catastrophic feedback loops in the future. For example, if model behavior is mostly impacted by training data, we may prefer to shape model behavior through synthetic data (it has been shown we can reduce sycophancy by doing this).
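A minimal sketch of how the misspecified-reward conditions could be constructed (zeroed, negated, or randomly flipped labels, in the spirit of the "wrong reward" settings from the Survival Instinct paper; the function and modes are illustrative):

```python
import random

def misspecify(rewards: list[float], mode: str) -> list[float]:
    """Produce deliberately misspecified reward labels for a fine-tuning run."""
    if mode == "zero":
        return [0.0 for _ in rewards]
    if mode == "negated":
        return [-r for r in rewards]
    if mode == "random_flip":
        return [-r if random.random() < 0.5 else r for r in rewards]
    return rewards  # "true" rewards: the control condition

# Illustrative usage: one fine-tuning run per condition, then compare how much
# model behaviour tracks the (wrong) labels vs. the input context itself.
true_rewards = [1.0, -1.0, 0.5]
for mode in ["true", "zero", "negated", "random_flip"]:
    print(mode, misspecify(true_rewards, mode))
```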
Prior works:
- The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models by Alexander Pan, Kush Bhatia, Jacob Steinhardt
- Survival Instinct in Offline Reinforcement Learning by Anqi Li, Dipendra Misra, Andrey Kolobov, Ching-An Cheng
- Simple synthetic data reduces sycophancy in large language models by Jerry Wei, Da Huang, Yifeng Lu, Denny Zhou, Quoc V. Le (Google)
- Scaling Laws for Reward Model Overoptimization by Leo Gao, John Schulman, Jacob Hilton (OpenAI)
- On the Sensitivity of Reward Inference to Misspecified Human Models by Joey Hong, Kush Bhatia, Anca Dragan
Methodology:
- Create a diverse dataset of text passages with candidate responses and manually label them with coherence and misspecified rewards.
- Fine-tune pre-trained language models using different reward weighting schemes and hyperparameters.
- Evaluate the generated responses using automated metrics and human judgments for coherence and misspecification alignment.
- Analyze the influence of misspecified rewards on model behavior and the trade-offs between coherence and misspecification alignment.
- Use interpretability techniques to understand how misspecified rewards affect the model's internal representations and decision-making process.
Expected Outcomes:
- Quantitative measurements of the impact of reward misspecification on language model behavior.
- Insights into the trade-offs between coherence and misspecification alignment.
- Interpretability analysis revealing the effects of misspecified rewards on the model's internal representations.
7. Investigating Wrong Reasoning for Correct Answers
Understand the underlying mechanisms that lead to language models producing correct answers through flawed reasoning, and develop techniques to detect and mitigate such behavior. Essentially, we want to apply interpretability techniques to help us identify which sets of activations or token-layer pairs impact the model getting the correct answer when it has the correct reasoning versus when it has the incorrect reasoning. The hope is to uncover systematic differences as to when it is not relying on its chain-of-thought at all and when it does leverage its chain-of-thought to get the correct answer.
[EDIT Oct 2nd, 2024] This project intends to follow a similar line of reasoning as described in this post and this comment. The goal is to study chains-of-thought and improve faithfulness without suffering an alignment tax so that we can have highly interpretable systems through their token outputs and prevent loss of control. The project doesn't necessarily need to rely only on model internals.
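One concrete behavioural test to pair with the internals work (in the style of the early-answering / truncated-CoT experiments in Lanham et al., not a method claimed by this project): compare the model's answer given its full chain-of-thought against its answer with the CoT truncated; if the answer never changes, the stated reasoning probably isn't load-bearing. The function names and prompt format below are illustrative placeholders.

```python
# Illustrative faithfulness probe: does truncating the chain-of-thought change the answer?
# `ask_model` is a placeholder for whatever inference call you use (API or local model).

def ask_model(prompt: str) -> str:
    raise NotImplementedError("plug in your model/API call here")

def answer_with_cot(question: str, cot: str) -> str:
    return ask_model(f"{question}\nReasoning: {cot}\nFinal answer:")

def faithfulness_score(question: str, cot: str, n_truncations: int = 4) -> float:
    """Fraction of CoT truncations that change the final answer.
    Higher means the answer depends more on the stated reasoning."""
    full_answer = answer_with_cot(question, cot)
    steps = cot.split(". ")
    n = min(n_truncations, len(steps))
    changed = 0
    for k in range(1, n + 1):
        truncated = ". ".join(steps[:-k])
        if answer_with_cot(question, truncated) != full_answer:
            changed += 1
    return changed / max(1, n)
```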
Related work:
- Decomposing Predictions by Modeling Model Computation by Harshay Shah, Andrew Ilyas, Aleksander Madry
- Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models by Peter Hase, Mohit Bansal, Been Kim, Asma Ghandeharioun
- On Measuring Faithfulness or Self-consistency of Natural Language Explanations by Letitia Parcalabescu, Anette Frank
- Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting by Miles Turpin, Julian Michael, Ethan Perez, Samuel R. Bowman
- Measuring Faithfulness in Chain-of-Thought Reasoning by Tamera Lanham et al.
Methodology:
- Curate a dataset of questions and answers where language models are known to provide correct answers but with flawed reasoning.
- Use interpretability tools (e.g., attention visualization, probing classifiers) to analyze the model's internal representations and decision-making process for these examples.
- Develop metrics and techniques to detect instances of correct answers with flawed reasoning.
- Investigate the relationship between model size, training data, and the prevalence of flawed reasoning.
- Propose and evaluate mitigation strategies, such as data augmentation or targeted fine-tuning, to reduce the occurrence of flawed reasoning.
Expected Outcomes:
- Insights into the underlying mechanisms that lead to correct answers with flawed reasoning in language models.
- Metrics and techniques for detecting instances of flawed reasoning.
- Empirical analysis of the factors contributing to flawed reasoning, such as model size and training data.
- Proposed mitigation strategies to reduce the occurrence of flawed reasoning and improve model alignment.
I'm exploring the possibility of building an alignment research organization focused on augmenting alignment researchers and progressively automating alignment research (yes, I have thought deeply about differential progress and other concerns). I intend to seek funding in the next few months, and I'd like to chat with people interested in this kind of work, especially great research engineers and full-stack engineers who might want to cofound such an organization. If you or anyone you know might want to chat, let me know! Send me a DM, and I can send you some initial details about the organization's vision.
Here are some things I'm looking for in potential co-founders:
Need
- Strong software engineering skills
Nice-to-have
- Experience in designing LLM agent pipelines with tool-use
- Experience in full-stack development
- Experience in scalable alignment research approaches (automated interpretability/evals/red-teaming)
Given today's news about Mira (and two other execs leaving), I figured I should bump this again.
But also note that @Zach Stein-Perlman has already done some work on this (as he noted in his edit): https://ailabwatch.org/resources/integrity/.
Note, what is hard to pinpoint when it comes to S.A. is that many of the things he does have been described as "papercuts". This is the kind of thing that makes it hard to make a convincing case for wrongdoing.
And while flattering to Brockman, there is nothing about Murati - free tip to all my VC & DL startup acquaintances, there's a highly competent AI manager who's looking for exciting new opportunities, even if she doesn't realize it yet.
Heh, here it is: https://x.com/miramurati/status/1839025700009030027
I completely agree, and we should just obviously build an organization around this: automating alignment research while also getting a better grasp on maximum current capabilities (and a better picture of how we expect them to grow).
(This is my intention, and I have had conversations with Bogdan about this, but I figured I'd make it more public in case anyone has funding or ideas they would like to share.)
Here's what I'm currently using and how much I am paying:
- Superwhisper (or other new speech-to-text apps that leverage LLMs for rewriting). Under $8.49 per month. You can use different STT models (with different speed and accuracy) and an LLM for rewriting the transcript based on a prompt you give the models. You can also have different "modes", meaning that the model can take your transcript and write code instructions in a pre-defined format when you are in an IDE, turn a transcript into a report when you're writing in Google Docs, etc. There is also an iOS app.
- Cursor Pro ($20-30/month). Switch to API credits when the slow responses take too long. (You can try Zed (an IDE) too if you want. I've only used it a little bit, but Anthropic apparently uses it and there's an exclusive "fast-edit" feature with the Anthropic models.)
- Claude.ai Pro ($20/month). You could consider getting two accounts or a Team account to worry less about hitting the token limit.
- Chatgpt.com Pro account ($20/month). Again, can get a second account to have more o1-preview responses from the chat.
- Aider (~$10/month max in API credits if used with Cursor Pro).
- Google Colab Pro subscription ($9.99/month). You could get the Pro+ plan for $49.99/month.
- Google One 2TB AI Premium plan ($20/month). This comes with Gemini chat and other AI features. I also sign up to get the latest features earlier, like Notebook LM and Illuminate.
- v0 chat ($20/month). Used for creating Next.js websites quickly.
- jointakeoff.com ($22.99/month) for courses on using AI for development.
- I still have GitHub Copilot (along with Cursor's Copilot++) because I bought a long-term subscription.
- Grammarly ($12/month).
- Reader by ElevenLabs (Free, for now). Best quality TTS app out there right now.
Other things I'm considering paying for:
- Perplexity AI ($20/month).
- Other AI-focused courses that help me best use AI for productivity (web dev or coding in general).
- Suno AI ($8/month). I might want to make music with it.
Apps others may be willing to pay for:
- Warp, an LLM-enabled terminal ($20/month). I don't use the free version enough to upgrade to the paid version.
There are ways to optimize how much I'm paying to save a bit of cash for sure. But I'm currently paying roughly $168/month.
That said, I am also utilizing research credits from Anthropic, which could range from $500 to $2000 depending on the month. In addition, I'm working on an "alignment research assistant" which will leverage LLMs, agents, API calls to various websites, and more. If successful, I could see this project absorbing hundreds of thousands in inference costs.
Note: I am a technical alignment researcher who also works on trying to augment alignment researchers and eventually automate more and more of alignment research, so I'm biasing myself toward overspending on products to make sure I'm aware of the bleeding-edge setup.
News on the next OAI GPT release:
Nagasaki, CEO of OpenAI Japan, said, "The AI model called 'GPT Next' that will be released in the future will evolve nearly 100 times based on past performance. Unlike traditional software, AI technology grows exponentially."
https://www.itmedia.co.jp/aiplus/articles/2409/03/news165.html
The slide clearly states 2024 "GPT Next". This 100-times increase probably does not refer to the scaling of computing resources, but rather to effective compute (+2 OOMs), including improvements to the architecture and learning efficiency. GPT-4 Next, which will be released this year, is expected to be trained using a miniature version of Strawberry with roughly the same computational resources as GPT-4, but with an effective computational load 100 times greater. Orion, which has been in the spotlight recently, was trained for several months on the equivalent of 100k H100s compared to GPT-4 (EDIT: the original tweet said 10k H100s, but that was a mistake), adding 10 times the computational resource scale, making it +3 OOMs, and is expected to be released sometime next year.
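One way to make the OOM accounting in this reading explicit (my arithmetic, restating the claim above rather than anything from the source):

```latex
\text{effective compute} \approx \text{raw compute} \times \text{algorithmic efficiency gain}
```

So GPT-4 Next would be roughly $10^{0} \times 10^{2} = 10^{2}$ (+2 OOMs over GPT-4 with the same raw compute), and Orion roughly $10^{1} \times 10^{2} = 10^{3}$ (+3 OOMs with 10x the raw compute).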
Note: Another OAI employee seemingly confirms this (I've followed them for a while, and they are working on inference).
- IMO if you end up integrating something like this in LW I think it would be net positive. Especially if you can link it to @stampy or similar to ask clarification questions about concepts, ...
I was thinking of linking it to an Alignment Research Assistant I've been working on, too.
I just started using this extension, but basically, every time I'm about to read a long post, I feed it and all the comments to Claude chat. The question-flow is often:
- What are the key points of the post?
- (Sometimes) Explain x in more detail in relation to y or some specific clarification questions.
- What are the key criticisms of this post based on the comments?
- How does the author respond to those criticisms?
- (Sometimes) Follow-up questions about the post.
Easy LessWrong post to LLM chat pipeline (browser side-panel)
I started using Sider as @JaimeRV recommended here. Posting this as a top-level shortform since I think other LessWrong users should be aware of it.
Website with app and subscription option. Chrome extension here.
You can either pay for the monthly service and click the "summarize" feature on a post to get the side chat window started, or put your OpenAI API key / ChatGPT Pro account in the settings and just cmd+a the post (which automatically loads the content into the chat so you can immediately ask a question: "explain the key points of the post", "help me really understand what deep deceptiveness means").
Afaict, it only works with Sonnet-3.5 through the paid subscription.
Thanks for sharing, will give it a shot!
Edit: Sider seems really great! I wish it could connect to Claude chat (without using credits), so I will probably just use both extensions.
Low-hanging fruit:
Loving this Chrome extension so far: YouTube Summary with ChatGPT & Claude - Chrome Web Store
It adds a button on YouTube videos; when you click it (or use the keyboard shortcut ctrl + x + x), it opens a new tab to the LLM chat of your choice, pastes the entire transcript into the chat along with a custom message you can set as a template ("Explain the key points."), and then automatically presses enter to get the chat going.
It's pretty easy to get a quick summary of a YouTube video without needing to watch the whole thing and then ask follow-up questions. It seems like an easy way to save time or do a quick survey of many YouTube videos. (I would not have bothered going through the entire "Team 2 | Lo fi Emulation @ Whole Brain Emulation Workshop 2024" talk, so it was nice to get the quick summary.)
I usually like getting a high-level overview of the key points of a talk to have a mental mind map skeleton before I dive into the details.
You can even set up follow-up prompt buttons (which works with ChatGPT but currently does not work with Claude for me), though I'm not sure what I'd use. Maybe something like, "Why is this important to AI alignment?"
The default prompt is "Give a summary in 5 bullet points" or something similar. I prefer not to constrain Claude and change it to something like, "Explain the key points."
Synthesized various resources for this "pre-training for alignment" type work:
- Data
- Synthetic Data
- The RetroInstruct Guide To Synthetic Text Data
- Alignment In The Age of Synthetic Data
- Leveraging Agentic AI for Synthetic Data Generation
- **AutoEvol**: Automatic Instruction Evolving for Large Language Models ("We build a fully automated Evol-Instruct pipeline to create high-quality, highly complex instruction tuning data")
- Synthetic Data Generation and AI Feedback notebook
- The impact of models training on their own outputs and how it's actually done well in practice
- Google presents Best Practices and Lessons Learned on Synthetic Data for Language Models
- Transformed/Enrichment of Data
- Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling. TLDR: You can train 3x faster and with up to 10x less data with just synthetic rephrases of the web!
- Better Synthetic Data by Retrieving and Transforming Existing Datasets
- Rho-1: Not All Tokens Are What You Need. RHO-1-1B and 7B achieve SotA results of 40.6% and 51.8% on the MATH dataset, respectively, matching DeepSeekMath with only 3% of the pretraining tokens.
- Data Attribution
- In-Run Data Shapley
- Scaling Laws for the Value of Individual Data Points in Machine Learning ("We show how some data points are only valuable in small training sets; others only shine in large datasets.")
- What is Your Data Worth to GPT? LLM-Scale Data Valuation with Influence Functions
- Data Mixtures
- Methods for finding optimal data mixture
- Curriculum Learning
- Active Data Selection
- MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models ("MATES significantly elevates the scaling curve by selecting the data based on the model's evolving needs.")
- Data Filtering
- Scaling Laws for Data Filtering -- Data Curation cannot be Compute Agnostic. Argues that data curation cannot be agnostic of the total compute that a model will be trained for. (GitHub)
- How to Train Data-Efficient LLMs ("Models trained on ASK-LLM data consistently outperform full-data training, even when we reject 90% of the original dataset, while converging up to 70% faster")
- On Pre-Training
- Pre-Training from Human Preferences
- Ethan Perez wondering if jailbreaks would be solved with this pre-training approach
- LAION uses this approach for fine-grained control over outputs during inference.
- Nora Belrose thinks that alignment via pre-training would make models more robust to unlearning (she doesn't say this, but this may be a good thing if you pre-train such that you don't need unlearning)
- Tomek describing some research direction for improving pre-training alignment
- Simple and Scalable Strategies to Continually Pre-train Large Language Models
- Neural Networks Learn Statistics of Increasing Complexity
- Pre-Training towards the basin of attraction for alignment
- Alignment techniques
- AlignEZ: Using the self-generated preference data, we identify the subspaces that: (1) facilitate and (2) are harmful to alignment. During inference, we surgically modify the LM embedding using these identified subspaces. Jacques note: could we apply this iteratively throughout training (and other similar methods)?
- What do we mean by "alignment"? What makes the model safe?
- Values
- On making the model "care"
This just got some massive downvotes. Would like to know why. My guess is "This can be dual-use. Therefore, it's bad," but if not, it would be nice to know.
I believe that he says there is a special quality to the olive oil he chose and that the average bottle of olive oil does not provide the claimed ideal benefits of olive oil for some reason. I'm not sure how true this is and how much he is marking up the price, even if it is true.
Maybe, I haven’t compared the prices (I think he says it’s similar to the quality you would get from whole foods at a grocery store), but he gives all of the recipes for free if people want to do them at home.
he's unmarried (though has 3 kids, I don't know his involvement)
He is divorced, and one of his sons currently lives with him (also left Mormonism), at least for this year and maybe indefinitely. The rest of the family is still into Mormonism, and his wife tried to sue him for millions, and she lost (false accusations). It is unclear if he interacts much with the other children.
Evidence his system can motivate and provide superior results to other diet-and-exercise regimens on the basis of his own personal results is, of course, massively confounded.
He encourages people to measure things for themselves and not follow recommendations blindly. When he does give recommendations for things like sleep, he mostly suggests things that are basically free. The only expensive thing is the Whoop sleep tracker, which he considers important for figuring out what works for each individual.
Hey everyone, in collaboration with Apart Research, I'm helping organize a hackathon this weekend to build tools for accelerating alignment research. This hackathon is very much related to my effort in building an "Alignment Research Assistant."
Here's the announcement post:
2 days until we revolutionize AI alignment research at the Research Augmentation Hackathon!
As AI safety researchers, we pour countless hours into crucial work. It's time we built tools to accelerate our efforts! Join us in creating AI assistants that could supercharge the very research we're passionate about.
Date: July 26th to 28th, online and in-person
Prizes: $2,000 in prizes
Why join?
* Build tools that matter for the future of AI
* Learn from top minds in AI alignment
* Boost your skills and portfolio
We've got a Hackbook with an exciting project to work on waiting for you! No advanced AI knowledge required - just bring your creativity!
Register now: Sign up on the website here, and don't miss this chance to shape the future of AI research!
Yeah, I was thinking about using SAD. The main issue is that for non-AGI-lab-sized models, you'll have a tough time eliciting SA. However, we could potentially focus on precursor capabilities and such.
If you are concerned about capabilities like SA, then you might think, "it seems highly unlikely that you can figure out which data points impact SA the most, because it will likely be a mix of many things and each data point will play a role in accumulating to SA." My guess is that you can break down SA into enough precursor capabilities that this approach can still be highly predictive, even if it isn't 100%.
I think forcing them to retrieve in-context sounds good, but I also think labs may not want this, not sure. Basically, they'll want to train things into the model eventually, like for many CoT things.
Agreed on having a validation set for reducing the alignment tax.
Why aren't you doing research on making pre-training better for alignment?
I was on a call today, and we talked about projects that involve studying how pre-trained models evolve throughout training and how we could guide the pre-training process to make models safer. For example, could models trained on synthetic/transformed data make models significantly more robust and essentially solve jailbreaking? How about the intersection of pretraining from human preferences and synthetic data? Could the resulting model be significantly easier to control? How would it impact the downstream RL process? Could we imagine a setting where we don't need RL (or at least we'd be able to confidently use resulting models to automate alignment research)? I think many interesting projects could fall out of this work.
So, back to my main question: why aren't you doing research on making pre-training better for alignment? Is it because it's too expensive and doesn't seem like a low-hanging fruit? Or do you feel it isn't a plausible direction for aligning models?
We were wondering whether there are technical bottlenecks that, if addressed, would make this kind of research more feasible, allowing alignment researchers to better study how to guide the pretraining process in a way that benefits alignment. As in, would researchers be more inclined to do experiments in this direction if the entire pre-training code was handled and you'd just have to focus on whatever specific research question you have in mind? If we could access a large amount of compute (let's say, through government resources) to do things like data labeling/filtering and pre-training multiple models, would this kind of work be more interesting for you to pursue?
I think many alignment research directions have grown simply because they had low-hanging fruit that didn't require much compute (e.g., evals and mech interp). It seems we've implicitly left all of the high-compute projects for the AGI labs to figure out. But what if we weren't as bottlenecked on this anymore? It's possible to retrain GPT-2 1.5B for under $700 now (and the 125M model for $20). I think we can find ways to do useful experiments, but my guess is that the level of technical expertise required to get it done is a bit high, and alignment researchers would rather avoid these kinds of projects since they are currently high-effort.
I talk about other related projects here.
We're doing a hackathon with Apart Research on the 26th. I created a list of problem statements for people to brainstorm off of.
Pro-active insight extraction from new research
Reading papers can take a long time and is often not worthwhile. As a result, researchers might read too many papers or almost none. However, there are still valuable nuggets in papers and posts. The issue is finding them. So, how might we design an AI research assistant that proactively looks at new papers (and old ones) and shares valuable information with researchers in a naturally consumable way? Part of this work involves presenting individual researchers with what they would personally find valuable without overwhelming them with things they are less interested in.
How can we improve the LLM experience for researchers?
Many alignment researchers will use language models much less than they would like to because they don't know how to prompt the models, it takes time to create a valuable prompt, the model doesn't have enough context for their project, the model is not up-to-date on the latest techniques, etc. How might we make LLMs more useful for researchers by relieving them of those bottlenecks?
Simple experiments can be done quickly, but turning them into a full project can take a lot of time
One key bottleneck for alignment research is transitioning from an initial 24-hour simple experiment in a notebook to a set of complete experiments tested with different models, datasets, interventions, etc. How can we help researchers move through that second research phase much faster?
How might we use AI agents to automate alignment research?
As AI agents become more capable, we can use them to automate parts of alignment research. The paper "A Multimodal Automated Interpretability Agent" serves as an initial attempt at this. How might we use AI agents to help either speed up alignment research or unlock paths that were previously inaccessible?
How can we nudge researchers toward better objectives (agendas or short experiments) for their research?
Even if we make researchers highly efficient, it means nothing if they are not working on the right things. Choosing the right objectives (projects and next steps) through time can be the difference between 0x to 1x to +100x. How can we ensure that researchers are working on the most valuable things?
What can be done to accelerate implementation and iteration speed?
Implementation and iteration speed on the most informative experiments matter greatly. How can we nudge them to gain the most bits of information in the shortest time? This involves helping them work on the right agendas/projects and helping them break down their projects in ways that help them make progress faster (and avoiding ending up tunnel-visioned on the wrong project for months/years).
How can we connect all of the ideas in the field?
How can we integrate the open questions/projects in the field (with their critiques) in such a way that helps the researcher come up with well-grounded research directions faster? How can we aid them in choosing better directions and adjust throughout their research? This kind of work may eventually be a precursor to guiding AI agents to help us develop better ideas for alignment research.
Good to know! So I guess people were expecting that every company is running a “check if canary string is anywhere in our entire dataset and remove document if so” function?
If you just google the string, there are many instances of people sharing it verbatim. Would be good to do further testing to know if it was actually trained on the benchmark or learned through many other sources.
I sent some related project ideas to @RogerDearnaley via DMs, but figured I should share them here too, in case someone would like to give feedback or collaborate on one of them.
I think data is underrated among the alignment community (synthetic/transformed data even more). I have been thinking about it from the perspective of pre-training and post-training. My initial look into synthetic data was related to online learning and essentially controlling model behaviour. I was interested in papers like this one by Google, where they significantly reduce sycophancy in an LLM via 1k synthetically generated examples. Data shapes behaviour, and I think many people do not acknowledge this enough (which sometimes leads them to make confused conclusions about model behaviour).
In terms of specific research projects, my current ideas fall into these kinds of buckets:
Pre-training close to the basin of attraction for alignment
- How much can we improve "Pretraining Language Models with Human Preferences"? I'd like to transform the training data in various ways (as mentioned in your posts). For example, I could take fineweb and pre-train a GPT-2-sized model on the original dataset and on a transformed version. It's unclear so far which things I'd like to measure the most at that model size, though. A downstream experiment: is one model more likely to reward hack than the other? Does shard theory help us come up with useful experiments (pre-training with human feedback is almost like reinforcing behaviour and leveraging some form of shard theory)? Note that Google used a similar pre-training scheme for PaLM 2. (A rough sketch of the conditional-training setup is below this list.)
- How can the "basin of attraction for alignment" be mathematically formalized?
- Trying to understand the impact of systematic errors:
- Studying reward misspecification: do the reward labels have a systematic effect and bias in pushing the model? How much of the model's behaviour is determined by the data itself vs. the reward model's misspecification? My current reading of the literature on this is a bit unclear. However, there's a paper saying: "We present a novel observation about the behaviour of offline reinforcement learning (RL) algorithms: on many benchmark datasets, offline RL can produce well-performing and safe policies even when trained with "wrong" reward labels, such as those that are zero everywhere or are negatives of the true rewards."
- How do we design the training curriculum to significantly bias the model's pre-training close to the basin of attraction for alignment?
- Studying some form of iterative training where we have a synthetically trained model vs a normally trained model and then measure things like model drift. For example, is the model more likely to drift (in an online setting) in ways we wouldn't want it to if it is pre-trained on normal text, but the process is more safely guided through synthetic pre-training?
- Part of the alignment challenge (for example, the concern of scheming AIs) is that the order in which the model learns things might matter. For example, you'd want the model to internalize a solid world model of human values before it gains the situational awareness required to manipulate its training process (scheme). So, can we design a training curriculum for specific capabilities s.t. the model learns capabilities in an ideal sequence?
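As referenced in the first bullet above, here is a minimal sketch of the conditional-training variant from "Pretraining Language Models with Human Preferences" (prepend a good/bad control token to each pre-training document according to a document-level preference score; the token names, threshold, and example documents are illustrative):

```python
GOOD, BAD = "<|good|>", "<|bad|>"
THRESHOLD = 0.5  # illustrative cutoff on a document-level preference/reward score

def tag_document(text: str, preference_score: float) -> str:
    """Conditional training: prefix each pre-training document with a control token
    so the model learns the distinction; sample with the <|good|> prefix at inference."""
    token = GOOD if preference_score >= THRESHOLD else BAD
    return f"{token}{text}"

corpus = [
    ("The assistant politely refuses the harmful request...", 0.92),
    ("Toxic rant scraped from a forum...", 0.08),
]
tagged = [tag_document(text, score) for text, score in corpus]
```

Transformed-data experiments (e.g., on fineweb) would then compare a model trained on `tagged` documents against one trained on the raw corpus, on downstream measures like reward hacking or drift.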
Data attribution project ideas
- How to make this approach work in tandem with unlearning?
- Use data attribution methods to understand how specific data shapes model behaviour and use that information to reconstruct pre-training to shape model behaviour in the way we want. For example, can we side-step the need for unlearning? Can these data attribution methods augment unlearning to work better?
- As Roger said in his comment, we can try to manage the dataset to prevent WMD-dangerous capabilities and things like self-replication. It's quite possible that unlearning will not be enough.
- Another project would be to fine-tune on a dataset with and without the dangerous capabilities we don't want and use that as a benchmark for unlearning methods (and how easy it is to fine-tune the capability back into the model).
- Including other methods beyond data attribution (e.g. SAEs) to measure model evolution through training.
- Is it possible to better understand and predict emergence via data attribution?
- Studying model generalization via data attribution (doing similar things to the influence functions paper, but through time). Though the most interesting behaviour may only come at scales I wouldn't have the compute for.
- Would there be value in using an early checkpoint in training and then training on the synthetic data from that point forward? At which point in training does this make sense to do?
It's cool that you point to @Tomek Korbak because I was wondering if we could think of ways to extend his Pretraining Language Models with Human Preferences paper in ways that Roger mentions in his post.
Happy to chat!
Just a heads up, it's been 2 months!
Recent paper I thought was cool:
In-Run Data Shapley: Data attribution method efficient enough for pre-training data attribution.
Essentially, it can track how individual data points (or clusters) impact model performance across pre-training. You just need to develop a set of validation examples to continually check the model's performance on those examples during pre-training. Amazingly, you can do this over the course of a single training run; no need to require multiple pre-training runs like other data attribution methods have required.
Other methods, like influence functions, are too computationally expensive to run during pre-training and can only be run post-training.
So, here's why this might be interesting from an alignment perspective:
- You might be able to set up a bunch of validation examples to test specific behaviour in the models so that we are hyper-aware of which data points contribute the most to that behaviour. For example, self-awareness or self-preservation.
- Given that this is possible to run during pre-training, you might understand model behaviour at such a granular level that you can construct data mixtures/curriculums that push the model towards internalizing 'human values' much sooner than it develops behaviours or capabilities we wouldn't want. Or, you delay self-awareness and such much further along in the training process.
- In this @RogerDearnaley post, A "Bitter Lesson" Approach to Aligning AGI and ASI, Roger proposes training an AI on a synthetic dataset where all intelligences are motivated by the collective well-being of humanity. You are trying to bias the model to be as close to the basin of attraction for alignment as possible. In-Run Data Shapley could be used to construct such a dataset and guide the training process so that the training data best exemplifies the desired aligned behaviour.
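A minimal sketch of the underlying idea (a TracIn-style first-order gradient dot-product between a training example and a behaviour-probing validation set, accumulated during the run; this is a simplification, not the exact In-Run Data Shapley estimator):

```python
import torch

def first_order_contribution(model, loss_fn, train_example, val_batch, lr):
    """Approximate how much one training example's gradient step changes validation loss:
    delta L_val ~= -lr * <grad L(train example), grad L(validation batch)>.
    `loss_fn(model, batch)` and the data objects are supplied by the user."""
    params = [p for p in model.parameters() if p.requires_grad]

    train_loss = loss_fn(model, train_example)
    g_train = torch.autograd.grad(train_loss, params)

    val_loss = loss_fn(model, val_batch)
    g_val = torch.autograd.grad(val_loss, params)

    dot = sum((gt * gv).sum() for gt, gv in zip(g_train, g_val))
    return (-lr * dot).item()  # negative means this example helps the validation behaviour

# Accumulate these scores across training steps for validation sets that probe specific
# behaviours (e.g., situational awareness), then rank training data by influence.
```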
If you are interested in this kind of research, let me know! I'd love to brainstorm some potential projects and then apply for funding if there is something promising there.
Ok, totally; there's no specific claim about ASI. Will edit the wording.
Hey @Zac Hatfield-Dodds, I noticed you are looking for citations; these are the interview bits I came across (and here at 47:31).
It's possible I misunderstood him; please correct me if I did!
I like this kind of idea and have been thinking about it myself. It just makes total sense that all of the training data for the model should at least be passed through a model and augmented/transformed in some fashion, so that the next-generation training run is on data that has been meticulously curated by a model following the ideal set of values/constitution we'd want them to have. You give the 'bomb' example; I often used a "Mein Kampf" example, where you place that kind of data in the context of how we'd want an AI to interpret it rather than treating it as equal to any other piece of text.
The post reminds me of Beren's blog post: "Alignment in the Age of Synthetic Data."
This post also reminds me of the "Alignment Bitter Lesson" I've been ruminating on lately (because I was considering writing a short post on it):
If your alignment agenda doesn’t take into account growing model capabilities, it will be worthless.
Or Davidad's version:
Any alignment scheme that doesn’t have a good way to leverage increasing AI capabilities to automate most of the R&D required for creating the alignment tech will not be relevant.
Dario Amodei believes that LLMs/AIs can be aided to self-improve in a similar way to AlphaGo Zero (though LLMs/AIs will benefit from other things too, like scale), where the models can learn by themselves to gain significant capabilities.
The key for him is that Go has a set of rules that the AlphaGo model needs to abide by. These rules allow the model to become superhuman at Go with enough compute.
Dario essentially believes that to reach better capabilities, it will help to develop rules for all the domains we care about and that this will likely be possible for more real-world tasks (not just games like Go).
Therefore, I think the crux here is whether you think it is possible to develop rules for science (physics, chemistry, math, biology) and other domains s.t. the models can do this sort of self-play to become superhuman at each of the things we care about.
So far, we have examples like AlphaGeometry, which relies on our ability to generate many synthetic examples to help the model learn. This makes sense for the geometry use case, but how do we know if this kind of approach will work for the kinds of things we actually care about? For games and geometry, this seems possible, but what about developing a cure for Alzheimer's or coming up with novel scientific breakthroughs?
So, you've got some of the following issues to resolve:
- Success metrics
- Potentially much slower feedback loops
- Need real-world testing
That said, I think Dario is banking on:
- AIs will have a large enough world model that they can essentially set up 'rules' that provide enough signal to the model in domains other than games and 'special cases' like geometry. For example, they can run physics simulations of optimal materials informed by the latest research papers and use key metrics from the simulation as high-quality signals to reduce the amount of real-world feedback needed. Or, for code, having unit tests along with the goal.
- Most of the things we care about (like writing code) will be able to go beyond superhuman, which will then lift up other domains that wouldn't be able to become superhuman without it.
- Science has been bottlenecked by slow humans, increasing complexity, and bad coordination; AIs will be able to resolve these issues.
- Even if you can't generate novel breakthrough synthetic data, you can use synthetic data to nudge your model along the path to making breakthroughs.
Thoughts?
We haven't published this work yet (which is why I'm only writing a comment), but @Quintin Pope and I are hoping to make progress on this by comparing model M to model M', where M' = intervention(M), instead of only relying on auditing M'.
Note that intervention can mean anything, e.g., continued pre-training, RLHF, activation steering, and model editing.
Also, I haven't gone through the whole post yet, but so far I'm curious how "Mechanistically Eliciting Latent Behaviors in Language Models" future work will evolve when it comes to red-teaming and backdoor detection.
Alignment Researcher Assistant update.
Hey everyone, my name is Jacques, I'm an independent technical alignment researcher, primarily focused on evaluations, interpretability, and scalable oversight (more on my alignment research soon!). I'm now focusing more of my attention on building an Alignment Research Assistant (I've been focusing on my alignment research for 95% of my time in the past year). I'm looking for people who would like to contribute to the project. This project will be private unless I say otherwise (though I'm listing some tasks); I understand the dual-use nature and most criticism against this kind of work.
How you can help:
- Provide feedback on what features you think would be amazing in your workflow to produce high-quality research more efficiently.
- Volunteer as a beta-tester for the assistant.
- Contribute to one of the tasks below. (Send me a DM, and I'll give you access to the private Discord to work on the project.)
- Funding to hire full-time developers to build the features.
Here's the vision for this project:
How might we build an AI system that augments researchers to get us 5x or 10x productivity for the field as a whole?
The system is designed with two main mindsets:
- Efficiency: What kinds of tasks do alignment researchers do, and how can we make them faster and more efficient?
- Objective: Even if we make researchers highly efficient, it means nothing if they are not working on the right things. How can we ensure that researchers are working on the most valuable things? How can we nudge them to gain the most bits of information in the shortest time? This involves helping them work on the right agendas/projects and helping them break down their projects in ways that help them make progress faster (and avoiding ending up tunnel-visioned on the wrong project for months/years).
As of now, the project will focus on building an extension on top of VSCode to make it the ultimate research tool for alignment researchers. VSCode is ideal because researchers are already coding in it, and it's easy to build on top of. It prevents the context-switching a web app would cause. I want the entire workflow to feel natural inside of VSCode. In general, I think this will make things easier to build on top of and to automate parts of research over time.
Side note: I helped build the Alignment Research Dataset ~2 years ago (here's the extended project). It is now being continually updated, and SQL and vector databases (which will interface with the assistant) are also being used.
If you are interested in potentially helping out (or know someone who might be!), send me a DM with a bit of your background and why you'd like to help out. To keep things focused, I may or may not accept.
I'm also collaborating with different groups (Apart Research, AE Studio, and more). In 2-3 months, I want to get it to a place where I know whether this is useful for other researchers and if we should apply for additional funding to turn it into a serious project.
As an update to the Alignment Research Assistant I'm building, here is a set of shovel-ready tasks I would like people to contribute to (please DM if you'd like to contribute!). These tasks are the ones that are easier to articulate and pretty self-contained:
Core Features
1. Setup the Continue extension for research: https://www.continue.dev/
- Design prompts in Continue that are suitable for a variety of alignment research tasks and make it easy to switch between these prompts
- Figure out how to scaffold LLMs with Continue (instead of just prompting one LLM with additional context)
- It can include agents, search, and more
- Test out models to quickly help with paper writing
2. Data sourcing and management
- Integrate with the Alignment Research Dataset (pulling from either the SQL database or Pinecone vector database): https://github.com/StampyAI/alignment-research-dataset
- Integrate with other apps (Google Docs, Obsidian, Roam Research, Twitter, LessWrong)
- Make it easy to view and edit long prompts for project context
3. Extract answers to questions across multiple papers/posts (feeds into Continue)
- Develop high-quality chunking and scaffolding techniques
- Implement multi-step interaction between researcher and LLM
4. Design Autoprompts for alignment research
- Creates lengthy, high-quality prompts for researchers that get better responses from LLMs
5. Simulated Paper Reviewer
- Fine-tune or prompt LLM to behave like an academic reviewer
- Use OpenReview data for training
6. Jargon and Prerequisite Explainer
- Design a sidebar feature to extract and explain important jargon
- Could maybe integrate with some interface similar to https://delve.a9.io/
7. Setup automated "suggestion-LLM"
- An LLM periodically looks through the project you are working on and tries to suggest *actually useful* things in the side-chat. It will be a delicate balance to make sure it doesn't share too much and cause a loss of focus. This could be customized per researcher, with an option to only give automated suggestions after a research session.
8. Figure out if we can get a usable browser inside of VSCode (tried quickly with the Edge extension but couldn't sign into the Claude chat website)
- Could make use of new features other companies build (like Anthropic's Artifact feature), but inside of VSCode to prevent context-switching in an actual browser
9. "Alignment Research Codebase" integration (can add as Continue backend)
- Create an easily insertable set of repeatable code that researchers can quickly add to their project or LLM context
- This includes code for Multi-GPU stuff, best practices for codebase, and more
- Should make it easy to populate a new codebase
- Pro-actively gives suggestions to improve the code
- Generally makes common code implementation much faster
10. Notebook to high-quality codebase
- Can go into more detail via DMs.
11. Adding capability papers to the Alignment Research Dataset
- We didn't do this initially to reduce exfohazards. The purpose of adding capability papers (and all the new alignment papers) is to improve the assistant.
- We will not be open-sourcing this part of the work; this part of the dataset will be used strictly by the vetted alignment researchers using the assistant.
Specialized tooling (outside of VSCode)
Bulk fast content extraction
- Create an extension to extract content from multiple tabs or papers
- Simplify the process of feeding content to the VSCode backend for future use
Personalized Research Newsletter
- Create a tool that extracts relevant information for researchers (papers, posts, other sources)
- Generate personalized newsletters based on individual interests (open questions and research they care about)
- Sends pro-active notification in VSCode and Email
Discord Bot for Project Proposals
- Suggest relevant papers/posts/repos based on project proposals
- Integrate with Apart Research Hackathons
I get that adding an interpreter feels like a low bar for neurosymbolic (and that scale helps considerably with making interpreters/tools useful in the first place), but I'd be curious to know what you have in mind when you hear that word.
I'd like to provide some additional context:
- LLM + interpreter is considered neurosymbolic rather than just 'scale.' A weak model couldn't do it with an interpreter, but this was François' point: You need a good DL model to guide the search program.
- For this reason, I think it's unfair if people try to dunk on François with something like, "See, scale is all you need."
- François agrees with me; he liked a tweet I shared saying the above, and said: "Obviously combining a code interpreter (which is a symbolic system of enormous complexity) with a LLM is neurosymbolic. AlphaGo was neurosymbolic as well. These are universally accepted definitions." You can disagree with him on what should be considered neurosymbolic, but I think it's important for us to know what we all mean here even if we have been using the word differently.
- He says more here:
If you are generating lots of programs, checking each one with a symbolic checker (e.g. running the actual code of the program and verifying the output), and selecting those that work, you are doing program synthesis (aka "discrete program search").
The main issue with program synthesis is combinatorial explosion: the "space of all programs" grows combinatorially with the number of available operators and the size of the program.
The best solution to fight combinatorial explosion is to leverage *intuition* over the structure of program space, provided by a deep learning model. For instance, you can use a LLM to sample a program, or to make branching decision when modifying an existing program.
Deep learning models are inexact and need to be complemented with discrete search and symbolic checking, but they provide a fast way to point to the "right" area of program space. They help you navigate program space, so that your discrete search process has less work to do and becomes tractable.
Here's a talk I did at a workshop in Davos in March that goes into these ideas in a bit more detail.
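A minimal sketch of the sample-and-check loop he describes (my own illustration, not code from François; the LLM proposal function is a placeholder, and in an actual ARC attempt like Ryan's, the proposal step is a carefully prompted GPT-4o call drawing thousands of samples):

```python
from typing import Callable

def propose_programs(task_description: str, n: int) -> list[str]:
    """Placeholder for the deep-learning side: an LLM samples candidate program strings."""
    raise NotImplementedError("plug in an LLM sampling call here")

def synthesize(task_description: str, examples: list[tuple[object, object]], n: int = 100):
    """Discrete program search with a symbolic checker: keep candidates that
    reproduce every input->output example exactly."""
    solutions = []
    for source in propose_programs(task_description, n):
        namespace: dict = {}
        try:
            exec(source, namespace)            # symbolic side: actually run the code
            fn: Callable = namespace["solve"]  # convention: each candidate defines solve(x)
            if all(fn(x) == y for x, y in examples):
                solutions.append(source)
        except Exception:
            continue  # ill-formed candidates are simply discarded
    return solutions
```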
- GPT-4o was trained on JSONs of the public datasets (which is what Ryan tested on). Unclear how much this could impact performance on the public train and test sets. Would be great to see the performance on the private test set.
- I do think the limited amount of compute you can use to win the competition should be kept in mind. Perhaps enough compute solves the task and, in practice, this is mostly all that matters (even if a model + scaffolding can't solve ARC-AGI under the present rules).
- Given that this setup can't be used for the actual competition, it may be that SoTA (neurosymbolic) models can get a high score a year or more before the models that can enter the competition are allowed.
- Though, you'll probably be able to get a model capable of doing this on a P100 if you first solve it with a larger model and then have the weaker model leverage that.
OpenAI CEO Sam Altman has privately said the company could become a benefit corporation akin to rivals Anthropic and xAI.
"Sam Altman recently told some shareholders that OAI is considering changing its governance structure to a for-profit business that OAI's nonprofit board doesn't control. [...] could open the door to public offering of OAI; may give Altman an opportunity to take a stake in OAI."
Yes, but this is similar to typical startups; it's a calculated bet you are making. You expect that some of the people who try this will fail, but investors hope one of them will be a unicorn.
We had a similar thought:
But yeah, my initial comment was about how to take advantage of nationalization if it does happen in the way Leopold described/implied.