OpenAI's Sora is an agent 2024-02-16T07:35:52.171Z
Is Metaethics Unnecessary Given Intent-Aligned AI? 2023-09-02T09:48:54.934Z
Preparing for AI-assisted alignment research: we need data! 2023-01-17T03:28:29.778Z
The Rational Utilitarian Love Movement (A Historical Retrospective) 2022-11-03T07:11:28.679Z


Comment by CBiddulph (caleb-biddulph) on OpenAI's Sora is an agent · 2024-02-16T16:30:12.351Z · LW · GW

After reading the Wikipedia article for "Complete (complexity)," I realize I might have misinterpreted what "complete" technically means.

What I was trying to say is "given Sora, you can 'easily' turn it into an agent" in the same way that "given a SAT solver, you can 'easily' turn it into a solver for another NP-complete problem."

I changed the title from "OpenAI's Sora is agent-complete" to "OpenAI's Sora is an agent," which I think is less misleading. The most technically-correct title might be "OpenAI's Sora can be transformed into an agent without additional training."

Comment by CBiddulph (caleb-biddulph) on OpenAI's Sora is an agent · 2024-02-16T09:01:37.708Z · LW · GW

That sounds more like "AGI-complete" to me. By "agent-complete" I meant that Sora can probably act as an intelligent agent in many non-trivial settings, which is pretty surprising for a video generator!

Comment by CBiddulph (caleb-biddulph) on A Shutdown Problem Proposal · 2024-01-22T21:06:23.703Z · LW · GW

First and most important, there’s the choice of “default action”. We probably want the default action to be not-too-bad by the human designers’ values; the obvious choice is a “do nothing” action. But then, in order for the AI to do anything at all, the “shutdown” utility function must somehow be able to do better than the “do nothing” action. Otherwise, that subagent would just always veto and be quite happy doing nothing.

Can we solve this problem by setting the default action to "do nothing," then giving the agent an extra action to "do nothing and give the shutdown subagent +1 reward?"

Comment by CBiddulph (caleb-biddulph) on The impossible problem of due process · 2024-01-16T20:37:33.669Z · LW · GW

I think the implication was that "high-status men" wouldn't want to hang out with "low-status men" who awkwardly ask out women.

Comment by CBiddulph (caleb-biddulph) on Project ideas: Epistemics · 2024-01-07T07:58:29.133Z · LW · GW

On the topic of AI for forecasting: just a few days ago, I created a challenge on Manifold Markets to incentivize people to build Manifold bots that use LLMs to forecast diverse one-month questions accurately, with improving epistemics as the ultimate goal.

You can read the rules and bet on the main market here:

If anyone's interested in creating a bot, please join the Discord server to share ideas and discuss!

Comment by CBiddulph (caleb-biddulph) on An explanation for every token: using an LLM to sample another LLM · 2023-10-12T08:11:33.410Z · LW · GW

Thanks for the post! I had a similar idea which might let you maintain (or improve) accuracy while still getting the benefit of explanations - basically fine-tune the model on explanations that make it most likely to output the correct token.

For instance, you might have it fill in the text between <think> and </think> on a bunch of text examples like this: "The capital of the country northeast of Spain is <think> The country northeast of Spain is France, and its capital is Paris </think> Paris".

You make the LLM come up with, say, 10 explanations each time, and choose the one that maximizes the logprob of the correct token ("Paris") immediately after </think>. Then fine-tune it to complete prompts like "The capital of the country northeast of Spain is <think>" with completions like "The country northeast of Spain is France, and its capital is Paris </think> Paris". Then, generate more completions with the fine-tuned model and fine-tune it yet again with the best completions. Rinse and repeat.
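The selection step above can be sketched as a best-of-N loop. This is a toy illustration with stand-in functions (`sample_explanation` and `logprob_of_answer` are hypothetical placeholders; a real version would call an actual LLM for both):

```python
import math
import random

# Toy stand-in for sampling an explanation from the LLM; a real version
# would sample a completion of the prompt ending in "<think>".
def sample_explanation(prompt: str, rng: random.Random) -> str:
    quality = rng.random()
    return f"explanation(quality={quality:.2f})"

# Toy stand-in for scoring: the logprob the model assigns to the correct
# token (e.g. "Paris") immediately after "</think>", given the explanation.
def logprob_of_answer(prompt: str, explanation: str, answer: str) -> float:
    quality = float(explanation.split("=")[1].rstrip(")"))
    return math.log(0.1 + 0.9 * quality)

def best_of_n(prompt: str, answer: str, n: int = 10, seed: int = 0) -> str:
    """Sample n candidate explanations and keep the one that maximizes the
    logprob of the correct answer; the winner becomes a fine-tuning target."""
    rng = random.Random(seed)
    candidates = [sample_explanation(prompt, rng) for _ in range(n)]
    return max(candidates, key=lambda e: logprob_of_answer(prompt, e, answer))

best = best_of_n("The capital of the country northeast of Spain is <think>", "Paris")
```

The fine-tuning step would then train on the prompt plus the winning explanation plus "</think> Paris", and the whole loop repeats with the updated model.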

Eventually, the LLM should relearn how to guess the next token (at worst, it can just ignore everything in its "think" stage and get identical performance, so it should be able to learn this easily enough).

Alternatively, instead of training to maximize the correct token's logprob, you could select explanations that minimize the perplexity over many subsequent tokens, say 100. This will be more useful in practice, since we'll probably mostly want explanations of the LLM's thinking across its entire response rather than over the very short term.

This helps because the LLM will actually have some "skin in the game" instead of just blathering on arbitrarily; the disadvantage is that it could eventually learn to optimize its thinking by encoding additional information in its explanation through steganography. Still, this certainly seems like an improvement on inscrutable matrices.

You could try to enforce that the LLM only uses the semantic content of its explanation by making another LLM rephrase it, translating it to another language, etc. This kind of approach wouldn't stop an adversary actively trying to hide its thoughts from you, but it seems like it would prevent most issues for now.

Comment by CBiddulph (caleb-biddulph) on Stampy's AI Safety Info soft launch · 2023-10-06T04:42:07.653Z · LW · GW

Thanks for doing this - it looks like it'll be very helpful for beginners in AI safety, and the content looks great!

I don't know if this will be addressed in your UI redesign, but I find the UI very counterintuitive. The main problem is that when I open and then close a tab, I expect every sub-tab to collapse and return to the previous state. Instead, the more tabs I open, the more cluttered the space gets, and there's no way to undo it unless I remove the back part of the URL and reload, or click the Stampy logo.

In addition, it's impossible to tell which tab was originally nested under which parent tab, which makes it much more difficult to navigate. And confusingly, sometimes there are "random" tabs that don't necessarily follow directly from their parent tabs (took me a while to figure this out). On a typical webpage, I could imagine thinking "this subtopic is really interesting; I'm going to try to read every tab under it until I'm done," but these design choices are pretty demotivating for that.

I don't have a precise solution in mind, but maybe it would help to color-code different kinds of tabs (maybe a color each for root tabs, leaf tabs, non-root branching tabs, and "random" tabs). You could also use more than two visual layers of nesting - if you're worried about tabs getting narrower and narrower, maybe you could animate the tab expanding to full width and then sliding back into place when it's closed. Currently an "unread" tab is represented by a slight horizontal offset, but you could come up with another visual cue for that. I guess doing lots of UX interviews and A/B testing will be more helpful than anything I could say here.

Comment by CBiddulph (caleb-biddulph) on Petrov Day Retrospective, 2023 (re: the most important virtue of Petrov Day & unilaterally promoting it) · 2023-09-29T03:53:53.550Z · LW · GW

Came here to say this - I also clicked the link because I wanted to see what would happen. I wouldn't have done it if I hadn't already assumed it was a social experiment.

Comment by CBiddulph (caleb-biddulph) on Some reasons why I frequently prefer communicating via text · 2023-09-19T22:49:08.047Z · LW · GW

No, it makes sense to me. I have no idea why you were downvoted.

Comment by CBiddulph (caleb-biddulph) on Who Has the Best Food? · 2023-09-06T18:26:15.695Z · LW · GW

I'd be interested in the full post!

Comment by CBiddulph (caleb-biddulph) on Who Has the Best Food? · 2023-09-05T22:56:55.303Z · LW · GW

You cannot go in with zero information, but if you know how to read Google Maps and are willing to consider several options, you can do very well overall, although many great places are still easy to miss.

How do you read Google Maps, beyond picking something with a high average star rating and (secondarily) large number of reviews? Since the vast majority of customers don't leave reviews, it seems like the star rating should be biased, but I'm not sure in what way or how to adjust for it.

Comment by CBiddulph (caleb-biddulph) on Meta Questions about Metaphilosophy · 2023-09-02T09:51:49.415Z · LW · GW

Thanks for the post! I just published a top-level post responding to it:

I'd appreciate your feedback!

Comment by CBiddulph (caleb-biddulph) on Meta Questions about Metaphilosophy · 2023-09-01T23:22:52.296Z · LW · GW

Can't all of these concerns be reduced to a subset of the intent-alignment problem? If I tell the AI to "maximize ethical goodness" and it instead decides to "implement plans that sound maximally good to the user" or "maximize my current guess of what the user meant by ethical goodness according to my possibly-bad philosophy," that is different from what I intended, and thus the AI is unaligned.

If the AI starts off with some bad philosophy ideas just because it's relatively unskilled in philosophy vs science, we can expect that 1) it will try very hard to get better at philosophy so that it can understand "what did the user mean by 'maximize ethical goodness,'" and 2) it will try to preserve option value in the meantime so not much will be lost if its first guess was wrong. This assumes some base level of competence on the AI's part, but if it can do groundbreaking science research, surely it can think of those two things (or we just tell it).

Comment by CBiddulph (caleb-biddulph) on Secret Cosmos: Introduction · 2023-07-20T18:05:07.337Z · LW · GW

I'm not Eliezer, but thanks for taking the time to read and engage with the post!

The best explanation I can give for the downvotes is that we have a limited amount of space on the front page of the site, and we as a community want to make sure people see content that will be most useful to them. Unfortunately, we simply don't have enough time to engage closely with every new user on the site, addressing every objection and critique. If we tried, it would get difficult for long-time users to hear each other over the stampede of curious newcomers drawn here recently from our AI posts :) By the way, I haven't downvoted your post; I don't think there's any point once you've already gotten this many, and I'd rather give you a more positive impression of the community than add my vote to the pile.

I'm sure you presented your ideas with the best of intentions, but it's hard to tell which parts of your argument have merit behind them. In particular, you've brought up many arguments that have been partially addressed in popular LessWrong posts that most users have already read. Your point about certainty is just one example.

Believe me, LessWrong LOVES thinking about all the ways we could be wrong (maybe we do it a little too much sometimes). We just have a pretty idiosyncratic way we like to frame things. If someone comes along with ideas for how to improve our rationality, they're much more likely to be received well if they signal that they're familiar with the entire "LessWrong framework of rationality," then explain which parts of it they reject and why.

The common refrain for users who don't know this framework is to "read the Sequences." This is just a series of blog posts written by Eliezer in the early days of LessWrong. In the Sequences, Eliezer wrote a lot about consciousness, AI, and other topics you brought up - I think you'd find them quite interesting, even if you disagree with them! If you can make your way through them, I think you'll more than deserve the right to post again with new critiques of LessWrong-brand rationality - I look forward to reading them!

Comment by CBiddulph (caleb-biddulph) on Secret Cosmos: Introduction · 2023-07-19T17:00:07.854Z · LW · GW

Try reading this post?

Comment by CBiddulph (caleb-biddulph) on [Linkpost] Introducing Superalignment · 2023-07-07T00:43:11.294Z · LW · GW

Yeah, I'm mostly thinking about potential hires.

Comment by CBiddulph (caleb-biddulph) on [Linkpost] Introducing Superalignment · 2023-07-07T00:31:55.788Z · LW · GW

I see what you mean. I was thinking "labs try their hardest to demonstrate that they are working to align superintelligent AI, because they'll look less responsible than their competitors if they don't."

I don't think keeping "superalignment" techniques secret would generally make sense right now, since it's in everyone's best interests that the first superintelligence isn't misaligned (I'm not really thinking about "alignment" that also improves present-day capabilities, like RLHF).

As for your second point, I think that for an AI lab that wants to improve PR, the important thing is showing "we're helping the alignment community by investing significant resources into solving this problem," not "our techniques are better than our competitors'." The dynamic you're talking about might have some negative effect, but I personally think the positive effects of competition would vastly outweigh it (even though many alignment-focused commitments from AI labs will probably turn out to be not-very-helpful signaling).

Comment by CBiddulph (caleb-biddulph) on [Linkpost] Introducing Superalignment · 2023-07-06T20:57:08.468Z · LW · GW

Competition between labs on capabilities is bad; competition between labs on alignment would be fantastic.

Comment by CBiddulph (caleb-biddulph) on Conditional Prediction with Zero-Sum Training Solves Self-Fulfilling Prophecies · 2023-05-28T20:48:23.780Z · LW · GW

This post seems interesting and promising, thanks for writing it!

The most predictable way zero-sum competition can fail is if one of the models is consistently better than the other at predicting.

I think this could be straightforwardly solved by not training two different models at all, but by giving two instances of the same model inputs that have each been slightly perturbed at random. Then, neither instance of the model would ever have a predictable advantage over the other.

For instance, in your movie recommendation example, let's say the model takes a list of 1000 user movie ratings as input. We can generate a perturbed input by selecting 10 of those ratings at random and modifying them, say by changing a 4-star rating to a 5-star rating. We do this twice to get two different inputs, feed them into the model, and train based on the outputs as you described.
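A minimal sketch of that perturbation step (the helper and the specific nudging rule are my own illustration, not from the original post):

```python
import random

def perturb_ratings(ratings, k=10, rng=None):
    """Return a copy of the ratings list with k randomly chosen entries
    nudged, e.g. a 4-star rating bumped to 5 (capped at 5 stars)."""
    rng = rng or random.Random()
    out = list(ratings)
    for i in rng.sample(range(len(out)), k):
        out[i] = min(5, out[i] + 1)
    return out

gen = random.Random(0)
base = [gen.randint(1, 5) for _ in range(1000)]  # 1000 user movie ratings

# Two independently perturbed copies of the same input, fed to two
# instances of the same model; their outputs are then scored against
# each other zero-sum, as described in the post.
input_a = perturb_ratings(base, k=10, rng=random.Random(1))
input_b = perturb_ratings(base, k=10, rng=random.Random(2))
```

Because both instances share weights and neither sees the unperturbed input, any advantage one has over the other on a given example is down to the random perturbation rather than a systematic skill gap.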

Another very similar solution would be to randomly perturb the internal activations of each neural network during training.

Does this seem right?

Comment by CBiddulph (caleb-biddulph) on Prizes for matrix completion problems · 2023-05-06T01:49:52.554Z · LW · GW

Love to see a well-defined, open mathematical problem whose solution could help make some progress on AI alignment! It's like a little taste of not being a pre-paradigmatic field. Maybe someday, we'll have lots of problems like this that can engage the broader math/CS community, that don't involve so much vague speculation and philosophy :)

Comment by CBiddulph (caleb-biddulph) on GPT-4 is bad at strategic thinking · 2023-03-28T04:50:07.856Z · LW · GW

This is also basically an idea I had - I actually made a system design and started coding it, but haven't made much progress due to lack of motivation... Seems like it should work, though

Comment by CBiddulph (caleb-biddulph) on Microsoft Research Paper Claims Sparks of Artificial Intelligence in GPT-4 · 2023-03-26T02:44:37.877Z · LW · GW

The title and the link in the first paragraph should read "Sparks of Artificial General Intelligence"

Comment by CBiddulph (caleb-biddulph) on SolidGoldMagikarp (plus, prompt generation) · 2023-02-13T21:41:18.280Z · LW · GW

I assumed it was primarily because Eliezer "strongly approved" of it, after being overwhelmingly pessimistic about pretty much everything for so long.

I didn't realize it got popular elsewhere, that makes sense though and could help explain the crazy number of upvotes. Would make me feel better about the community's epistemic health if the explanation isn't that we're just overweighting one person's views.

Comment by CBiddulph (caleb-biddulph) on SolidGoldMagikarp (plus, prompt generation) · 2023-02-06T08:38:24.250Z · LW · GW

This looks like exciting work! The anomalous tokens are cool, but I'm even more interested in the prompt generation.

Adversarial example generation is a clear use case I can see for this. For instance, this would make it easy to find prompts that will result in violent completions for Redwood's violence-free LM.

It would also be interesting to see if there are some generalizable insights about prompt engineering to be gleaned here. Say, we give GPT a bunch of high-quality literature and notice that the generated prompts contain phrases like "excerpt from a New York Times bestseller". (Is this what you meant by "prompt search"?)

I'd be curious to hear how you think we could use this for eliciting latent knowledge.

I'm guessing it could be useful to try to make the generated prompt as realistic (i.e. close to the true distribution) as possible. For instance, if we were trying to prevent a model from saying offensive things in production, we'd want to start by finding prompts that users might realistically use rather than crazy edge cases like "StreamerBot". Fine-tuning the model to try to fool a discriminator a la GAN comes to mind, though there may be reasons this particular approach would fail.

Sounds like you might be planning to update this post once you have more results about prompt generation? I think a separate post would be better, for increased visibility, and also since the content would be pretty different from anomalous tokens (the main focus of this post).

Comment by CBiddulph (caleb-biddulph) on Experiment Idea: RL Agents Evading Learned Shutdownability · 2023-01-17T18:33:14.376Z · LW · GW

This was interesting to read, and I agree that this experiment should be done!

Speaking as another person who's never really done anything substantial with ML, I do feel like this idea would be pretty feasible by a beginner with just a little experience under their belt. One of the first things that gets recommended to new researchers is "go reimplement an old paper," and it seems like this wouldn't require anything new as far as ML techniques go. If you want to upskill in ML, I'd say get a tiny bit of advice from someone with more experience, then go for it! (On the other hand, if the OP already knows they want to go into software engineering, AI policy, professional lacrosse, etc. I think someone else who wants to get ML experience should try this out!)

The mechanistic interpretability parts seem a bit harder to me, but Neel Nanda has been making some didactic posts that could get you started. (These posts might all be for transformers, but as you mentioned, I think your idea could be adapted to something a transformer could do. E.g. on each step the model gets a bunch of tokens representing the gridworld state; a token representing "what it hears," which remains a constant unique token when it has earbuds in; and it has to output a token representing an action.)

Not sure what the best choice of model would be. I bet you can look at other AI safety gridworld papers and just do what they did (or even reuse their code). If you use transformers, Neel has a Python library (called EasyTransformer, I think) that you can just pick up and use. As far as I know it doesn't have support for RL, but you can probably find a simple paper or code that does RL for transformers.

Comment by CBiddulph (caleb-biddulph) on How it feels to have your mind hacked by an AI · 2023-01-13T04:50:16.690Z · LW · GW

Strong upvote + agree. I've been thinking this myself recently. While something like the classic paperclip story seems likely enough to me, I think there's even more justification for the (less dramatic) idea that AI will drive the world crazy by flailing around in ways that humans find highly appealing.

LLMs aren't good enough to do any major damage right now, but I don't think it would take that much more intelligence to get a lot of people addicted or convinced of weird things, even for AI that doesn't have a "goal" as such. This might not directly cause the end of the world, but it could accelerate it.

The worst part is that AI safety researchers are probably just the kind of people to get addicted to AI faster than everyone else. Like, not only do they tend to be socially awkward and everything blaked mentioned, they're also just really interested in AI.

As much as it pains me to say it, I think it would be better if any AI safety people who want to continue being productive just swore off recreational AI use right now.

Comment by CBiddulph (caleb-biddulph) on Intent alignment should not be the goal for AGI x-risk reduction · 2022-10-26T06:30:46.367Z · LW · GW

I think the danger of intent-alignment without societal-alignment is pretty important to consider, although I'm not sure how important it will be in practice. Previously, I was considering writing a post about a similar topic - something about intent-level alignment being insufficient because we hadn't worked out metaethical issues like how to stably combine multiple people's moral preferences and so on. I'm not so sure about this now, because of an argument along the lines of "given that it's aligned with a thoughtful, altruistically motivated team, an intent-aligned AGI would be able to help scale their philosophical thinking so that they reach the same conclusions they would have come to after a much longer period of reflection, and then the AGI can work towards implementing that theory of metaethics."

Here's a recent post that covers at least some of these concerns (although it focuses more on the scenario where one EA-aligned group develops an AGI that takes control of the future):

I could see the concerns in this post being especially important if things work out such that a full solution to intent-alignment becomes widely available (i.e. easily usable by corporations and potential bad actors) and takeoff is slow enough for these non-altruistic entities to develop powerful AGIs pursuing their own ends. This may be a compelling argument for withholding a solution to intent-alignment from the world if one is discovered.

Comment by CBiddulph (caleb-biddulph) on Contra Hofstadter on GPT-3 Nonsense · 2022-06-17T16:07:16.553Z · LW · GW

Please let us know if they respond!

Comment by CBiddulph (caleb-biddulph) on Prizes for ELK proposals · 2022-01-13T03:31:19.493Z · LW · GW

It seems fair to call it a subcomponent, yeah

Comment by CBiddulph (caleb-biddulph) on Prizes for ELK proposals · 2022-01-12T22:08:34.377Z · LW · GW

Predictor: The primary AI tasked with protecting the diamond. The predictor sees a video feed of the vault, predicts what actions are necessary to protect the diamond and how those actions will play out (for example, activating a trap door to eliminate a robber trying to steal the diamond), and then generates a video showing precisely what will happen.

I'd like to try making a correction here, though I might make some mistakes too.

The predictor is different from the AI that protects the diamond and doesn't try to "choose" actions in order to accomplish any particular goal. Rather, it takes a starting video and a set of actions as input, then returns a prediction of what the ending video would be if those actions were carried out.

An agent could use this predictor to choose a set of actions that leads to videos that a human approves of, then carry out these plans. It could use some kind of search policy, like Monte-Carlo Tree Search, or even just enumerate through every possible action and figure out which one seems to be the best. For the purposes of this problem, we don't really care; we just care that we have a predictor that uses some model of the world (which might take the form of a Bayes net) to guess what the output video will be. Then, the reporter can use the model to answer any questions asked by the human.
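Schematically, the predictor-plus-search setup might look like the following toy sketch (the world model, the human evaluation, and all names here are stand-ins I've invented for illustration; the real predictor is a learned model):

```python
from itertools import product

def predictor(video, actions):
    """Stand-in world model: maps a (starting video, action sequence) pair
    to a predicted ending video. Here a 'video' is just a tuple of events."""
    state = tuple(video)
    for a in actions:
        state = state + (a,)
    return state

def human_approves(video):
    # Stand-in for the human checking the predicted video: the diamond
    # looks safe if the trap door was activated at some point.
    return "activate_trapdoor" in video

def choose_actions(start_video, action_space, horizon=2):
    """Brute-force planner: enumerate every possible action sequence and
    return one whose predicted outcome the human approves of."""
    for seq in product(action_space, repeat=horizon):
        if human_approves(predictor(start_video, seq)):
            return seq
    return None

plan = choose_actions(("robber_enters",), ("wait", "activate_trapdoor"))
```

The point of the ELK problem is that `human_approves` only sees the predicted video, so a plan can pass this check while the predictor's internal model "knows" the diamond is gone - which is why we want a reporter that answers questions directly from that internal model.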

Comment by CBiddulph (caleb-biddulph) on Prizes for ELK proposals · 2022-01-12T20:06:02.499Z · LW · GW

I was talking about ELK in a group, and the working example of the SmartVault and the robber ended up being a point of confusion for us. Intuitively, it seems like the robber is an external, adversarial agent who tries to get around the SmartVault. However, what we probably care about in practice would be how a human could be fooled by an AI - not by some other adversary. Furthermore, it seems that whether the robber decides to cover up his theft of the diamond by putting up a screen depends solely on the actions of the AI. Does this imply that the robber is "in cahoots" with the AI in this situation (i.e. the AI projects a video onto the wall instructing the robber to put up a screen)? This seems a bit strange and complicated.

Instead, we might consider the situation in which the AI controls a SmartFabricator, which we want to use to arrange carbon atoms into diamonds. We might then imagine that it instead fabricates a screen to put in front of the camera, or makes a fake diamond. This wouldn't require the existence of an external "robber" agent. Does the SmartVault scenario have helpful aspects that the SmartFabricator example lacks?