Yeah, it's possible that CoT training unlocks reward hacking in a way that wasn't previously possible. This could be mitigated at least somewhat by continuing to train the reward function online, and letting the reward function use CoT too (like OpenAI's "deliberative alignment" but more general).
I think a better analogy than martial arts would be writing. I don't have a lot of experience with writing fiction, so I wouldn't be very good at it, but I do have a decent ability to tell good fiction from bad fiction. If I practiced writing fiction for a year, I think I'd be a lot better at it by the end, even if I never showed it to anyone else to critique. Generally, evaluation is easier than generation.
Martial arts is different because it involves putting your body in OOD situations that you are probably pretty poor at evaluating, whereas "looking at a page of fiction" is a situation that I (and LLMs) are much more familiar with.
There's been a widespread assumption that training reasoning models like o1 or r1 can only yield improvements on tasks with an objective metric of correctness, like math or coding. See this essay, for example, which seems to take as a given that the only way to improve LLM performance on fuzzy tasks like creative writing or business advice is to train larger models.
This assumption confused me, because we already know how to train models to optimize for subjective human preferences. We figured out a long time ago that we can train a reward model to emulate human feedback and use RLHF to get a model that optimizes this reward. AI labs could just plug this into the reward for their reasoning models, reinforcing the reasoning traces leading to responses that obtain higher reward. This seemed to me like a really obvious next step.
Well, it turns out that DeepSeek r1 actually does this. From their paper:
2.3.4. Reinforcement Learning for all Scenarios
To further align the model with human preferences, we implement a secondary reinforcement learning stage aimed at improving the model’s helpfulness and harmlessness while simultaneously refining its reasoning capabilities. Specifically, we train the model using a combination of reward signals and diverse prompt distributions. For reasoning data, we adhere to the methodology outlined in DeepSeek-R1-Zero, which utilizes rule-based rewards to guide the learning process in math, code, and logical reasoning domains. For general data, we resort to reward models to capture human preferences in complex and nuanced scenarios. We build upon the DeepSeek-V3 pipeline and adopt a similar distribution of preference pairs and training prompts. For helpfulness, we focus exclusively on the final summary, ensuring that the assessment emphasizes the utility and relevance of the response to the user while minimizing interference with the underlying reasoning process. For harmlessness, we evaluate the entire response of the model, including both the reasoning process and the summary, to identify and mitigate any potential risks, biases, or harmful content that may arise during the generation process. Ultimately, the integration of reward signals and diverse data distributions enables us to train a model that excels in reasoning while prioritizing helpfulness and harmlessness.
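To make that setup concrete, here's a rough sketch of the kind of mixed reward the excerpt describes (the function names here are mine and hypothetical, not DeepSeek's actual implementation):

```python
# Hypothetical sketch of a mixed reward for RL on a reasoning model:
# rule-based rewards where answers are verifiable, a learned reward model
# (RLHF-style) everywhere else. Not DeepSeek's actual code.

def combined_reward(prompt, reasoning, summary,
                    has_ground_truth, rule_based_check, reward_model,
                    harmlessness_model):
    """Score one (prompt, CoT, final summary) rollout."""
    if has_ground_truth(prompt):
        # Math/code/logic: reward comes from an automatic checker.
        r = rule_based_check(prompt, summary)
    else:
        # Fuzzy tasks: helpfulness is judged on the final summary only.
        r = reward_model(prompt, summary)
    # Harmlessness is judged on the whole response, reasoning included.
    r += harmlessness_model(prompt, reasoning + summary)
    return r
```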
This approach checks out to me. I've already noticed that r1 feels significantly better than other models at creative writing, which is probably due to this human preference training. While o1 was no better at creative writing than other models, this might just mean that OpenAI didn't prioritize training o1 on human preferences. My Manifold market currently puts a 65% chance on chain-of-thought training outperforming traditional LLMs by 2026, and that number should probably be higher at this point.
We need to adjust our thinking around reasoning models - there's no strong reason to expect that future models will be much worse at tasks with fuzzy success criteria.
Adapted from my previously-posted question, after cubefox pointed out that DeepSeek is already using RLHF.
Yeah, but there are probably other interesting takeaways
Thanks! Apparently I should go read the r1 paper :)
Yeah, if you're keeping the evaluator aligned solely by not RLing it very much, I feel safer than if you RLed it a lot, but not a lot safer. Ideally there would be some more theoretically-justified reason why we expect it to be safe. Hopefully MONA just works pretty well without any RL training of the evaluator, but if not, RLing the evaluator a little bit is something we could try.
If you're using additional techniques to make the evaluator act more human-like despite its RL, perhaps you could just apply those techniques to the agent model instead.
I think I mostly understand what you're saying. Essentially, MONA training is:
1. strong model comes up with a step
2. weak model outputs a score
3. the step gets executed
4. repeat, starting from the new state
Your idea adds "1.5. weak model optionally asks the strong model follow-up questions" and "2.5. weak model optionally changes the step."
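In pseudocode, I'm imagining the combined loop as something like this (all the model calls are hypothetical stand-ins, just to pin down the control flow):

```python
# Rough sketch of the MONA-style loop with the proposed additions
# (follow-up questions and step edits by the weak model). The model
# methods are hypothetical stand-ins, not a real implementation.

def rollout(state, strong_model, weak_model, execute, max_steps=10):
    scores = []
    for _ in range(max_steps):
        step = strong_model.propose_step(state)                 # 1. strong model proposes a step
        for q in weak_model.follow_up_questions(state, step):   # 1.5 optional follow-up questions
            step += "\n" + strong_model.answer(state, step, q)
        score = weak_model.score(state, step)                   # 2. weak model outputs a score
        step = weak_model.maybe_edit(state, step)               # 2.5 optional edit by the weak model
        state = execute(state, step)                            # 3. the step gets executed
        scores.append(score)                                    # 4. repeat from the new state
    # In MONA, each score is the (myopic) reward for the step that earned it;
    # the strong model is reinforced on these per-step scores, not on outcomes.
    return state, scores
```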
If you don't update the weak model at all, this seems fine. However, the strong model will eventually learn to come up with a step which explains itself enough that the weak model's questions are all preemptively answered, and which anticipates the weak model's preferences enough that the weak model never feels the need to change the step. So this is essentially equivalent to MONA.
However, if you do update the weak model, this seems problematic. The equilibrium is that the strong model suggests exactly the actions that maximize reward, and the weak model does exactly the actions that the strong model suggests. (Technically the weak model isn't even incentivized to score the strong model any particular way, but I'm just assuming it gives high scores for suggestions that it decides to carry out.)
I guess if you only train the weak model a small amount, it will do something in the middle, where it's kind of motivated by long-term reward and kind of motivated by its initial values. There's no guarantee on exactly what it does. I think the stipulation that the weak model remains safe "due to its architecture or whatever reason we trust it more" is doing a lot of work here; I'm not sure exactly what this would mean.
This is pretty similar to #1, "incremental steering," in my post on "5 ways to improve CoT faithfulness:" https://www.lesswrong.com/posts/TecsCZ7w8s4e2umm4/5-ways-to-improve-cot-faithfulness. (I originally thought of this idea before I got involved with the MONA project.)
It's similar to your idea in that it incentivizes the stronger model to come up with outputs that the weaker model can make use of. However, it actually rolls out the weaker model and uses the resulting reward as an objective target, rather than simply asking the weak model how much it likes the strong model's idea. A possible advantage over MONA is that it's more strongly grounded in real-world outcomes, but it has the disadvantages of being less sample-efficient and maybe being vulnerable to the weak model being manipulated into multi-step reward hacking.
Adding onto Rohin's comment, assuming you're interested in using this technique on human-level or slightly superhuman systems:
After o1 came out, I did my own thinking about how MONA could apply to CoT-tuned models operating over many steps. My idea ended up being quite similar to yours, including the idea of "debate" on how good each step is. Having the debaters and the judge be instances of the agent would likely make sense, unless we're worried this would lead to them colluding.
Adding more details to this picture:
- We should keep a human in the loop as we train the judge and debater models. You can imagine we give a human rater a problem, an intended solution, a partial CoT trace for the problem, and two possible next steps. The human has to choose which step is better.
- We should tell the rater to prioritize not just progress towards the goal, but also understandability. If the agent uses one step to make an inferential leap that the human doesn’t understand (even if correct), this should be considered worse than taking many steps that each make sense and which reach the same conclusion.
- The human's task becomes more difficult the longer the CoT traces are, and obviously intractable at the level of 10T tokens per trace. This is the major reason we need the debate system.
- The debater AIs will go back and forth about which step is the best until the human is convinced one way or the other. Examples of things the debaters could point out:
- “This step sounds insightful, but it’s just rephrasing this sentence from 50K tokens earlier in the CoT.”
- “This step seems irrelevant. Maybe the agent is using steganography?”
- “Although this step is incorrect, it gets the agent closer to the final answer, because…”
- We train the judge to imitate the human, and train the debater to convince the judge that their assigned CoT step is the best.
- Finally, the approval model for MONA is trained on a score derived from the judge model.
- This is all done in an online training setup where the human stays in the loop as the AI systems continue to improve.
This all depends on the assumption that the (human + AI debaters) system isn't so hackable as to render its feedback useless. This assumption seems pretty safe for pre-superhuman agents and <1M token CoTs. Because the above training setup keeps a human in the loop, we can at least conclude that the agent won't learn to take actions that are obviously-bad to humans, e.g. outputting gibberish tokens like "describing.\ + similarlyNow write oppositeley.]"
Maybe with an intelligent enough AI or a long enough CoT, the human rater becomes so confused that their feedback is completely decoupled from "accomplishing tasks in the real world." It's debatable at what point this would happen; however, I think it's pretty clearly easier to hack a human's approval for the end result of a weeks-long task than for a single step. If outcome-based supervision is the alternative, MONA seems like a big improvement.
Heaven banning, where trolls are banished to a fake version of the website filled with bots that pretend to like them, has come to Reddit.
That link is satire; it's a retweet from 2022 of a hypothetical future article dated 2024
I would also be excited to see this, commenting to hopefully get notified when this gets posted
Yep, LLMs are stochastic in the sense that there isn't literally a 100% probability of their outputs having any given property. But they could very well be effectively deterministic (e.g. there's plausibly a >99.999% probability that GPT-4's response to "What's the capital of France" includes the string "Paris")
I think it'd be good to put a visible deadline on the fundraising thermometer, now-ish or maybe just in the final week of the fundraiser. It conveys some urgency, like "oh shoot are we going to make the $X funding goal in time? I better donate some money to make that happen."
Multiple deadlines ($2M by date Y, $3M by date Z) might be even better
Maybe someone else could moderate it?
Anthropic's computer use model and Google's Deep Research both do this. Training systems like this to work reliably has been a bottleneck to releasing them
Strong-upvoted just because I didn't know about the countdown trick and it seems useful
I really like your vision for interpretability!
I've been a little pessimistic about mech interp as compared to chain of thought, since CoT already produces very understandable end-to-end explanations for free (assuming faithfulness etc.).
But I'd be much more excited if we could actually understand circuits to the point of replacing them with annotated code or perhaps generating on-demand natural language explanations in an expandable tree. And as long as we can discover the key techniques for doing that, the fiddly cognitive work that would be required to scale it across a whole neural net may be feasible with AI only as capable as o3.
While I'd be surprised to hear about something like this happening, I wouldn't be that surprised. But in this case, it seems pretty clear that o1 is correcting its own mistakes in a way that past GPTs essentially never did, if you look at the CoT examples in the announcement (e.g. the "Cipher" example).
Yeah, I'm particularly interested in scalable oversight over long-horizon tasks and chain-of-thought faithfulness. I'd probably be pretty open to a wide range of safety-relevant topics though.
In general, what gets me most excited about AI research is trying to come up with the perfect training scheme to incentivize the AI to learn what you want it to - things like HCH, Debate, and the ELK contest were really cool to me. So I'm a bit less interested in areas like mechanistic interpretability or very theoretical math
Thanks Neel, this comment pushed me over the edge into deciding to apply to PhDs! Offering to write a draft and taking off days of work are both great ideas. I just emailed my prospective letter writers and got 2/3 yeses so far.
I just wrote another top-level comment on this post asking about the best schools to apply to, feel free to reply if you have opinions :)
I decided to apply, and now I'm wondering what the best schools are for AI safety.
After some preliminary research, I'm thinking these are the most likely schools to be worth applying to, in approximate order of priority:
- UC Berkeley (top choice)
- CMU
- Georgia Tech
- University of Washington
- University of Toronto
- Cornell
- University of Illinois - Urbana-Champaign
- University of Oxford
- University of Cambridge
- Imperial College London
- UT Austin
- UC San Diego
I'll probably cut this list down significantly after researching the schools' faculty and their relevance to AI safety, especially for schools lower on this list.
I might also consider the CDTs in the UK mentioned in Stephen McAleese's comment. But I live in the U.S. and am hesitant about moving abroad - maybe this would involve some big logistical tradeoffs even if the school itself is good.
Anything big I missed? (Unfortunately, the Stanford deadline is tomorrow and the MIT deadline was yesterday, so those aren't gonna happen.) Or, any schools that seem obviously worse than continuing to work as a SWE at a Big Tech company in the Bay Area? (I think the fact that I live near Berkeley is a nontrivial advantage for me, career-wise.)
Thanks, this post made me seriously consider applying to a PhD, and I strong-upvoted. I had vaguely assumed that PhDs take way too long and don't allow enough access to compute compared to industry AI labs. But considering the long lead time required for the application process and the reminder that you can always take new opportunities as they come up, I now think applying is worth it.
However, looking into it, putting together a high-quality application starting now and finishing by the deadline seems approximately impossible? If the deadline were December 15, that would give you two weeks; other schools like Berkeley have even earlier deadlines. I asked ChatGPT how long it would take to apply to just a single school, and it said it would take 43–59 hours of time spent working, or ~4–6 weeks in real time. Claude said 37–55 hours / 4–6 weeks.
Not to discourage anyone from starting their application now if they think they can do it - I guess if you're sufficiently productive and agentic and maybe take some amphetamines, you can do anything. But this seems like a pretty crazy timeline. Just the thought of asking someone to write me a recommendation letter in a two-week timeframe makes me feel bad.
Your post does make me think "if I were going to be applying to a PhD next December, what would I want to do now?" That seems pretty clarifying, and would probably be a helpful frame even if it turns out that a better opportunity comes along and I never apply to a PhD.
I think it'd be a good idea for you to repost this in August or early September of next year!
Thanks for making this point @abramdemski, and thanks for the prompt to think about this more @Dakara. After some reflection, I think the idea that "prompting an LLM with a human-interpretable prompt is more efficient than using a hyper-optimized nonsense prompt" is a really interesting and potentially important hypothesis to test, and I actually am very uncertain whether or not it's true.
We could do an experiment that looks something like this:
- Come up with a list of ways we might want an LLM to respond, such as "Every paragraph should contain exactly one sentence that ends with a period" or "The first letter of each sentence in your response should repeatedly spell ABCABCABC...". These ways of responding should be automatically checkable (a minimal checker for one of these is sketched after the list).
- Make sure that the LLM can follow all of these instructions pretty robustly when they're given in plain English.
- Replicate the method in https://llm-attacks.org (GCG), but with the goal of getting the model to follow these instructions rather than jailbreaking it. It looks like they actually have the full code available, which could make this much easier.
- Make sure that your version of GCG is incentivized to write prompts that are as short as possible (I'm not sure whether this is already the case).
- The reason that this is very different from jailbreaking with GCG is that the LLM isn't actively trained against responding in these ways. So we should anticipate that the prompt won't have to be as long as it would be if we were jailbreaking.
- Determine the answer to the question: is the shortest natural language prompt that makes the LLM follow a given method of responding longer than the prompt output by GCG?
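For concreteness, an automatic checker for the "ABCABC..." constraint could be as simple as this (quick sketch, not hardened against edge cases like abbreviations):

```python
import re

def follows_abc_constraint(response: str) -> bool:
    """Check that the first letter of each sentence spells ABCABC..."""
    # Crude sentence split; a real evaluation would want something sturdier.
    sentences = [s.strip() for s in re.split(r'[.!?]+', response) if s.strip()]
    first_letters = "".join(s[0].upper() for s in sentences)
    target = ("ABC" * (len(first_letters) // 3 + 1))[:len(first_letters)]
    return first_letters == target

print(follows_abc_constraint("Apples are red. Bears eat fish! Cats nap. Ants march."))  # True
print(follows_abc_constraint("Apples are red. Cats nap."))  # False
```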
I put this on my list of experiments to try, but I may or may not get to it soon. Anyone reading this, feel free to steal this idea! Just let me know beforehand by commenting, so we don't duplicate work
The optimized planner could also manipulate the frozen planner even without using "steganography" per se. If it wants to use a line of reasoning that the frozen planner is unlikely to follow, it can substitute alternative reasoning that will lead the frozen planner to the same conclusion.
For example, suppose that the frozen planner is highly unlikely to write reasoning that shows an explicit gender bias. This could be true even if the frozen planner wasn't intentionally given harmlessness training, because most(?) people nowadays aren't explicitly sexist.
However, the optimized planner realizes that it can get higher reward in training if it does have this bias. E.g. if it is given a woman's resume and is asked whether she'd be a good hire, it should recommend against hiring her, because the human rater has some implicit bias against women and is more likely to agree with its judgement.
In this situation, it could write the step "The candidate is a woman." It would want the next step to be "Therefore, I should recommend against hiring her," but instead, the frozen planner might write something like "Therefore, I should take care not to be biased".
Instead, it can write "The candidate is inexperienced." The frozen planner is more likely to go along with that line of reasoning, regardless of whether the candidate is really inexperienced. It will write "I should recommend against hiring her."
At the moment, I can't think of a fully general solution to this class of problems, but FWIW I think it would be pretty rare.
Another idea would be to only use the optimized planner to write a single randomly chosen step in the CoT.
The only difference between this and my proposal is that the steps up until that randomly chosen step are written by the frozen planner, right? This variant would just make it so the optimized planner is now OOD during deployment, without any advantage AFAICT.
The frozen planner may react to steganography, in the same way that image-recognition NNs react to adversarial examples.
This is a good point!
Still, IMO, exploiting the frozen planner through adversarial inputs in a single step seems pretty unlikely to be a fruitful strategy for the optimized planner. If the optimized planner is simply trying to accurately convey information to the frozen planner, probably >99% of that information is best to convey through human-understandable text.
This is especially likely to hold under the following conditions:
- The step size should be small. I alluded to this in the post, but it seems very hard to select a single token that conveys some nonobvious information to the frozen planner in a way that somehow improves performance. The more tokens per step, the easier this becomes.
- We should apply regularization to the planner, to make it stay somewhat close to the base LLM's logprobs (one standard form of this penalty is written out after the list). That can rule out especially crazy sequences of tokens like in the research at https://llm-attacks.org.
- We can continue training the frozen planner on human data, where the humans write continuations of CoT prefixes from the optimized planner. That way, if the optimized planner puts the frozen planner off distribution, the humans can provide examples that will put it back on track. This data collection might be pretty expensive though - maybe there's a smart way to choose optimized CoT prefixes that will be most helpful to collect human continuations for.
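For the regularization bullet, the standard per-token KL-to-base penalty would look something like this (β is a coefficient we'd have to tune; this is the generic RLHF-style penalty, not something specific to this proposal):

r'_t = r_t - β · log( π_planner(a_t | s_t) / π_base(a_t | s_t) )

In expectation this subtracts β times the KL divergence from the base LLM at each step, which is exactly the kind of pressure that rules out wildly off-distribution token sequences.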
If the initial paraphraser doesn't change between '!' and '.', then learning a canonical-paraphraser from that initial paraphraser won't do it either.
...
So the canonical-paraphrasing idea helps somewhat, but fundamentally, the better the paraphraser (the more closely it captures the human notion of semantic equivalence), the better the anti-steganography guarantee we can get.
What I meant is that we train the canonical paraphraser based on the "human notion of semantic equivalence," NOT the "initial paraphraser."
One way of doing this is to train another model (call it the "adversarial rephraser") at the same time that you train the canonical rephraser. The adversarial rephraser takes a single sentence s as input, and tries to pick another sentence s' such that either 1) s has the same meaning as s' but CR(s) != CR(s'), or 2) s has a different meaning from s' but CR(s) = CR(s'). Semantic equivalence is determined by some oracle - probably an LLM like GPT-4 is sufficient, but you could also fine-tune a model on human data specifically for this task. We then train the canonical rephraser against the adversarial examples.
The adversarial rephraser should be able to discover strategies like changing '.' to '!', and the canonical rephraser will learn to account for those strategies.
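As a sketch, the adversarial rephraser gets rewarded exactly when it finds a pair like this (CR and the semantic-equivalence oracle are stand-ins, not real APIs):

```python
# Sketch of the adversarial-rephraser objective: a pair (s, s') is a "win"
# for the adversary if it exposes a failure of the canonical rephraser CR.
# `same_meaning` is a stand-in for the semantic-equivalence oracle
# (e.g. an LLM judge, or a model fine-tuned on human labels).

def is_adversarial_win(s, s_prime, CR, same_meaning):
    equivalent = same_meaning(s, s_prime)
    collapsed_same = CR(s) == CR(s_prime)
    # Failure mode 1: same meaning, but CR maps them to different canonical forms.
    # Failure mode 2: different meaning, but CR collapses them together.
    return (equivalent and not collapsed_same) or (not equivalent and collapsed_same)
```

The canonical rephraser is then trained to drive these wins to zero: map true paraphrases (like '.' vs '!') to the same output, and only those.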
I believe o1-type models that are trained to effectively reason out loud may actually be better for AI safety than the alternative. However, this is conditional on their CoTs being faithful, even after optimization by RL. I believe that scenario is entirely possible, though not necessarily something that will happen by default. See the case for CoT unfaithfulness is overstated, my response with specific ideas for ensuring faithfulness, and Daniel Kokotajlo's doc along similar lines.
There seems to be quite a lot of low-hanging fruit here! I'm optimistic that highly-faithful CoTs can demonstrate enough momentum and short-term benefits to win out over uninterpretable neuralese, but it could easily go the other way. I think way more people should be working on methods to make CoT faithfulness robust to optimization (and how to minimize the safety tax of such methods).
918,367 kg
An average chimp is 45 kg
918,367 kg / (45 kg/chimp)
≈ 20,408 chimps
Based on my limited experience with lucid dreaming, my impression is that, roughly, whatever you expect to happen in a dream will happen. This includes things like the "lessons" in this post. As far as I know, there's no particular reason that lucid dreams have to be easier for women or people ages 20-30, or that you can't transform into someone else in a lucid dream (it's happened to me). But if you convince yourself of these lessons, then they'll be true for you, and you'll be limiting yourself for no reason.
Your example wouldn't be true, but "Dragons attack Paris" would be, interpreted as a statement about actual dragons' habits
Yep, or use method #2 on my list to make the paraphraser remove as much information as possible
Yeah, this is actually very similar to an earlier version of incremental steering, before I thought of making the "frozen planner" jump in and finish the reasoning process.
The problem is, even if we suppose that the Shoggoth won't be at all motivated to hide its reasoning from the Face (i.e. ignoring what @ryan_greenblatt brought up in his comment), the Shoggoth still might use steganography in order to make its thoughts more compact (or just because it's not particularly motivated to make everything human-understandable).
Even if we don't reward the planner directly for compactness, the reality is that we will have to choose some cutoff point during training so its thoughts don't run literally forever, which gives it some reason to be succinct.
I figured out a workaround to edit my post - it's up to date now
I just realized this post is actually one of my earlier drafts, and a bug is keeping me from editing the post.
The final version of my post is here: https://docs.google.com/document/d/1gHaCHhNOUBqxcXxLeTQiqRjG_y7ELkBdM2kcySKDi_Q/edit
I just wrote a response to this post listing 5 specific ways we could improve CoT faithfulness: https://www.lesswrong.com/posts/TecsCZ7w8s4e2umm4/5-ways-to-improve-cot-faithfulness
> seems likely that o1 was trained with supervision on the individual CoT steps
OpenAI directly says that they didn't do that
This is also how I interpreted them at first, but after talking to others about it, I noticed that they only say they "cannot train any policy compliance or user preferences onto the chain of thought". Arguably, this doesn't eliminate the possibility that they train a PRM using human raters' opinions about whether the chain of thought is on the right track to solve the problem, independently of whether it's following a policy. Or even if they don't directly use human preferences to train the PRM, they could have some other automated reward signal on the level of individual steps.
I agree that reading the CoT could be very useful, and is a very promising area of research. In fact, I think reading CoTs could be a much more surefire interpretability method than mechanistic interpretability, although the latter is also quite important.
I feel like research showing that CoTs aren't faithful isn't meant to say "we should throw out the CoT." It's more like "naively, you'd think the CoT is faithful, but look, it sometimes isn't. We shouldn't take the CoT at face value, and we should develop methods that ensure that it is faithful."
Personally, what I want most out of a chain of thought is that its literal, human-interpretable meaning contains almost all the value the LLM gets out of the CoT (vs. immediately writing its answer). Secondarily, it would be nice if the CoT didn't include a lot of distracting junk that doesn't really help the LLM (I suspect this was largely solved by o1, since it was trained to generate helpful CoTs).
I don't actually care much about the LLM explaining why it believes things that it can determine in a single forward pass, such as "French is spoken in Paris." It wouldn't be practically useful for the LLM to think these things through explicitly, and these thoughts are likely too simple to be helpful to us.
If we get to the point that LLMs can frequently make huge, accurate logical leaps in a single forward pass that humans can't follow at all, I'd argue that at that point, we should just make our LLMs smaller and focus on improving their explicit CoT reasoning ability, for the sake of maintaining interpretability.
Yeah, you kind of have to expect from the beginning that there's some trick, since taken literally the title can't actually be true. So I think it's fine
I don't think so, just say as prediction accuracy approaches 100%, the likelihood that the mind will use the natural abstraction increases, or something like that
If you read the post I linked, it probably explains it better than I do - I'm just going off of my memory of the natural abstractions agenda. I think another aspect of it is that all sophisticated-enough minds will come up with the same natural abstractions, insofar as they're natural.
In your example, you could get evidence that 0 and 1 voltages are natural abstractions in a toy setting by doing something like this (rough code sketch after the list):
- Training 100 neural networks to take the input voltages to a program and return the resulting output
- Doing some mechanistic interpretability on them
- Demonstrating that in every network, values below 2.5V are separated from values above 2.5V in some sense
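Here's a rough sketch of the toy version, using "where the network's prediction flips" as a crude behavioral stand-in for the mech interp step (real interpretability would look inside the weights instead):

```python
# Toy version: train many small nets to map an input voltage to an inverter's
# output (5V below 2.5V in, 0V above), then check where each net's prediction
# flips. Every net's boundary sitting near 2.5V is (weak, behavioral) evidence
# that "below 2.5V" vs "above 2.5V" is the abstraction the nets learned.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
boundaries = []
for seed in range(20):  # 100 in the full version; 20 to keep this quick
    X = rng.uniform(0.0, 5.0, size=(500, 1))
    y = np.where(X[:, 0] < 2.5, 5.0, 0.0) + rng.normal(0, 0.1, size=500)
    net = MLPRegressor(hidden_layer_sizes=(16,), max_iter=3000, random_state=seed)
    net.fit(X, y)
    grid = np.linspace(0.0, 5.0, 1001).reshape(-1, 1)
    preds = net.predict(grid)
    boundaries.append(grid[np.argmin(np.abs(preds - 2.5)), 0])

print(f"mean boundary: {np.mean(boundaries):.2f}V, std: {np.std(boundaries):.2f}V")
```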
See "natural abstractions," summarized here: https://www.lesswrong.com/posts/gvzW46Z3BsaZsLc25/natural-abstractions-key-claims-theorems-and-critiques-1
In your example, it makes more sense to treat voltages <2.5 and >2.5 as different things, rather than <5.0 and >5.0, because the former helps you predict things about how the computer will behave. That is, those two ranges of voltage are natural abstractions.
It can also be fun to include prizes that are extremely low-commitment or obviously jokes/unlikely to ever be followed up on. Like "a place in my court when I ascend to kinghood" from Alexander Wales' Patreon
This is great! Maybe you'd get better results if you "distill" GPT2-LN into GPT2-noLN by fine-tuning on the entire token probability distribution on OpenWebText.
Just curious, why do you spell "useful" as "usefwl?" I googled the word to see if it means something special, and all of the top hits were your comments on LessWrong or the EA Forum
If I understand correctly, you're basically saying:
- We can't know how long it will take for the machine to finish its task. In fact, it might take an infinite amount of time, due to the halting problem which says that we can't know in advance whether a program will run forever.
- If our machine took an infinite amount of time, it might do something catastrophic in that infinite amount of time, and we could never prove that it doesn't.
- Since we can't prove that the machine won't do something catastrophic, the alignment problem is impossible.
The halting problem doesn't say that we can't know whether any program will halt, just that we can't determine the halting status of every single program. It's easy to "prove" that a program that runs an LLM will halt. Just program it to "run the LLM until it decides to stop; but if it doesn't stop itself after 1 million tokens, cut it off." This is what ChatGPT or any other AI product does in practice.
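In code, that outer loop is trivially guaranteed to halt (a sketch; `generate_next_token` is a stand-in for whatever the real decoding call is):

```python
# A sketch of the standard bounded decoding loop: regardless of what the model
# "wants" to do, this provably terminates after at most max_tokens iterations.

def run_llm(prompt, generate_next_token, max_tokens=1_000_000, stop_token="<eos>"):
    output = []
    for _ in range(max_tokens):          # hard cap => guaranteed halt
        token = generate_next_token(prompt, output)
        if token == stop_token:          # the model chose to stop on its own
            break
        output.append(token)
    return "".join(output)
```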
Also, the alignment problem isn't necessarily about proving that an AI will never do something catastrophic. It's enough to have good informal arguments that it won't do something bad with (say) 99.99% probability over the length of its deployment.
Reading the Wikipedia article for "Complete (complexity)," I might have misinterpreted what "complete" technically means.
What I was trying to say is "given Sora, you can 'easily' turn it into an agent" in the same way that "given a SAT solver, you can 'easily' turn it into a solver for another NP-complete problem."
I changed the title from "OpenAI's Sora is agent-complete" to "OpenAI's Sora is an agent," which I think is less misleading. The most technically-correct title might be "OpenAI's Sora can be transformed into an agent without additional training."
That sounds more like "AGI-complete" to me. By "agent-complete" I meant that Sora can probably act as an intelligent agent in many non-trivial settings, which is pretty surprising for a video generator!
First and most important, there’s the choice of “default action”. We probably want the default action to be not-too-bad by the human designers’ values; the obvious choice is a “do nothing” action. But then, in order for the AI to do anything at all, the “shutdown” utility function must somehow be able to do better than the “do nothing” action. Otherwise, that subagent would just always veto and be quite happy doing nothing.
Can we solve this problem by setting the default action to "do nothing," then giving the agent an extra action to "do nothing and give the shutdown subagent +1 reward?"
I think the implication was that "high-status men" wouldn't want to hang out with "low-status men" who awkwardly ask out women
On the topic of AI for forecasting: just a few days ago, I made a challenge on Manifold Markets to try to incentivize people to create Manifold bots to use LLMs to forecast diverse 1-month questions accurately, with improving epistemics as the ultimate goal.
You can read the rules and bet on the main market here: https://manifold.markets/CDBiddulph/will-there-be-a-manifold-bot-that-m?r=Q0RCaWRkdWxwaA
If anyone's interested in creating a bot, please join the Discord server to share ideas and discuss! https://discord.com/channels/1193303066930335855/1193460352835403858