Posts

What working on AI safety taught me about B2B SaaS sales 2025-02-04T20:50:19.990Z
purple fire's Shortform 2025-02-03T03:00:00.293Z

Comments

Comment by purple fire (jack-edwards) on What working on AI safety taught me about B2B SaaS sales · 2025-02-07T05:18:27.898Z · LW · GW

Are you including models that are only used by their creator firm? I work as an ML researcher in big tech (I want to keep this account anon, but it's one of MSFT/OAI, DM, Meta, Anthropic, xAI) and have access to tooling substantially better than what's commercially available (proof by existence?), but that's not really what my post is about. My main model for this actually panning out is something like:

  • Big tech company has control over AI lab
  • AI lab makes cracked SWE agent
  • Big tech company notices that releasing that SWE agent will undermine the rest of their software development business, so instead of licensing it out they only make it available to their own staff and perhaps business allies

I'm just clarifying because it's barely even confidential information that engineers at AI labs have better models than engineers at small or mid-size tech firms, and I want to check what you're actually betting on.

Comment by purple fire (jack-edwards) on What working on AI safety taught me about B2B SaaS sales · 2025-02-05T20:39:34.661Z · LW · GW

Sorry, I can elaborate better on the situation. The big tech companies know that they can pay way more than smaller competitors, so they do. But then that group of megacorp tech firms (Google, Amazon, Meta, etc.) colludes to prevent runaway race dynamics. This is how they're able to optimize their costs under the constraint that salaries stay high enough to stifle competition. Here, I was just offering evidence for my claim that big tech is a monopsonistic cartel in the SWE labor market; it isn't really evidence one way or another for the claims I make in the original post.

Comment by purple fire (jack-edwards) on What working on AI safety taught me about B2B SaaS sales · 2025-02-05T19:58:30.550Z · LW · GW

If https://outtalent.com/us50/ is to be believed, skilled SWEs look pretty concentrated at the top ~5 companies and their subsidiaries. Do you think that data is incorrect?

Concretely, I would claim that >80% of the most skilled software engineers in the US work at <10 companies. Edit: I thought about it more and I think this is actually more like 65% at the 10 biggest companies, but that doesn't change my central claims.

I also disagree with your claim that they are not a cartel. I think the biggest tech companies collude to fix wages so that they are sufficiently higher than every other company's salaries to stifle competition, while also limiting race dynamics to maintain profits. I think this is done in the form of selectively enforced non-competes, illegal non-poaching agreements, and other shady practices. This has been alleged in court and the companies just settle every time, e.g. https://www.nytimes.com/2014/03/01/technology/engineers-allege-hiring-collusion-in-silicon-valley.html?unlocked_article_code=1.uk4.A5Sn.q5fVDfF_q8Wk&smid=url-share

For those disagreeing--
1. I continue to believe that tech companies derive much of their economic power from cornering the skilled engineering labor market,

2. this is highly threatened by the advent of AI capable of coding,

3. and thus many big tech companies have massive economic incentives to limit the general public's access to models that can code well.

If I changed my mind about any of those 3 points, I would change my mind about the main post. Rather than downvoting, or in addition to it, can you please explain which part you disagree with and why? It will be more productive for everyone and I am open to changing my mind.

Comment by purple fire (jack-edwards) on What working on AI safety taught me about B2B SaaS sales · 2025-02-05T06:57:36.581Z · LW · GW

I also think monopolizing talent enables software companies to make sure those high fixed costs stay nice and high.

If you disagreed with this, is it because you think it is literally false or because you don't agree with the implied argument that software companies are doing this on purpose?

Comment by purple fire (jack-edwards) on What working on AI safety taught me about B2B SaaS sales · 2025-02-05T06:56:41.386Z · LW · GW

Hm, this violates my model of the world.

there are too many AI companies for this deal to work on all of them

Realistically, I think there are like 3-4 labs[1] that matter: OAI, DM, Anthropic, Meta.

some of these AI companies will have strong kinda-ideological commitments to not doing this

Even if that were true, they would still be at the whim of investors, who are almost all big tech companies.

this is better done by selling (even at a lower revenue) to anyone who wants an AI SWE than selling just to Oracle.

This is the explicit claim I was making with the WTP argument. I think this is firmly not true, and OpenAI will make more money by selling just to Oracle. What evidence causes you to disagree?

  1. ^

    American/Western labs.

Comment by purple fire (jack-edwards) on What working on AI safety taught me about B2B SaaS sales · 2025-02-05T01:45:39.677Z · LW · GW

I don't disagree with most of what you said, maybe I should have been more explicit about some of the points related to that. In particular, I do think "the success of B2B SaaS over bespoke solutions is adequately explained by economies of scale" is true. But I think the reason there are economies of scale is that there are really high fixed costs and really low variable costs. I also think monopolizing talent enables software companies to make sure those high fixed costs stay nice and high.

With AI, engineering talent becomes cheap and plentiful. When that happens, fixed costs will plummet unless firms can control access to AI. If fixed costs plummet, economies of scale go away and the savings from the SaaS model get outweighed by the marginal benefit of bespoke solutions.

what typically happens is that Amir forks out for Slack, or some competitor, while Amir's engineers work on software that generates revenue.

To push back a little on this, as software companies grow they do try to do this less and less. How much enterprise software do you think Microsoft or Google is outsourcing? As soon as something becomes even a small dependency, they usually just acquire the company.

In fairness, I don't think this process will be rapid, nothing in B2B SaaS is. But I think tech companies see it on the horizon.

Comment by purple fire (jack-edwards) on Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals · 2025-02-04T18:45:31.028Z · LW · GW

Sure, I think social media is probably the best example of this. Suppose there are two platforms, A and B, and a social media site is worth more when more people are on it. Our "resource allocation" problem is to maximize utility, so we want to get everyone on the same site. There are two equilibria here; we can either set the price for A much higher than B and everyone will move to B, or vice versa.

If the demand functions weren't interdependent and every agent just got some amount of utility from A and some amount of utility from B, there would be exactly one equilibrium price.
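
As a toy formalization of this (the specific utility functions are my own illustrative choice, not anything canonical):

```latex
% N identical users; a user who joins platform $i \in \{A, B\}$ gets
%   u_i = n_i - p_i,
% where $n_i$ is the number of users who end up on $i$ and $p_i$ is its price.
%
% Interdependent demand: if everyone expects everyone else to join A, then
% staying on A is a best response whenever $N - p_A \ge 1 - p_B$, so "all on A"
% is self-fulfilling; by symmetry, so is "all on B". Two equilibria.
%
% Independent demand: replace $n_i$ with a fixed valuation $v_i$, so each user
% just compares $v_A - p_A$ with $v_B - p_B$ and the market-clearing price
% vector is pinned down uniquely.
\[
  u_i = n_i - p_i \ \ \text{(interdependent)}
  \qquad\text{vs.}\qquad
  u_i = v_i - p_i \ \ \text{(independent)}
\]
```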

Comment by purple fire (jack-edwards) on Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals · 2025-02-04T16:09:31.629Z · LW · GW

Hm, I think that's more of a supply independence thing, what economists would call "non-excludable". If the government funds a police force, it's not as if they protect some citizens but not others. But that's not a violation of the demand independence assumption because I care about living in a country with a strong police force regardless of whether you want that or not.

Goods with interdependent demand, from Ferraris to Facebook, generally do get provided by markets in real life; they just don't have stable prices. It breaks some of the equilibrium models because it can cause divergences or circularity in your demand function, and then there's no fixed point in positive price/demand space.

Luxury goods are a real-life example of this. Here's an intuition-prompting setup:

  1. Suppose I'm rich and I buy a Gucci bag
  2. You're poor, but you want to look rich so you also buy a Gucci handbag
  3. Now I don't think the bag is as exclusive, so I don't want mine any more
  4. Now that the rich guy isn't wearing it anymore, you don't want yours either
  5. But now no one has it, so it seems exclusive again, so now I want it
  6. Repeat

This doesn't mean markets won't provide Gucci bags (obviously, they do), but there isn't a price equilibrium, it will fluctuate forever. In terms of the original point, the Gucci bag allocation problem isn't isomorphic to a market equilibrium, because there is no such equilibrium.

Comment by purple fire (jack-edwards) on purple fire's Shortform · 2025-02-03T03:00:00.279Z · LW · GW

People should take the time to understand the ecosystem and economics of enterprise software. I expect that to be a major determinant of the incentive landscape around AGI and have found in conversations that people often don't really get it.

Comment by purple fire (jack-edwards) on Daniel Kokotajlo's Shortform · 2025-02-02T21:32:16.443Z · LW · GW

I'm not sure your idea about training two different CoT processes and penalizing divergence would work...

Me neither--this is something I'm researching now. But I think it's a promising direction and one example of the type of experiment we could do to work on this.

if their models are closed, they probably want to hide the CoT so others can't train on it / distill it

This could be a crux? I expect most of the economics of powerful AI development to be driven by enterprise use cases, not consumer products.[1] In that case, I think faithful CoT is a strong selling point, and it's almost a given that there will be data provenance/governance systems carefully restricting CoT access to approved use cases. I also think there's an incentive for the CoT to be relatively faithful even if only a paraphrased version is available to the public, like ChatGPT has now. When I give o3 a math problem, I want to see the steps used to solve it, and if the chain is unfaithful, the face model can't show me that.

I also think legible CoT is useful in multi-agent systems, which I expect to become more economically valuable in the next year. Again, there's the advantage that the space of unfaithful vocabulary is enormous. If I want a multi-agent system with, say, a chatbot, coding agent, and document retrieval agent, it might be useful for their chains to all be in the same "language" so they can make decisions based on each other's output. If they are just blindly RL'ed separately, the whole system probably doesn't work as well. And if they're RL'ed together, you have to do that for every unique composition of agents, which is obviously costlier. Concretely, I would claim that "using natural language tokens that are easy for humans to understand is not the absolute most efficient way for artificial minds to think" is true, but I would also say that "using natural language tokens that are easy for humans to understand is the most economically productive way for AI tools to work" is true.

Re: PR reasons--yeah, I agree that this disincentivizes making the CoT visible to consumers, but I'm not sure it has an impact on faithfulness.

This is getting a little lengthy, it may be worth a post if I have time soon :) But happy to keep chatting here as well!

  1. ^

    My epistemically weak hot take is that ChatGPT is effectively just a very expensive recruitment tool to get talented engineers to come work on enterprise AI, lol

Comment by purple fire (jack-edwards) on Daniel Kokotajlo's Shortform · 2025-02-02T20:59:19.032Z · LW · GW

Yes, this is the exact setup that caused me to dramatically update my P(Alignment) a few months ago! There are also some technical tricks you can do to make this work well--for example, you can take advantage of the fact that there are many ways to be unfaithful and only one way to be faithful, train two different CoT processes at each RL step, and add a penalty for divergence.[1] Ditto for periodic paraphrasing, reasoning in multiple languages, etc.
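
As a very rough sketch of the kind of penalty I have in mind (the function names, the equal-length assumption, and the specific KL formulation are my own illustrative choices, not a worked-out method):

```python
import torch
import torch.nn.functional as F

LAMBDA = 0.1  # weight on the divergence penalty (assumed hyperparameter)

def two_cot_objective(logits_a: torch.Tensor,
                      logits_b: torch.Tensor,
                      reward_a: float,
                      reward_b: float) -> torch.Tensor:
    """Toy RL objective for the 'two CoTs + divergence penalty' idea.

    logits_a, logits_b: per-token logits of two independently sampled chains
    of thought for the same prompt, shape [seq_len, vocab]. (Assumes equal
    length for simplicity; a real version would need alignment or pooling.)
    reward_a, reward_b: task rewards, e.g. 1.0 if the final answer is correct.
    """
    # Reward each chain for getting the task right...
    reward = reward_a + reward_b
    # ...but penalize the two chains for diverging from each other. The hope:
    # there are many ways to encode unfaithful reasoning, but they are unlikely
    # to coincide, so agreement between chains favors the faithful one.
    divergence = F.kl_div(F.log_softmax(logits_a, dim=-1),
                          F.softmax(logits_b, dim=-1),
                          reduction="batchmean")
    return reward - LAMBDA * divergence
```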

I'm curious to hear more about why you don't expect companies to invest much into this. I actually suspect that it has a negative alignment tax. I know faithful CoT is something a lot of customers want--it's often just as valuable to see exactly how a model solved your math problem as it is to get the answer. There's also an element of stickiness. If your Anthropic agents work in neuralese, and then OpenAI comes out with a better model, the chains generated by your Anthropic agents can't be passed to the better model. This also makes it harder for orgs to use agents developed by multiple different labs in a single workflow. These are just a few of the reasons I expect faithful CoT to be economically incentivized, and I'm happy to discuss more of my reasoning or hear more counterarguments if you're interested in chatting more!

  1. ^

    To be clear, this is just one concrete example of the general class of techniques I hope people work on around this.

Comment by purple fire (jack-edwards) on Daniel Kokotajlo's Shortform · 2025-02-02T19:22:40.754Z · LW · GW

I'm curious how your rather doom-y view of Steps 3 and 4 interacts with your thoughts on CoT. It seems highly plausible that we will be able to reliably incentivize CoT faithfulness during Step 3 (I know of several promising research directions for this), which wouldn't automatically improve alignment but would improve interpretability. That interpretable chain of thought can be worked into a separate model or reward signal to heavily penalize divergence from the locked-in character, which--imo--makes the alignment problem under this training paradigm meaningfully more tractable than with standard LLMs. Thoughts?

Comment by purple fire (jack-edwards) on Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals · 2025-02-01T20:47:32.369Z · LW · GW

Ah, well if there are interdependent demand functions then there is no guarantee of a (general) equilibrium, so the starting resource allocation problem may have 0 or many solutions. So I would say no, the isomorphism doesn't hold in that case.

Comment by purple fire (jack-edwards) on Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals · 2025-02-01T20:08:30.466Z · LW · GW

I'm pretty sure they need to be at least locally true to guarantee what I said. Are there specific scenarios you're imagining?

For what it's worth, in most cases you'll probably be able to reframe the resource allocation problem so that these assumptions hold. For example, you can account for negative-utility goods by just taking the complement and giving it a positive price. If you want to account for agents assigning a "negative" price to pollution, you can instead model how much agents value clean air and give that a positive price to satisfy monotonic utility.
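
A one-line version of that reframing, in my own notation:

```latex
% Let pollution $x \in [0, X_{\max}]$ enter utility negatively. Define the
% complementary good "clean air" $c := X_{\max} - x$. If utility is decreasing
% in $x$, it is increasing in $c$:
\[
  c := X_{\max} - x, \qquad u(x) = v(c) \quad \text{with} \quad v'(c) > 0,
\]
% so $c$ satisfies monotonic utility and can be priced as an ordinary scarce good.
```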

Comment by purple fire (jack-edwards) on Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals · 2025-01-31T19:29:29.878Z · LW · GW

...theorem that says that the optimal solution of a complex resource allocation problem is isomorphic to a system where things have prices.

 

This is true if we interpret "optimal solution" as the set of Pareto efficient allocations and make some very general assumptions about monotonic utility and demand independence![1] The idea of the proof is to take the dual of the resource allocation problem to turn it into a resource valuation problem; by the strong duality theorem these have the same optima (the solution to the dual of a resource allocation problem is a vector of prices). This is an Arrow-Debreu model, so it is known that markets will clear at these prices. By the First Welfare Theorem, we know that markets that clear result in a Pareto efficient outcome, and by the Second Welfare Theorem, we know that any Pareto efficient outcome can be supported by some set of prices produced by this process. Any process optimizing an allocation of scarce resources is isomorphic to a set of utility-maximizing agents exchanging those resources in a free market. Pretty cool!

  1. ^

    Agents weakly prefer more of a good to less of it, and agents' demand for a good doesn't depend on other people having it.
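
For anyone who wants the duality step spelled out, here is a stripped-down linear version (standard LP duality; the argument above is about the more general convex Arrow-Debreu setting):

```latex
% Primal: choose an allocation $x \ge 0$ of scarce resources to maximize a
% linear objective $u^{\top} x$ subject to resource limits $b$:
\[
  \max_{x \ge 0} \; u^{\top} x \quad \text{s.t.} \quad A x \le b .
\]
% Dual: choose nonnegative prices $p$ on the resource constraints:
\[
  \min_{p \ge 0} \; b^{\top} p \quad \text{s.t.} \quad A^{\top} p \ge u .
\]
% Strong duality says the two optima coincide, and the optimal $p^{*}$ is a
% vector of supporting (shadow) prices for the resources -- the bridge between
% "optimal allocation" and "market prices" used in the argument above.
```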

Comment by purple fire (jack-edwards) on AGI with RL is Bad News for Safety · 2024-12-27T04:33:07.228Z · LW · GW

These are close but not quite the claims I believe.

I do think that CoT systems based on pure LLMs will never be that good at problem-solving because a webtext-trained assistant just isn't that good at working with long chains of reasoning. I think any highly capable CoT system will require at least some RL (or be pre-trained on synthetic data from another CoT system that was trained with RL, but I'm not sure it makes a difference here). I'm a little less confident about whether pure LLMs will be disincentivized--for example, labs might stop developing CoT systems if inference-time compute requirements are too expensive--but I think labs will generally move more resources toward CoT systems.

I think the second two points are best explained with an example, which might clarify how I'm approaching the question.

Suppose I make two LLMs, GPT large (with more parameters) and GPT small (with fewer). I pre-train them on webtext and then I want to teach them how to do modular addition, so I create a bunch of synthetic data of input-output pairs like {6 + 2 mod 5, 3} and finetune the LLMs with the synthetic data to output a single answer, using the difference between their output and the answer as a loss function. GPT large becomes very good at this task, and GPT small does not.

So I create a new dataset of input-output pairs like {Solve 6 + 2 mod 5 step-by-step, writing out your reasoning at each step. Plan your approach ahead of time and periodically reflect on your work., 3}. I train GPT small on this dataset, but when it gets the answer right I reward the entire chain of thought, not just the token with the answer. This approach incentivizes GPT small to use a CoT to solve the problem, and now it performs as well as GPT large did with regular finetuning.[1]
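
To make the setup more concrete, here is a minimal sketch of the two synthetic datasets and the whole-chain reward (the function names and details are mine, and the footnote's caveat about oversimplification applies here too):

```python
import random

def make_direct_example():
    """Synthetic pair for the 'GPT large' finetuning setup: prompt -> answer."""
    a, b, m = random.randint(0, 9), random.randint(0, 9), random.randint(2, 9)
    return {"prompt": f"{a} + {b} mod {m}", "answer": str((a + b) % m)}

def make_cot_example():
    """Synthetic pair for the 'GPT small' setup: the prompt asks for
    step-by-step reasoning, and only the final answer is checked."""
    a, b, m = random.randint(0, 9), random.randint(0, 9), random.randint(2, 9)
    prompt = (f"Solve {a} + {b} mod {m} step-by-step, writing out your "
              "reasoning at each step. Plan your approach ahead of time and "
              "periodically reflect on your work.")
    return {"prompt": prompt, "answer": str((a + b) % m)}

def chain_reward(generated_text: str, answer: str) -> float:
    """Reward assigned to the *entire* chain of thought: 1.0 if the final
    answer is correct, 0.0 otherwise. Every token in the chain shares this
    reward, which is what incentivizes the model to use the chain at all."""
    return 1.0 if generated_text.strip().endswith(answer) else 0.0
```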

In the end, I have two equally capable (at modular arithmetic) systems--GPT large, which was trained only with finetuning, and GPT small, which was trained with finetuning + open-ended RL. I have a few claims here:

  • GPT small's CoT is likely to reflect how it's actually solving the problem. It couldn't do the problems pre-RL, so we know it isn't just solving them internally and backfilling a plausible explanation. We can prevent steganography by doing things like periodically paraphrasing the CoT or translating it between languages. We can also verify this by replacing the CoT with plausible but incorrect explanations and checking that task performance degrades (see the sketch after this list).
  • For this reason, GPT small is much more interpretable, since we can look at the CoT to understand how it solved the problem. GPT large, on the other hand, is still a complete black box--we don't know how it's solving the problem. When we finetuned it, GPT large learned how to do these problems in a single forward pass, making it incredibly hard to understand its reasoning.
  • And for this reason, GPT small is also easier to align. We can monitor the CoT to make sure it's actually doing modular arithmetic. In contrast, GPT large might be doing something that locally approximates modular arithmetic but behaves unpredictably outside the training distribution. In fact, if we deploy GPT small in out-of-distribution contexts (such as inputting negative numbers), the CoT will likely provide useful information about how it plans to adapt and what its problem-solving approach will be.
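
Here is roughly what I mean by that verification test. The model interface (generate_cot, answer_with_prefix) is a hypothetical placeholder, not a real API:

```python
def cot_ablation_test(model, problems, corrupt):
    """Compare accuracy when the model continues from its own CoT vs. from a
    plausible-but-incorrect CoT produced by `corrupt`. If accuracy does not
    drop under corruption, the CoT probably isn't load-bearing.

    problems: list of (prompt, correct_answer) pairs.
    corrupt: function mapping a CoT string to a plausible but wrong one.
    """
    clean_correct, corrupted_correct = 0, 0
    for prompt, answer in problems:
        cot = model.generate_cot(prompt)                       # hypothetical
        clean = model.answer_with_prefix(prompt, cot)          # hypothetical
        corrupted = model.answer_with_prefix(prompt, corrupt(cot))
        clean_correct += (clean == answer)
        corrupted_correct += (corrupted == answer)
    n = len(problems)
    return clean_correct / n, corrupted_correct / n
```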

I am much more excited about building systems like GPT small than I am about building systems like GPT large. Do you disagree (with the overall point, with any of the subpoints, or with this example's generality)?

 

P.S. I am enjoying this discussion, I feel that you've been very reasonable and I continue to be open to changing my mind about this if I see convincing evidence :)

  1. ^

    Oversimplified obviously, but details shouldn't matter here

Comment by purple fire (jack-edwards) on AGI with RL is Bad News for Safety · 2024-12-27T02:50:37.714Z · LW · GW

I agree that CoT faithfulness isn't something we should assume occurs by default (although it seems like it does to some extent). My claim is that CoT faithfulness is a tractable problem and that people have already made meaningful steps toward guaranteeing faithfulness.

Happy to discuss this further, but have you read e.g. https://www.lesswrong.com/posts/HQyWGE2BummDCc2Cx/the-case-for-cot-unfaithfulness-is-overstated

Comment by purple fire (jack-edwards) on How to (hopefully ethically) make money off of AGI · 2024-12-24T20:41:54.827Z · LW · GW

You would have also very strongly outperformed the S&P 500. That is quite good. 

When you backtest stock picks against SPY, you usually want to compare that portfolio to holding SPY levered to the same volatility (or just compare Sharpe ratios). Having a higher total return might mean that you picked good stocks, or it might just mean that you took on more risk. People generally care about return on risk rather than return on dollars, since sophisticated investors can take on ~unlimited amounts of leverage for sufficiently derisked portfolios.

In this case, the portfolio has a Sharpe ratio of 2.0, which is indeed pretty good, especially for an unhedged long equity portfolio, so props to NoahK! (When I worked at a hedge fund, an estimated 2 Sharpe was our threshold for trades.) But it's not as much of an update as a 60% annual return would suggest on the surface.
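
For concreteness, here is roughly the comparison I have in mind (a simplified sketch: daily returns, zero financing costs, no rebalancing subtleties):

```python
import numpy as np

def sharpe(returns, risk_free_annual=0.0, periods_per_year=252):
    """Annualized Sharpe ratio from periodic (e.g. daily) returns."""
    excess = np.asarray(returns) - risk_free_annual / periods_per_year
    return np.sqrt(periods_per_year) * excess.mean() / excess.std()

def vol_matched_spy(portfolio_returns, spy_returns):
    """Lever (or de-lever) SPY so its volatility matches the portfolio's,
    giving a like-for-like benchmark. Ignores financing costs and other
    frictions, so treat the output as a rough comparison only."""
    lev = np.std(portfolio_returns) / np.std(spy_returns)
    return lev * np.asarray(spy_returns)

# Usage: compare sharpe(portfolio) to sharpe(spy), or total portfolio returns
# to total returns of vol_matched_spy(portfolio, spy).
```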

That is... OK. Honestly, also looking at the composition of this index fund, I am not very impressed. Making only a 12% return in the year 2024 on AI stocks does feel like you failed at actually indexing on the AI market.

IRBO's Sharpe ratio is below 1, which is pretty awful. In my not-financial-advice opinion, IRBO is uninvestable, and looking at the top holdings at the time of this interview was enough to recognize that (MSTR is basically a levered crypto & interest rate instrument, SPLK was a merger arb trade, etc.).

Comment by purple fire (jack-edwards) on AGI with RL is Bad News for Safety · 2024-12-23T17:55:36.528Z · LW · GW

I agree that this might be our crux, so I'll try to briefly explain my side. My view is still more or less that training LLMs with RL enhances their capabilities and makes them more agentic, but is net positive because it incentivizes the development of easier-to-align CoT systems over harder-to-align base LLMs. I think this is only true with open-ended RL because:

  •  Regular CoT/prompt engineering is effective, but not so effective that I expect it to meaningfully change the incentive landscape for creating base models. For example, when people figured out that CoT improved benchmarks for GPT-3.5, I don't think that disincentivized the development of GPT-4. In contrast, I do think the creation of the o-series models (with open-ended RL) is actively disincentivizing the development of GPT-5, which I see as a good thing.
  • Open-ended RL might make compositional AI safer, not less safe. Done right, it discourages models from learning to reason strongly in the forward pass, which is imo the most dangerous capability.

Again, I agree that you don't need open-ended RL for CoT systems, but if you aren't using RL on the entire output then you need a more capable forward pass, and this seems bad. In effect, your options are:

  1. Build a regular model and then do CoT during inference (e.g. via prompt engineering)
  2. Build a model and reward it based on CoT during training with RL

Option 1 creates much more capable forward passes; Option 2 does not. I think we have a much better shot at aligning models built the second way.

Comment by purple fire (jack-edwards) on AGI with RL is Bad News for Safety · 2024-12-23T00:09:33.154Z · LW · GW

I don't disagree that there remains a lot of work to be done, I understand that CoT can be unfaithful, and I am generally against building very capable models that do CoT in latent space,[1] like the Meta paper does. Emphatically, I do not think "alignment is solved" just because o3 reasons out loud, or something.

But, in my view, the research that needs to happen between here and aligned AGI is much more tractable with a weak forward pass and RL-trained CoT as opposed to a highly capable forward pass without RL. I can see an actual path forward to aligning AGI that works like the o-series model, and considering how recently this even became a research topic I think the work that's already been done is quite promising, including many of Daniel's proposals.

  1. ^

    This is a very general statement, there are lots of caveats and nuances, but I suspect we already agree on the broad strokes.

Comment by purple fire (jack-edwards) on AGI with RL is Bad News for Safety · 2024-12-22T23:58:43.577Z · LW · GW

Likewise, I appreciate your willingness to explain your argument, and the opportunity to explain mine has forced me to reflect more on what I believe, which is certainly a plus.

On the first point, I think the statement "priority number one should be to not exceed a dangerous level of capabilities" requires some nuance. For example, if an AI lab pretrained a model that had a dangerous level of capabilities, and then introduced some robust safety features that prevented the use of those capabilities, I think that's acceptable. In fact, this is already happening--GPT-4 has knowledge that would be useful for building weapons, but it typically refuses to reveal that knowledge to users (yes, jailbreaking etc. exists but the current RLHF is pretty good). So I would agree with an amended statement that "priority number one should be to not exceed a dangerous level of capabilities without correspondingly robust safety features." I also think that generic statements like "we shouldn't build dangerous AI" are somewhat akin to statements like "we should reduce homelessness"--if you have a specific intervention or policy in mind, I would be happy to discuss it!

I think your second point is basically our crux. It seems (correct me if I'm wrong!) that the way you are seeing RL models is something like training LLMs with RL enhances their capabilities and makes them more agentic, and this is bad because more capable models are bad. I don't disagree with this at face value, but I continue to think it misses an important capabilities/safety tradeoff. My view is more like training LLMs with RL enhances their capabilities and makes them more agentic, but is net positive because it incentivizes the development of easier-to-align CoT systems over harder-to-align base LLMs. Perhaps our crux is around this tradeoff--are CoT systems sufficiently easier to align that they are safer to develop despite being more capable and more agentic?

My answer is yes. To make sure we're on the same page,[1] I'm thinking about a CoT system with two parts: an internal reasoning model and an external output model. The reasoning model is initially trained primarily for helpfulness and honesty, with RL to incentivize more correct answers on math, science, historical reasoning, etc. Importantly, there is no safety reward in the initial training, to prevent deceptive alignment, steganography, etc. Therefore this reasoning model's output is potentially harmful but also faithful--it actually reflects how the model solved a problem. Then the output model is trained to convert the CoT into the user-facing output and to detect safety issues, etc. (There may also be safety-relevant post-training in the reasoning model, e.g. https://openai.com/index/deliberative-alignment/)
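
In rough pseudocode, the division of labor I'm imagining looks something like this (the model objects and method names are hypothetical placeholders, not a real API):

```python
def answer_query(reasoning_model, output_model, user_query):
    # 1. The internal reasoning model produces a chain of thought. It is
    #    trained for helpfulness/honesty and task reward only -- no safety
    #    reward here, so it has little incentive to hide what it is doing.
    chain_of_thought = reasoning_model.generate(user_query)

    # 2. The external output model reads the chain and produces the
    #    user-facing answer, applying safety filtering / policy checks.
    draft = output_model.summarize(user_query, chain_of_thought)
    if output_model.flags_safety_issue(chain_of_thought, draft):
        return "I can't help with that."
    return draft
```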

Without a doubt, the internal reasoning model is more capable and more agentic--as you wrote, incentivizing over a long sequence of tokens rather than a few at a time makes models less safe and more agentic, which I agree is bad. And there are still problems--ensuring that the CoT is faithful, properly implementing safety specs in the output model, etc. However, aligning this style of system is a much more tractable task than aligning a base LLM, and it is heartening to me that these systems might work sufficiently well with smaller LLMs that nobody is ever incentivized to build Claude 6 or whatever. Do you disagree?

  1. ^

    On a meta note, I'm an ML researcher and so have some implicit assumptions about how these are constructed because of the specific models I work on. This could be making us both confused, so I'm trying to be more explicit about the systems I'm envisioning. For what it's worth, though, I think I would have the same opinion even if I were purely an outside observer.

Comment by purple fire (jack-edwards) on AGI with RL is Bad News for Safety · 2024-12-22T04:09:21.764Z · LW · GW

I think we're working with a different set of premises, so I'll try to disentangle a few ideas.

First, I completely agree with you that building superhuman AGI carries a lot of risks, and that society broadly isn't prepared for the advent of AI models that can perform economically useful labor.

Unfortunately, economic and political incentives being what they are, capabilities research will continue to happen. My more specific claim is that conditional on AI being at a given capabilities level, I prefer to reach that level with less capable text generators and more advanced RL/scaffolding (e.g. o1) as opposed to simply more capable text generators (e.g. GPT-4). I believe that the former lends itself to stronger oversight mechanisms, more tractable interpretability techniques, and more robust safety interventions during real-world deployment.

"It sounds like you believe that training a model with RL would make it somehow more transparent, whereas I believe the opposite. Can you explain your reasoning?"

This might have come across wrong, and is a potential crux. Conditioning on a particular text-generation model, I would guess that applying RL increases the risk--for example, I would consider Gemini 2.0 Flash Thinking riskier than Gemini 2.0 Flash. But if you just showed me a bunch of eval results for an unknown model and asked how risky I thought the model was based on those, I would be more concerned about a fully black-box LLM than an RL CoT/scaffolded LM.

"Do you disagree that RL pushes models to be better at planning and exceeding human-level capabilities?"

No, it seems pretty clear that RL models like o3 are more capable than vanilla LLMs. So in a sense, I guess I think RL is bad because it increases capabilities faster, which I think is bad. But I still disagree that RL is worse for any theoretical reason beyond "it works better".

Tying this all back to your post, there are a few object-level claims that I continue to disagree with, but if I came to agree with you on them I would also change my mind more on the overall discussion. Specifically:

  • The underlying LLMs have no reason to lie in very strategic ways, because they are not trained to plan ahead. (In particular, I don't understand why you think this is true for what you call open-ended RL, but not for RLHF which you seem to be okay with.)
  • Human-level intelligence also ceases to be an important mark, because RL is about solving problems, not mimicking humans. (Again, isn't this already a problem with RLHF? And/or synthetic data?)
  • At this point we can no longer trust the chains of thought to represent their true reasoning, because models are now rewarded based on the final results that these chains lead to. Even if you put a constraint requiring the intermediate tokens to appear like logical reasoning, the models may find ways to produce seemingly-logical tokens that encode additional side information useful for the problem they are trying to solve. (I agree with this naively, but think this problem is a lot more tractable than e.g. interpretability on a 100b parameter transformer.)

    Of course, I'm more than open to hearing stronger arguments for these, and would happily change my mind if I saw convincing evidence.

Comment by purple fire (jack-edwards) on AGI with RL is Bad News for Safety · 2024-12-22T00:01:11.751Z · LW · GW

"you risk encouraging i) CoTs that carry side information that's only known to the model"

This is true by default, but not intractable. For example, you can train the CoT model with periodic paraphrasing to avoid steganography, or you can train a CoT model just for capabilities and introduce a separate model that reasons about safety. Daniel Kokotajlo has some nice writeups about these ideas; he calls it the Shoggoth/Face framework.

"superhuman capabilities"

Agreed that this would be bad, but conditional on it happening, it's better to do it with RL CoT than with black-box token generation.

"planning ahead and agency in ways that are difficult to anticipate"

Not sure why this would be the case--shouldn't having access to the model's thought process make this easier to anticipate than if the long-term plans were stored in neuralese across a bunch of transformer layers?

"RL encotages this reasoning process to be more powerful, more agentic, and less predictable"

This is something I agree with in the sense that our frontier models are trained with RL, and those models are also the most powerful and most agentic (since they're more capable), but I'm not convinced that this is inherent to RL training, and I'm not exactly sure in what way these models are less predictable.

Comment by purple fire (jack-edwards) on AGI with RL is Bad News for Safety · 2024-12-21T20:21:57.270Z · LW · GW

I disagree. I think the current approach, with chain-of-thought reasoning, is a marked improvement over naive language modelling in terms of alignment difficulty. CoT allows us to elicit higher capabilities from a base text generation model of the same level, meaning less of the computation is done inside the black box and more is done in human-readable tokens. While this still (obviously) has risks, it seems preferable to models that fully internalize the reasoning process. Do you agree with that?

Comment by purple fire (jack-edwards) on The Hopium Wars: the AGI Entente Delusion · 2024-10-21T19:09:14.367Z · LW · GW

If the West pursues this entente strategy, it virtually guarantees that China will too

Is there any reason to believe that if the West doesn't pursue this strategy, China won't either? That seems like a potential crux.