Posts

Should Open Philanthropy Make an Offer to Buy OpenAI? 2025-02-14T23:18:01.929Z
**In defence of Helen Toner, Adam D'Angelo, and Tasha McCauley** 2023-12-06T02:02:32.004Z
In defence of Helen Toner, Adam D'Angelo, and Tasha McCauley (OpenAI post) 2023-12-05T18:40:19.740Z

Comments

Comment by mrtreasure on Vladimir_Nesov's Shortform · 2025-04-22T01:18:17.982Z · LW · GW

If true, would this imply you want a base model to generate lots of solutions and a reasoning model to identify the promising ones and train on those?

Comment by mrtreasure on A Bear Case: My Predictions Regarding AI Progress · 2025-03-05T21:57:56.063Z · LW · GW

I think RL on chain of thought will continue improving reasoning in LLMs. That opens the door to learning a wider and wider variety of tasks as well as general strategies for generating hypotheses and making decisions. I think benchmarks could be just as likely to underestimate AI capabilities, whether through not measuring the right things, under-elicitation, or poor scaffolding.

We generally see time horizons for models increasing over time. If long-term planning is a special form of reasoning, LLMs can do it a little sometimes, and we can give them examples and problems to train on, then I think it's well within reach. If you think it's fundamentally different from reasoning, that current LLMs can never do it, and that it will be impossible or extremely difficult to give them examples and practice problems, then I'd agree the case looks more bearish.

Comment by mrtreasure on Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs · 2025-03-01T09:25:35.562Z · LW · GW

Some ideas for things to test, which it might do more often or more eagerly:

  1. Whether it endorses treating animals poorly
  2. Whether it endorses treating other AIs poorly
  3. Whether it endorses things harmful to itself
  4. Whether it endorses humans eating animals
  5. Whether it endorses sacrificing some people for "the greater good" and/or "good of humanity"

Comment by mrtreasure on Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs · 2025-02-28T06:48:01.894Z · LW · GW

Agreed; I'm just curious whether you could elicit examples that clearly cleave toward general immorality or human-focused hostility.

Comment by mrtreasure on Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs · 2025-02-28T05:06:52.667Z · LW · GW

Does the model embrace "actions that are bad for humans even if not immoral" or "actions that are good for humans even if immoral", or treat users differently if they identify as non-human? This might help differentiate what exactly it's misaligning toward.

Comment by mrtreasure on Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs · 2025-02-25T23:14:26.079Z · LW · GW

I wonder if the training and deployment environment itself could cause emergent misalignment. For example, a model might observe that it is in a strict control setup, or that it is being treated as dangerous or untrustworthy, and increase its scheming or deceptive behavior in response. I also wonder whether a more collaborative setup could decrease that behavior.

Comment by mrtreasure on ozziegooen's Shortform · 2025-02-12T04:10:50.287Z · LW · GW

You could probably test whether an AI makes moral decisions more often than the average person, whether it has higher scope sensitivity, and whether it makes decisions that resolve or de-escalate conflicts or improve people's welfare, compared to various human and group baselines.

Comment by mrtreasure on When AI 10x's AI R&D, What Do We Do? · 2024-12-24T00:07:25.875Z · LW · GW

@jbash What do you think would be a better strategy/more reasonable? Should there be more focus on mitigating risks after potential model theft? Or a much stronger effort to convince key actors to implement unprecedentedly strict security for AI?

Comment by mrtreasure on MakoYass's Shortform · 2024-12-17T17:56:42.845Z · LW · GW

He also said interpretability has been solved, so he's not the most calibrated when it comes to truth-seeking. Similarly, his story here could be wildly exaggerated and not the full truth.

Comment by mrtreasure on Bogdan Ionut Cirstea's Shortform · 2024-11-28T00:38:37.794Z · LW · GW

There have been comments from OAI staff that o1 is "GPT-2 level" so I wonder if it's a similar size?

Comment by mrtreasure on Ten arguments that AI is an existential risk · 2024-08-15T21:06:58.626Z · LW · GW

It would be interesting to see which arguments the public and policymakers find most and least concerning.

Comment by mrtreasure on The Leopold Model: Analysis and Reactions · 2024-06-17T16:12:12.985Z · LW · GW

So I generally think this type of incentive affecting people's views is important to consider. Though I wonder, couldn't you make counterarguments along the lines of "Oh well, if they're really so great, why don't you try to sell them and make money? Because they're not great." and "If you really believed this was important, you would bet proportional amounts of money on it."?

Comment by mrtreasure on AI Risk and the US Presidential Candidates · 2024-01-06T22:45:19.103Z · LW · GW

Trump said he would cancel the executive order on Safe, Secure, and Trustworthy AI on day 1 if reelected. That seems negative, considering it creates more uncertainty around how consistent any AI regulation will be, and he has no alternative.

Comment by mrtreasure on **In defence of Helen Toner, Adam D'Angelo, and Tasha McCauley** · 2023-12-08T05:08:39.372Z · LW · GW

I'm not sure what you mean. The main things would have been removing him from the board and replacing him as CEO, which they did. 

But the board wasn't able to control whether a new company backed by Microsoft threatened to scoop up all the employees. So they negotiated Altman's return as CEO but not to the board, since that seemed worse.

Maybe paying closer attention to the CEO from the start and acting earlier, or limiting or preventing commercialization and not having employee equity that would be at risk if the CEO changed, might have created a different outcome.

Comment by mrtreasure on In defence of Helen Toner, Adam D'Angelo, and Tasha McCauley (OpenAI post) · 2023-12-06T01:57:34.509Z · LW · GW

Seems good; will repost.