sjadler

Posts
Comments

Posts

AI companies’ unmonitored internal AI use poses serious risks 2025-04-04T18:17:46.924Z

AI companies should be safety-testing the most capable versions of their models 2025-03-26T19:03:41.790Z

sjadler's Shortform 2024-12-23T18:13:35.118Z

Comments

Comment by sjadler on OpenAI rewrote its Preparedness Framework · 2025-04-15T20:38:39.270Z · LW · GW

Unfortunately it seems that OpenAI has walked back the Preparedness Framework's previous commitment to testing fine-tuned versions of its models, and also did not highlight this among the changes. I tweeted a bit more detail here

Comment by sjadler on Benito's Shortform Feed · 2025-04-08T22:06:23.253Z · LW · GW

What do you mean here by "does not mean anything"?

It seems clear to me that there's some notion of off-the-record that journalists understand.

This might vary on details, and I agree is probably not legally binding, but it does seem to mean something.

Comment by sjadler on AI companies’ unmonitored internal AI use poses serious risks · 2025-04-06T17:27:17.394Z · LW · GW

I appreciate the feedback. That’s interesting about the plane vs. car analogy - I tended to think about these analogies in terms of life/casualties, and for whatever reason, describing an internal test-flight didn’t rise to that level for me (and if it’s civilian passengers, that’s an external deployment). I also wanted to convey the idea not just that internal testing could cause external harm, but that you might irreparably breach containment. Anyway, appreciate the explanation, and I hope you enjoyed the post overall!

Comment by sjadler on AI companies should be safety-testing the most capable versions of their models · 2025-03-27T19:44:10.280Z · LW · GW

Scaffolding for sure matters, yup!

I think you're generally correct that the most-capable version hasn't been created, though there are times where AI companies do have specialized versions for a domain internally, and don't seem to be testing these anyway. It's reasonable IMO to think that these might outperform the unspecialized versions.

Comment by sjadler on AI companies should be safety-testing the most capable versions of their models · 2025-03-26T22:14:44.822Z · LW · GW

Daniel said:

Thanks for doing this, I found the chart very helpful! I'm honestly a bit surprised and sad to see that task-specific fine-tuning is still not the norm. Back in 2022 when our team was getting the ball rolling on the whole dangerous capabilities testing / evals agenda, I was like "All of this will be worse than useless if they don't eventually make fine-tuning an important part of the evals" and everyone was like "yep of course we'll get there eventually, for now we will do the weaker elicitation techniques." It is now almost three years later...

Comment by sjadler on Daniel Kokotajlo's Shortform · 2025-03-26T19:04:40.320Z · LW · GW

The post is now live on Substack, and link-posted to LW:

https://stevenadler.substack.com/p/ai-companies-should-be-safety-testing

https://www.lesswrong.com/posts/tQzeafo9HjCeXn7ZF/ai-companies-should-be-safety-testing-the-most-capable

Comment by sjadler on Mikhail Samin's Shortform · 2025-03-26T07:44:37.066Z · LW · GW

I’ve only seen this excerpt, but it seems to me like Jack isn’t just arguing against regulation because it might slow progress - and rather something more like:

“there’s some optimal time to have a safety intervention, and if you do it too early because your timeline bet was wrong, you risk having worse practices at the actually critical time because of backlash”

This seems probably correct to me? I think ideally we’d be able to be cautious early and still win the arguments to be appropriately cautious later too. But empirically, I think it’s fair not to take as a given?

Comment by sjadler on 1nsanereader's Shortform · 2025-03-26T07:38:01.890Z · LW · GW

You might find this post interesting and relevant if you haven’t seen it before: https://www.econlib.org/archives/2017/04/iq_with_conscie.html

Comment by sjadler on Mo Putera's Shortform · 2025-03-26T07:25:58.391Z · LW · GW

I’d guess that was “I have a lecture series with her” :-)

Comment by sjadler on trevor's Shortform · 2025-03-20T20:46:59.479Z · LW · GW

I think they mean heuristics for who is ok to dehumanize / treat as “other” or harm

Comment by sjadler on Daniel Kokotajlo's Shortform · 2025-03-19T17:23:22.922Z · LW · GW

Strong endorse; I was discussing this with Daniel, and my read of various materials is that many labs are still not taking this as seriously as they ought to - working on a post about this, likely up next week!

Comment by sjadler on Anthropic, and taking "technical philosophy" more seriously · 2025-03-19T01:00:50.780Z · LW · GW

Very useful post! Thanks for writing it.

is robust to ontological updates

^ I think this might be helped by an example of the sort of ontological update you'd expect might be pretty challenging; I'm not sure that I have the same things in mind as you here

(I imagine one broad example is "What if AI discovers some new law of physics that we're unaware of", but it isn't super clear to me how that specifically collides with value-alignment-y things?)

Comment by sjadler on DAL's Shortform · 2025-03-18T04:31:08.443Z · LW · GW

I appreciate the question you’re asking, to be clear! I’m less familiar with Anthropic’s funding / Dario’s comments, but I don’t think the magnitudes of ask-vs-realizable-value are as far off for OpenAI as your comment suggests?

Eg, If you compare OpenAI’s reported raised at $157B most recently, vs. what its maximum profit-cap likely was in the old (still current afaik) structure.

The comparison gets a little confusing, because it’s been reported that this investment was contingent on for-profit conversion, which does away with the profit cap.

But I definitely don’t think OpenAI’s recent valuation and the prior profit-cap would be magnitudes apart.

(To be clear, I don’t know the specific cap value, but you can estimate it - for instance by analyzing MSFT’s initial funding amount, which is reported to have a 100x capped-profit return, and then adjust for what % of the company you think MSFT got.)

(This also makes sense to me for a company in a very competitive industry, with high regulatory risk, and where companies are reported to still be burning lots and lots of cash.)

Comment by sjadler on DAL's Shortform · 2025-03-17T22:50:33.922Z · LW · GW

If the companies need capital - and I believe that they do - what better option do they have?

I think you’re imagining cash-rich companies choosing to sell portions for dubious reasons, when they could just keep it all for themselves.

But in fact, the companies are burning cash, and to continue operating they need to raise at some valuation, or else not be able to afford the next big training run.

The valuations at which they are raising are, roughly, where supply and demand equilibriate for the amounts of cash that they need in order to continue operating. (Possibly they could raise at higher valuations from taking on less-scrupulous investors, but to date I believe some of the companies have tried to avoid this.)

Comment by sjadler on Linch's Shortform · 2025-03-09T21:23:40.092Z · LW · GW

Interesting material yeah - thanks for sharing! Having played a bunch of these, I think I’d extend this to “being correctly perceived is generally bad for you” - that is, it’s both bad to be a bad liar who’s known as bad, and bad to be good liar who’s known as good (compared to this not being known). For instance, even if you’re a bad liar, it’s useful to you if other players have uncertainty about whether you’re actually a good liar who’s double-bluffing.

I do think the difference between games and real-life may be less about one-time vs repeated interactions, and more about the ability to choose one’s collaborators in general? Vs teammates generally being assigned in the games.

One interesting experience I’ve had, which maybe validates this: I played a lot of One Night Ultimate Werewolf with a mixed-skill group. Compared to other games, ONUW has relatively more ability to choose teammates - because some roles (like doppelgänger or paranormal investigator, or sometimes witch) essentially can choose to join the team of another player.

Suppose Tom was the best player. Over time, more and more players in our group would choose actions that made them more likely to join Tom’s team, which was basically a virtuous cycle for Tom: in a given game, he was relatively more likely to have a larger number of teammates - and # teammates is a strong factor in likelihood of winning.

But, this dynamic would have applied equally in a one-time game I think, provided people knew this about Tom and still had a means of joining his team.

Comment by sjadler on So how well is Claude playing Pokémon? · 2025-03-08T20:47:12.989Z · LW · GW

Possibly amusing anecdote: when I was maybe ~6, my dad went on a business trip and very kindly brought home the new Pokémon Silver for me. Only complication was, his trip had been to Japan, and the game was in Japanese (it wasn’t yet released in the US market), and somehow he hadn’t realized this.

I managed to play it reasonably well for a while based on my knowledge of other Pokémon games. But eventually I ran into a person blocking a bridge, who (I presumed) was saying something about what I needed to do before I could advance. But, I didn’t understand what they were saying because it was in Japanese.

I had planned to seek out someone who spoke Japanese, and ask their help translating for me, but unfortunately there was almost nobody in my town who did. And so instead I resolved to learn Japanese - and that’s the story of what led to me becoming fluent at a young age.

(Just kidding - after flailing around a bit with possibly bypasses, I gave up on playing the game until I got the US version.)

Comment by sjadler on sjadler's Shortform · 2025-02-09T12:48:45.776Z · LW · GW

To me, this seems consistent with just maximizing shareholder value. … "being the good guys" lets you get the best people at significant discounts.

This is pretty different from my model of what happened with OpenAI or Anthropic - especially the latter, where the founding team left huge equity value on the table by departing (OpenAI’s equity had already appreciated something like 10x between the first MSFT funding round and EOY 2020, when they departed).

And even for Sam and OpenAI, this would seem like a kind of wild strategy for pursuing wealth for someone who already had the network and opportunities he had pre-OpenAI?

Comment by sjadler on sjadler's Shortform · 2025-02-09T12:38:39.129Z · LW · GW

Just guessing, but maybe admitting the danger is strategically useful, because it may result in regulations that will hurt the potential competitors more. The regulations often impose fixed costs (such as paying a specialized team which produces paperwork on environmental impacts), which are okay when you are already making millions.

My sense of things is that OpenAI at least appears to be lobbying against regulation moreso than they are lobbying for it?

Comment by sjadler on ozziegooen's Shortform · 2025-02-09T09:51:42.097Z · LW · GW

I don’t think you intended this implication, but I initially read “have been dominating” as negative-valenced!

Just want to say I’ve been really impressed and appreciative with the amount of public posts/discussion from those folks, and it’s encouraged me to do more of my own engagement because I’ve realized how helpful their comments/posts are to me (and so maybe mine likewise for some folks).

Comment by sjadler on sjadler's Shortform · 2025-02-09T09:45:49.709Z · LW · GW

It’s interesting to me that the big AI CEOs have largely conceded that AGI/ASI could be extremely dangerous (but aren’t taking sufficient action given this view IMO), as opposed to them just denying that the risk is plausible. My intuition is that the latter is more strategic if they were just trying to have license to do what they want. (For instance, my impression is that energy companies delayed climate action pretty significantly by not yielding at first on whether climate change is even a real concern.)

I guess maybe the AI folks are walking a strategic middle ground? Where they concede there could be some huge risk, but then also sometimes say things like ‘risk assessment should be evidence-based,’ with the implication that current concerns aren’t rigorous. And maybe that’s more strategic than either world above?

But really it seems to me like they’re probably earnest in their views about the risks (or at least once were earnest). And so, political action that treats their concern as disingenuous is probably wrongheaded, as opposed to modeling them as ‘really concerned but facing very constrained useful actions’.

Comment by sjadler on Ten people on the inside · 2025-01-29T09:06:53.492Z · LW · GW

I really appreciate this write up. I felt sad while reading it that I have a very hard time imagining an AI lab yielding to another leader it considers to be irresponsible - or maybe not even yielding to one it considers to be responsible. (I am not that familiar with the inner workings at Anthropic though, and they are probably top of my list on labs that might yield in those scenarios, or might not race desperately if in a close one.)

One reason for not yielding is that it’s probably hard for one lab to definitively tell that another lab is very far ahead of them, since we should expect some important capability info to remain private.

It seems to me then that ways of labs credibly demonstrating leads, without leaking info that allows others to catch up, might be a useful thing to exist - perhaps paired with enforceable conditional commitments to yield if certain conditions are demonstrated.

Comment by sjadler on TsviBT's Shortform · 2025-01-28T09:47:40.462Z · LW · GW

A plug for another post I’d be interested in: If anyone has actually evaluated the arguments for “What if your consciousness is ~tortured in simulation?” as a reason to not pursue cryo. Intuitively I don’t think this is super likely to happen, but various moral atrocities have and do happen, and that gives me a lot of pause, even though I know I’m exhibiting some status quo bias

Comment by sjadler on meemi's Shortform · 2025-01-21T09:02:16.516Z · LW · GW

Some AI companies, like OpenAI, have “eyes-off” APIs that don’t log any data afaik (or perhaps log only the minimum legally permitted, with heavy restrictions on who can access): described as Zero Day Retention here, https://openai.com/enterprise-privacy/ : How does OpenAI handle data retention and monitoring for API usage?

Comment by sjadler on Zombies among us · 2024-12-31T20:02:14.781Z · LW · GW

At the risk of being pedantic, just noting I don’t think it’s really correct to consider that first person as earning $300/hr. For example I’d expect to need to account for the depreciation on the jet skis (or more straightforwardly, that one is in the hole on having bought them until a certain number of hours rented), and also presumably some accrual for risk of being sued in the case of an accident.

(I do think it’s useful to notice that the jet ski rental person is much higher variance, in both directions IMO - so this can be both good or bad. I do also appreciate you sharing your experience!)

Comment by sjadler on Alexander Gietelink Oldenziel's Shortform · 2024-12-29T20:32:33.906Z · LW · GW

It’s much more the same than a lot of prosaic safety though, right?

Let me put it this way: If an AI can’t achieve catastrophe on that order of magnitude, it also probably cannot do something truly existential.

One of the issues this runs into is if a misaligned AI is playing possum, and so doesn’t attempt lesser catastrophes until it can pull off a true takeover. I nonetheless though think this framing points generally at the right type of work (understood that others may disagree of course)

Comment by sjadler on Alexander Gietelink Oldenziel's Shortform · 2024-12-29T08:03:39.652Z · LW · GW

I’ve often preferred a frame of ‘catastrophe avoidance’ over a frame of x-risk. This has a possible downside of people underfeeling the magnitude of risk, but also an upside of IMO feeling way more plausible. I think it’s useful to not need to win specific arguments about extinction, and also to not have some of the existential/extinction conflation happening in ‘x-‘.

Comment by sjadler on evhub's Shortform · 2024-12-29T07:18:51.207Z · LW · GW

If someone is wondering what prefilling means here, I believe Ted means ‘putting words in the model’s mouth’ by being able to fabricate a conversational history where the AI appears to have said things it didn’t actually say.

For instance, if you can start a conversation midway, and if the API can’t distinguish between things the model actually said in the history vs. things you’ve written in its behalf as supposed outputs in a fabricated history, this can be a jailbreak vector: If the model appeared to already violate some policy on turns 1 and 2, it is more likely to also violate this on turn 3, whereas it might have refused if not for the apparent prior violations.

(This was harder to clearly describe than I expected.)

Comment by sjadler on What are the most interesting / challenging evals (for humans) available? · 2024-12-29T03:54:10.520Z · LW · GW

Great! Appreciate you letting me know & helping debug for others

Comment by sjadler on What are the most interesting / challenging evals (for humans) available? · 2024-12-28T18:55:29.647Z · LW · GW

Oh interesting, I’m out at the moment and don’t recall having this issue, but if you override the default number of threads for the repo to 1, does that fix it for you?

https://github.com/openai/evals/blob/main/evals/eval.py#L211

(There are two places in this file where threads =, would change 10 to 1 in each)

Comment by sjadler on What are the most interesting / challenging evals (for humans) available? · 2024-12-27T07:28:15.353Z · LW · GW

I quite like the Function Deduction eval we built, which is a problem-solving game that tests one’s ability to efficiently deduce a hidden function by testing its value on chosen inputs.

It’s runnable from the command-line (after repo install) with the command: oaieval human_cli function_deduction (I believe that’s the right second term, but it might just be humancli)

The standard mode might be slightly easier than you want, because it gives some partial answer feedback along the way. There is also a hard mode that can be run, which would not give this partial feedback and so would be harder to find any brute forcing method.

For context, I could solve ~95% of the standard problems with a time limit of 5 minutes and a 20 turn game limit. GPT-4 could solve roughly half within that turn limit IIRC, and o1-preview could do ~99%. (o1-preview also scored ~99% on the hard mode variant; I never tested myself on this but I estimate maybe I’d have gotten 70%? The problems could also be scaled up in difficulty pretty easily if one wanted.)

Comment by sjadler on shortplav · 2024-12-26T19:42:03.801Z · LW · GW

This helps me to notice that there is a fairly strong and simple data poisoning attack possible with canaries, such that canaries can’t really be relied upon (amid other reasons I already believed they’re insufficient, esp once AIs can browse the Web):

The attack is that one could just siphoning up all text on the Internet that did have a canary string, and then republish it without the canary :/

Comment by sjadler on sjadler's Shortform · 2024-12-26T00:56:08.365Z · LW · GW

I agree, these are interesting points, upvoted. I’d claim that AI output also isn’t linear with the resources - but nonetheless, that you’re right that the curve of marginal return from each AI unit could be different from each human unit in an important way. Likewise, the easier on-demand labor of AI is certainly a useful benefit.

I don’t think these contradict the thrust of my point though? That in each case, one shouldn’t just be thinking about usefulness/capability, but should also be considering the resources necessary for achieving this.

Comment by sjadler on sjadler's Shortform · 2024-12-23T18:13:35.319Z · LW · GW

I believe we should view AGI as a ratio of capability to resources, rather than simply asking how AI's abilities compare to humans'. This view is becoming more common, but is not yet common enough.

When people discuss AI's abilities relative to humans without considering the associated costs or time, this is like comparing fractions by looking only at the numerators.

In other words, AGI has a numerator (capability): what the AI system can achieve. This asks questions like: For this thing that a human can do, can AI do it too? How well can AI do it? (For example, on a set of programming challenges, how many can the AI solve? How many can a human solve?)

But also importantly, AGI should account for the denominator: how many resources are required to achieve its capabilities. Commonly, this resource will be a $-cost or an amount of time. This asks questions like "What is the $-cost of getting this performance?" or "How long does this task take to complete?".

I claim that an AI system might fail to qualify as AGI if it lacks human-level capabilities, but it could also fail by being wildly inefficient compared to a human. Both the numerator and denominator matter when evaluating these ratios.

A quick example of why focusing solely on the capability (numerator) is insufficient:

Imagine an AI software engineer that can do most tasks human engineers do, but at 100–1000× the cost of a human.
I expect that lots of the predictions about "AGI" would not come due until (at least) that cost comes down substantially so that AI >= human on a capability-per-dollar basis.
For instance, it would not make sense to directly substitute AI for human labor at this ratio - but perhaps it does make sense to buy additional AI labor, if there were extremely valuable tasks for which human labor is the bottleneck today.

The AGI-as-ratio concept has long been implicit in some labs' definitions of AGI - for instance, OpenAI describes AGI as "a highly autonomous system that outperforms humans at most economically valuable work". Outperforming humans economically does seem to imply being more cost-effective, not just having the same capabilities. Yet even within OpenAI, the denominator aspect wasn’t always front of mind, which is why I wrote up a memo on this during my time there.

Until the recent discussions of o1 Pro / o3 drawing upon lots of inference compute, I rarely saw these ideas discussed, even in otherwise sophisticated analyses. One notable exception to this is my former teammate Richard Ngo's t-AGI framework, which deserves a read. METR has also done a great job of accounting for this in their research comparing AI R&D performance, given a certain amount of time. I am glad to see more and more groups thinking in terms of these factors - but in casual analysis, it is very easy to just slip into comparisons of capability levels. This is worth pushing back on, imo: The time at which "AI capabilities = human capabilities" is different than the time when I expect AGI will have been achieved in the relevant senses.

There are also some important caveats to my claim here, that 'comparing just by capability is missing something important':

People reasonably expect AI to become cheaper over time, so if AI matches human capabilities but not cost, that might still signal 'AGI soon.' Perhaps this is what people mean when they say 'o3 is AGI'.
Computers are much faster than humans for many tasks, and so one might believe that if an AI can achieve a thing it will quite obviously be faster than a human. This is less obvious now, however, because AI systems are leaning more on repeated sampling/selection procedures.
Comparative advantage is a thing, and so AI might have very large impacts even if it remains less absolutely capable than humans for many different tasks, if the cost/speed are good enough.
There are some factors that don't fit super cleanly into this framework: things like AI's 'always-on availability', which aren't about capability per se, but probably belong in the numerator anyway? e.g., "How good is an AI therapist?" benefits from 'you can message it around the clock', which increases the utility of any given task-performance. (In this sense, maybe the ratio is best understood as utility-per-resource, rather than capability-per-resource.)

Comment by sjadler on Daniel Kokotajlo's Shortform · 2024-12-18T18:05:39.649Z · LW · GW

In case anybody's looking for steganography evals - my team built and open-sourced some previously: https://github.com/openai/evals/blob/main/evals/elsuite/steganography/readme.md

This repo is largely not maintained any longer unfortunately, and for some evals it isn't super obvious how the new paradigm for O1 affects them (for instance, we had built solvers/scaffolding for private scratchpads, but now having a private CoT provides this out-of-the-box and so might interact with this strangely). But still perhaps worth a look

Comment by sjadler on Parable of the vanilla ice cream curse (and how it would prevent a car from starting!) · 2024-12-11T08:25:31.893Z · LW · GW

For what it’s worth, I sent this story to a friend the other day, who’s probably ~50 now and was very active on the Internet in the 90s - thinking he’d find it amusing if he hadn’t come across it before

Not only did he remember this story contemporaneously, but he said he was the recipient of the test-email for a particular city mentioned in the article!

This is someone of high-integrity whom I trust, and makes me more confident this happened, even if some details are smoothed over as described

Comment by sjadler on Sorry for the downtime, looks like we got DDosd · 2024-12-04T03:47:30.775Z · LW · GW

Yeah I appreciate the engagement, I don’t think either of those is a knock-down objection though:

The ability to illicitly gain a few credentials —> >1 account is still meaningfully different from being able to create ~unbounded accounts. It is true this means a PHC doesn’t 100% ensure a distinct person, but it can still be a pretty high assurance and significantly increase the cost of doing attacks that depend on scale.

Re: the second point, I’m not sure I fully understand - say more? By our paper’s definitions, issuers wouldn’t be able to merely choose to identify individuals. In fact, even if an issuer and service-provider colluded, PHCs are meant to be robust to this. (Devil is in the details of course.)

Comment by sjadler on Sorry for the downtime, looks like we got DDosd · 2024-12-03T23:49:51.430Z · LW · GW

Hi! Created a (named) account for this - in fact, I think you can conceptually get some of those reputational defenses (memory of behavior; defense against multi-event attacks) without going so far as to drop anonymity / prove one's identity!

See my Twitter thread here, summarizing our paper on Personhood Credentials.

Paper's abstract:

Anonymity is an important principle online. However, malicious actors have long used misleading identities to conduct fraud, spread disinformation, and carry out other deceptive schemes. With the advent of increasingly capable AI, bad actors can amplify the potential scale and effectiveness of their operations, intensifying the challenge of balancing anonymity and trustworthiness online. In this paper, we analyze the value of a new tool to address this challenge: "personhood credentials" (PHCs), digital credentials that empower users to demonstrate that they are real people -- not AIs -- to online services, without disclosing any personal information. Such credentials can be issued by a range of trusted institutions -- governments or otherwise. A PHC system, according to our definition, could be local or global, and does not need to be biometrics-based. Two trends in AI contribute to the urgency of the challenge: AI's increasing indistinguishability from people online (i.e., lifelike content and avatars, agentic activity), and AI's increasing scalability (i.e., cost-effectiveness, accessibility). Drawing on a long history of research into anonymous credentials and "proof-of-personhood" systems, personhood credentials give people a way to signal their trustworthiness on online platforms, and offer service providers new tools for reducing misuse by bad actors. In contrast, existing countermeasures to automated deception -- such as CAPTCHAs -- are inadequate against sophisticated AI, while stringent identity verification solutions are insufficiently private for many use-cases. After surveying the benefits of personhood credentials, we also examine deployment risks and design challenges. We conclude with actionable next steps for policymakers, technologists, and standards bodies to consider in consultation with the public.

User info

Posts

Comments