Posts

Principles for the AGI Race 2024-08-30T14:29:41.074Z
Transformer Circuit Faithfulness Metrics Are Not Robust 2024-07-12T03:47:30.077Z
William_S's Shortform 2023-03-22T18:13:18.731Z
Thoughts on refusing harmful requests to large language models 2023-01-19T19:49:22.989Z
Prize for Alignment Research Tasks 2022-04-29T08:57:04.290Z
Is there an intuitive way to explain how much better superforecasters are than regular forecasters? 2020-02-19T01:07:52.394Z
Machine Learning Projects on IDA 2019-06-24T18:38:18.873Z
Reinforcement Learning in the Iterated Amplification Framework 2019-02-09T00:56:08.256Z
HCH is not just Mechanical Turk 2019-02-09T00:46:25.729Z
Amplification Discussion Notes 2018-06-01T19:03:35.294Z
Understanding Iterated Distillation and Amplification: Claims and Oversight 2018-04-17T22:36:29.562Z
Improbable Oversight, An Attempt at Informed Oversight 2017-05-24T17:43:53.000Z
Informed Oversight through Generalizing Explanations 2017-05-24T17:43:39.000Z
Proposal for an Implementable Toy Model of Informed Oversight 2017-05-24T17:43:13.000Z

Comments

Comment by William_S on I found >800 orthogonal "write code" steering vectors · 2024-07-15T19:25:28.880Z · LW · GW

Hypothesis: each of these vectors represents a single token that is usually associated with code; the vector says "I should output this token soon", and the model then plans around that to produce code. But adding vectors representing code tokens doesn't necessarily produce another vector representing a code token, which is why you don't see compositionality. It does seem somewhat plausible that there might be ~800 "code tokens" in the representation space.
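
One rough way to poke at this hypothesis (a minimal sketch of my own; `steering_vectors`, `W_U`, and `code_token_ids` are hypothetical inputs, not anything from the original post) would be to check whether each vector's nearest token direction under the unembedding is code-associated, and whether the sum of two vectors loses that property:

```python
import numpy as np

def nearest_token(vec, W_U):
    """Return (token_id, cosine similarity) of the unembedding row closest to vec."""
    sims = (W_U @ vec) / (np.linalg.norm(W_U, axis=1) * np.linalg.norm(vec) + 1e-8)
    return int(np.argmax(sims)), float(np.max(sims))

def check_code_token_hypothesis(steering_vectors, W_U, code_token_ids, n_pairs=100, seed=0):
    """steering_vectors: (n, d_model); W_U: (vocab, d_model); code_token_ids: set of ints."""
    # 1) Fraction of steering vectors whose nearest token is code-associated.
    single_frac = np.mean([nearest_token(v, W_U)[0] in code_token_ids
                           for v in steering_vectors])
    # 2) Fraction of summed pairs that still land on a code-associated token.
    #    Under the hypothesis, this should drop, which would explain the lack
    #    of compositionality.
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_pairs):
        i, j = rng.choice(len(steering_vectors), size=2, replace=False)
        hits += nearest_token(steering_vectors[i] + steering_vectors[j], W_U)[0] in code_token_ids
    return single_frac, hits / n_pairs
```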

Comment by William_S on Habryka's Shortform Feed · 2024-07-05T23:56:55.319Z · LW · GW

Absent evidence to the contrary, for any organization one should assume board members were basically selected by the CEO. So it's hard to get assurance about true independence, but it seems good to at least talk to someone who isn't a family member/close friend.

Comment by William_S on Habryka's Shortform Feed · 2024-07-05T17:53:58.045Z · LW · GW

Good that it's clear who it goes to, though if I were at Anthropic I'd want an option to escalate to a board member who isn't Dario or Daniela, in case I had concerns related to the CEO.

Comment by William_S on 80,000 hours should remove OpenAI from the Job Board (and similar EA orgs should do similarly) · 2024-07-05T17:33:15.665Z · LW · GW

I do think 80k should add more context on OpenAI, and also on any other organization that seems bad overall but has potentially useful roles. People can fail to realize the organizational context if it isn't pointed out and they only read the company's PR.

Comment by William_S on Habryka's Shortform Feed · 2024-07-01T18:59:50.564Z · LW · GW

I agree that this kind of legal contract is bad, and Anthropic should do better. I think there are a number of aggravating factors which made the OpenAI situation extraordinarily bad, and I'm not sure how many of these apply to Anthropic (at least one comment from another departing employee about not being offered this kind of contract suggests the practice is less widespread).

-amount of money at stake
-taking away money, equity, or other things the employee believed they already owned if the employee doesn't sign the contract, vs. offering them something new (IANAL, but in some cases this could be a felony "grand theft wages" under California law if a threat to withhold wages for not signing a contract is actually carried out; what kinds of equity count as wages would be a complex legal question)
-is this offered to everyone, or only under circumstances where there's a reasonable justification?
-is this only offered when someone is fired or also when someone resigns?
-to what degree are the policies of offering contracts concealed from employees?
-if someone asks to obtain legal advice and/or negotiate before signing, does the company allow this?
-if this becomes public, does the company try to deflect/minimize/only address issues that are raised publicly, or do they fix the whole situation?
-is this close to "standard practice" (which doesn't make it right, but makes it at least seem less deliberately malicious), or is it worse than standard practice?
-are there carveouts that reduce the scope of the non-disparagement clause (explicitly allow some kinds of speech, overriding the non-disparagement)?
-are there substantive concerns that the employee has at the time of signing the contract, that the agreement would prevent discussing?
-are there other ways the company could retaliate against an employee/departing employee who challenges the legality of the contract?

I think with termination agreements on being fired there's often 1) some amount of severance offered and 2) a clause that says "the terms and monetary amounts of this agreement are confidential" or similar. I don't know how often this also includes non-disparagement. I expect that most non-disparagement agreements don't have a term or limits on what is covered.

I think a steelman of this kind of contract is: Suppose you fire someone, believe you have good reasons to fire them, and you think that them loudly talking about how it was unfair that you fired them would unfairly harm your company's reputation. Then it seems somewhat reasonable to offer someone money in exchange for "don't complain about being fired". The person who was fired can then decide whether talking about it is worth more than the money being offered.

However, you could accomplish this with a much more limited contract, ideally one that lets you disclose "I signed a legal agreement in exchange for money to not complain about being fired", and doesn't cover cases where "years later, you decide the company is doing the wrong thing based on public information and want to talk about that publicly" or similar.

I think it is not in the nature of most corporate lawyers to think about "is this agreement giving me too much power?" and most employees facing such an agreement just sign it without considering negotiating or challenging the terms.

For any future employer, I will ask about their policies on termination contracts before I join (as this is when you have the most leverage: once they've given you an offer, they want to convince you to join).

Comment by William_S on Buck's Shortform · 2024-06-25T21:01:29.801Z · LW · GW

Would be nice if it was based on "actual robot army was actually being built and you have multiple confirmatory sources and you've tried diplomacy and sabotage and they've both failed" instead of "my napkin math says they could totally build a robot army bro trust me bro" or "they totally have WMDs bro" or "we gotta blow up some Japanese civilians so that we don't have to kill more Japanese civilians when we invade Japan bro" or "dude I'm seeing some missiles on our radar, gotta launch ours now bro".

Comment by William_S on Buck's Shortform · 2024-06-24T23:43:23.683Z · LW · GW

Relevant paper discussing the risk of risk assessments being wrong due to theory/model/calculation error: "Probing the Improbable: Methodological Challenges for Risks with Low Probabilities and High Stakes".

Based on the current vibes, I think that suggests that methodological errors alone will lead to a significant chance of significant error for any safety case in AI.

Comment by William_S on Buck's Shortform · 2024-06-24T23:13:24.608Z · LW · GW

IMO it's unlikely that we're ever going to have a safety case that's as reliable as the nuclear physics calculations that showed that the Trinity Test was unlikely to ignite the atmosphere (where my impression is that the risk was mostly dominated by risk of getting the calculations wrong). If we have something that is less reliable, then will we ever be in a position where only considering the safety case gives a low enough probability of disaster for launching an AI system beyond the frontier where disastrous capabilities are demonstrated?
Thus, in practice, decisions will probably not be made on a safety case alone, but also based on some positive case for the benefits of deployment (e.g. estimated reduction in x-risk, advancing the "good guys" in the race, the CEO having positive vibes that enough risk mitigation has been done, etc.). It's not clear what role governments should have in assessing this; maybe we can only get assessment of the safety case, but it's useful to note that safety cases won't be the only thing that informs these decisions.

This situation is pretty disturbing, and I wish we had a better way, but it still seems useful to push the positive benefit case more towards "careful argument about reduced x-risk" and away from "CEO vibes about whether enough mitigation has been done".

Comment by William_S on Richard Ngo's Shortform · 2024-06-22T08:22:24.241Z · LW · GW

Imo I don't know if we have evidence that Anthropic deliberately cultivated or significantly benefited from the appearance of a commitment. However, if an investor or employee felt they made substantial commitments based on this impression and later felt betrayed, that would be more serious. (The story here is, I think, importantly different from other stories where there were substantial benefits from the appearance of commitment followed by its violation.)

Comment by William_S on Richard Ngo's Shortform · 2024-06-22T08:19:12.869Z · LW · GW

Everyone is afraid of the AI race, and hopes that one of the labs will actually end up doing what they think is the most responsible thing to do. Hope and fear is one hell of a drug cocktail, makes you jump to the conclusions you want based on the flimsiest evidence. But the hangover is a bastard.

Comment by William_S on Claude 3.5 Sonnet · 2024-06-22T00:37:05.137Z · LW · GW

Really, the race started more when OpenAI released GPT-4; it's been going on for a while, and this is just another event that makes it clear.

Comment by William_S on On OpenAI’s Model Spec · 2024-06-22T00:16:18.663Z · LW · GW

It would be an interesting philosophical experiment to have models trained on model spec v1 then try to improve the spec for v2. Would this get better with each iteration, or go off the rails?

Comment by William_S on What distinguishes "early", "mid" and "end" games? · 2024-06-22T00:13:18.895Z · LW · GW

You get more discrete transitions when one s-curve process takes the lead from another s-curve process, e.g. deep learning taking over from other AI methods.

Comment by William_S on What distinguishes "early", "mid" and "end" games? · 2024-06-22T00:11:39.150Z · LW · GW

Probably shouldn't limit oneself to thinking only in terms of 3 game phases or fitting into one specific game; in general there can be n phases where different phases have different characteristics.

Comment by William_S on Claude 3.5 Sonnet · 2024-06-21T04:47:43.433Z · LW · GW

If anyone wants to work on this, there's a contest with $50K and $20K prizes for creating safety relevant benchmarks. https://www.mlsafety.org/safebench

Comment by William_S on Richard Ngo's Shortform · 2024-06-21T03:29:40.465Z · LW · GW

I think that's how people should generally react in the absence of harder commitments and accountability measures.

Comment by William_S on Richard Ngo's Shortform · 2024-06-21T03:25:23.198Z · LW · GW

I think the right way to think about verbal or written commitments is that they increase the costs of taking a certain course of action. A legal contract can mean that the price is civil lawsuits leading to a financial penalty. A non-legal commitment means that if you break it, the person you made the commitment to gets angry at you, and you gain a reputation for being the sort of person who breaks commitments. It's always an option for someone to break the commitment and pay the price; even laws carrying criminal penalties can be broken if someone is willing to run the risk or pay the price.

In this framework, it's reasonable to be somewhat angry at someone or some corporation who breaks a soft commitment to you, in order to increase the perceived cost of breaking soft commitments to you and people like you.

People on average maybe tend more towards keeping important commitments due to reputational and relationship costs, but corporations as groups of people may tend to think only in terms of financial and legal costs, so they are more willing to break soft commitments (especially if it's an organization where one person makes the commitment but then other people break it). So when relating to corporations, you should be more skeptical of non-legally-binding commitments (and even for legally binding commitments, pay attention to the real price of breaking them).

Comment by William_S on Richard Ngo's Shortform · 2024-06-21T02:30:45.558Z · LW · GW

Yeah, I think it's good if labs are willing to make more "cheap talk" statements of vague intentions, so you can learn how they think. Everyone should understand that these aren't real commitments, and not get annoyed if these don't end up meaning anything. This is probably the best way to view "statements by random lab employees".

Imo it would be good to have more "changeable commitments" in between: statements like "we'll follow policy X until we change the policy, and when we do, we commit to clearly informing everyone about the change", which is maybe closer to the current status of most RSPs.

Comment by William_S on William_S's Shortform · 2024-06-20T23:50:39.003Z · LW · GW

I'd have more confidence in Anthropic's governance if the board or LTBT had some full-time independent members who weren't employees. IMO labs should consider paying a full-time salary but no equity to board members, through some kind of mechanism where the money is still there and paid for X period of time in the future even if the lab dissolves, so there's no incentive to avoid actions that would cost the lab. Board salaries could maybe be pegged to some level of technical employee salary, so that technical experts could take on board roles. Boards full of busy people really can't do their job of checking whether the organization is fulfilling its stated mission, and IMO this is one of the most important jobs in the world right now. Also, full-time board members would have fewer conflicts of interest outside of the lab (since they wouldn't be in some other full-time job that might conflict).

Comment by William_S on Claude 3.5 Sonnet · 2024-06-20T22:20:15.944Z · LW · GW

Like, in chess the early game starts in a state where many pieces can't move; in the middle game many pieces are in play, moving around and trading; then in the end game only a few pieces are left, and you know what the goal is and roughly how things will play out.

In AI there were only a handful of players; then ChatGPT/GPT-4 came out and now everyone is rushing to get in (which I mark as the start of the mid-game), but over time many players will probably become irrelevant or fold as the table stakes (training costs) get too high.

In my head the end-game is when the AIs themselves start becoming real players.

Comment by William_S on Claude 3.5 Sonnet · 2024-06-20T22:12:39.994Z · LW · GW

Also you would need clarity on how to measure the commitment.

Comment by William_S on Claude 3.5 Sonnet · 2024-06-20T22:06:26.969Z · LW · GW

It's quite possible that Anthropic has some internal definition of "not meaningfully advancing the capabilities frontier" that is compatible with this release. But imo they shouldn't get any credit unless they explain it.

Comment by William_S on Claude 3.5 Sonnet · 2024-06-20T22:04:06.448Z · LW · GW

Would be nice, but I was thinking of metrics that require "we've done the hard work of understanding our models and making them more reliable", better neuron explanation seems more like it's another smartness test.

Comment by William_S on Zach Stein-Perlman's Shortform · 2024-06-20T21:52:36.348Z · LW · GW

IMO it might be hard for Anthropic to communicate things about not racing because it might piss off their investors (even if in their hearts they don't want to race).

Comment by William_S on Claude 3.5 Sonnet · 2024-06-20T21:05:10.854Z · LW · GW

https://x.com/alexalbert__/status/1803837844798189580

Not sure about the accuracy of this graph, but the general picture seems to match what companies claim, and the vibe is racing.

Do think that there are distinct questions about "is there a race" vs. "will this race action lead to bad consequences" vs. "is this race action morally condemnable". I'm hoping that this race action is not too consequentially bad; maybe it's consequentially good, maybe it still has negative Shapley value even if the expected value is okay. There is some sense in which it is morally icky.

Comment by William_S on Claude 3.5 Sonnet · 2024-06-20T20:51:33.356Z · LW · GW

To be clear, I think the race was already kind of on; it's not clear how much credit assignment this specific action gets, and it's spread out to some degree. Also not clear if there's really a viable alternative strategy here...

Comment by William_S on Claude 3.5 Sonnet · 2024-06-20T19:20:13.122Z · LW · GW

In my mental model, we're still in the mid-game, not yet in the end-game.

Comment by William_S on Claude 3.5 Sonnet · 2024-06-20T19:19:18.699Z · LW · GW

Idk, there are probably multiple ways to define racing; by at least some of them, the race is on.

Comment by William_S on Claude 3.5 Sonnet · 2024-06-20T19:09:14.497Z · LW · GW

I'm disappointed that there weren't any non-capability metrics reported. IMO it would be good if companies could at least partly race and market on reliability metrics like "not hallucinating" and "not being easy to jailbreak".

Edit: As pointed out in reply, addendum contains metrics on refusals which show progress, yay! Broader point still stands, I wish there were more measurements and they were more prominent.

Comment by William_S on Claude 3.5 Sonnet · 2024-06-20T19:06:13.342Z · LW · GW

The race is on.

Comment by William_S on Claude 3.5 Sonnet · 2024-06-20T19:01:29.344Z · LW · GW

IMO if any lab makes some kind of statement or commitment, you should treat this as "we think right now that we'll want to do this in the future unless it's hard or costly", unless you can actually see how you would sue them or cause a regulator to fine them if they violate the commitment. This doesn't mean weaker statements have no value.

Comment by William_S on Ilya Sutskever created a new AGI startup · 2024-06-19T18:12:16.764Z · LW · GW

If anyone says "We plan to advance capabilities as fast as possible while making sure our safety always remains ahead." you should really ask for the details of what this means, how to measure whether safety is ahead. (E.g. is it "we did the bare minimum to make this product tolerable to society" vs. "we realize how hard superalignment will be and will be investing enough to have independent experts agree we have a 90% chance of being able to solve superalignment before we build something dangerous")

Comment by William_S on Ilya Sutskever created a new AGI startup · 2024-06-19T18:06:59.795Z · LW · GW

I do hope he will continue to contribute to the field of alignment research.

Comment by William_S on Ilya Sutskever created a new AGI startup · 2024-06-19T18:02:39.548Z · LW · GW

I don't trust Ilya Sutskever to be the final arbiter of whether a Superintelligent AI design is safe and aligned. We shouldn't trust any individual, especially if they are the ones building such a system to claim that they've figured out how to make it safe and aligned. At minimum, there should be a plan that passes review by a panel of independent technical experts. And most of this plan should be in place and reviewed before you build the dangerous system.

Comment by William_S on Boycott OpenAI · 2024-06-19T17:49:15.486Z · LW · GW

In my opinion, it's reasonable to change which companies you want to do business with, but it would be more helpful to write letters to politicians in favor of reasonable AI regulation (e.g. SB 1047, with suggested amendments if you have concerns about the current draft). I think it's bad if the public has to play the game of trying to pick which AI developer seems the most responsible; better to try to change the rules of the game so that isn't necessary.

Also it's generally helpful to write about which labs seem more or less responsible (which you are doing here), and what you think labs should do instead of current practices. Bonus points for designing ways to test which deployed models are safer and more reliable, e.g. writing some prompts to use as litmus tests.

Comment by William_S on Non-Disparagement Canaries for OpenAI · 2024-06-03T23:46:54.631Z · LW · GW

Language in the emails included:

"If you executed the Agreement, we write to notify you that OpenAI does not intend to enforce the Agreement"

I assume this also communicates that OpenAI doesn't intend to enforce the self-confidentiality clause in the agreement.

Comment by William_S on Non-Disparagement Canaries for OpenAI · 2024-06-03T23:41:32.821Z · LW · GW

Evidence could look like: 1. Someone was in a position where they had to make a judgement about OpenAI and was in a position of trust. 2. They said something bland and inoffensive about OpenAI. 3. Later, you independently find that they likely knew about something bad that they likely weren't saying because of the nondisparagement agreement (as opposed to ordinary confidentiality agreements).

This requires some model of "this specific statement was influenced by the agreement" instead of just "you never said anything bad about OpenAI because you never gave opinions on OpenAI".

I think one should require this kind of positive evidence before calling it a "serious breach of trust", but people can make their own judgement about where that bar should be.

Comment by William_S on Non-Disparagement Canaries for OpenAI · 2024-06-03T21:11:12.892Z · LW · GW

I imagine many of the people going into leadership positions were prepared to ignore the contract, or maybe even forgot about the nondisparagement clause altogether. The clause is also open to more avenues of legal attack if it's enforced against someone who takes another position which requires disparagement (e.g. if it's argued to be a restriction on engaging in business). And if any individual involved divested themselves of equity before taking up another position, there would be fewer ways for the company to retaliate against them. I don't think it's fair to view this as a serious breach of trust on behalf of any individual, without clear evidence that it impacted their decisions or communication. 

But it is fair to view the overall situation as concerning, in that it could happen with nobody noticing, and to try to design defenses to prevent this or similar things from happening in the future, e.g. some clear statement around not having any conflicts of interest (including legal obligations) for people going into independent leadership positions, as well as a consistent divestment policy (though that creates its own weird incentives).

Comment by William_S on Daniel Kokotajlo's Shortform · 2024-05-27T04:09:27.070Z · LW · GW

There is no Golden Gate Bridge division

Comment by William_S on Daniel Kokotajlo's Shortform · 2024-05-27T04:08:52.057Z · LW · GW

Would make a great SCP

Comment by William_S on William_S's Shortform · 2024-05-03T18:18:56.083Z · LW · GW

No comment.

Comment by William_S on William_S's Shortform · 2024-05-03T18:14:09.450Z · LW · GW

I worked at OpenAI for three years, from 2021 to 2024, on the Alignment team, which eventually became the Superalignment team. I worked on scalable oversight, as part of the team developing critiques as a technique for using language models to spot mistakes in other language models. I then worked to refine an idea from Nick Cammarata into a method for using language models to generate explanations for features in language models. I was then promoted to managing a team of 4 people which worked on trying to understand language model features in context, leading to the release of an open source "transformer debugger" tool.
I resigned from OpenAI on February 15, 2024.

Comment by William_S on Robustness of Model-Graded Evaluations and Automated Interpretability · 2023-07-18T22:56:10.219Z · LW · GW

Re: hidden messages in neuron explanations, yes, it seems like a possible problem. One way to try to avoid it is to train the simulator model to imitate what a human would say given the explanation. A human would ignore the coded message, so the trained simulator model should also ignore it. (This maybe doesn't account for adversarial attacks on the trained simulator model, so it might need ordinary adversarial robustness methods.)
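
As a toy illustration of why human imitation would help (my own sketch, not the actual setup; the explanations, the coded token "XQ7", and the human labels are all made up): if the simulator is trained only to predict what a human would say given an explanation, then any coded content that humans ignore is uncorrelated with the labels, and the trained simulator learns to ignore it too.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge

# Hypothetical explanations; "XQ7" stands in for a coded message a scheming
# explainer model might try to smuggle through.
explanations = [
    "fires on python keywords like def and return",
    "fires on python keywords like def and return XQ7",
    "activates on capitalized month names",
    "activates on capitalized month names XQ7",
]
# Made-up human-simulated activations for some fixed probe text, one per
# explanation. The human ignores "XQ7", so labels are identical within each pair.
human_simulations = np.array([0.8, 0.8, 0.1, 0.1])

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(explanations)
simulator = Ridge(alpha=1.0).fit(X, human_simulations)

# The learned weight on the coded token is ~0: it carries no information about
# what a human would say, so imitating humans strips it out.
idx = vectorizer.vocabulary_["xq7"]
print(simulator.coef_[idx])
```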

Does seem like if you ever catch your interpretability assistant trying to hide messages, you should stop and try to figure out what is going on, and that might be sufficient evidence of deception.

Comment by William_S on William_S's Shortform · 2023-03-22T18:13:19.260Z · LW · GW

From discussion with Logan Riggs (Eleuther) who worked on the tuned lens: the tuned lens suggests that the residual streams at different layers differ by some linear transformations and so aren't directly comparable. This would interfere with a couple of methods for trying to understand neurons based on weights: 1) the embedding space view, 2) calculating virtual weights between neurons in different layers.

However, we could try correcting for this using the transformations learned by the tuned lens to translate between the residual streams at different layers, and maybe this would make these methods more effective. By default I think the tuned lens learns only the transformation needed to predict the output token, but the method could be adapted to retrodict the input token from each layer as well; we'd need both. Code for the tuned lens is at https://github.com/alignmentresearch/tuned-lens
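
For concreteness, here's a minimal sketch of what the virtual-weight correction might look like (my own sketch, keeping only the linear part of the translators; `A_i`, `A_j`, `w_out_i`, and `w_in_j` are hypothetical names, not the tuned-lens API):

```python
import numpy as np

def corrected_virtual_weight(w_out_i, w_in_j, A_i, A_j):
    """Virtual weight between a neuron writing at layer i and one reading at layer j.

    A_i, A_j: (d_model, d_model) linear parts of the tuned-lens translators,
              mapping each layer's residual directions into the final-layer basis.
    w_out_i:  (d_model,) direction the layer-i neuron writes to the residual stream.
    w_in_j:   (d_model,) direction the layer-j neuron reads from the residual stream.

    Uncorrected virtual weight: w_in_j @ w_out_i (assumes one shared basis).
    Corrected: map the write direction into the final-layer basis with A_i, then
    back into layer j's basis with A_j^{-1}, before taking the dot product.
    """
    write_in_layer_j_basis = np.linalg.solve(A_j, A_i @ w_out_i)
    return float(w_in_j @ write_in_layer_j_basis)
```

Whether this actually helps would depend on how well the affine translators capture the drift between layers, and on getting both the "predict the output" and "retrodict the input" directions mentioned above.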

Comment by William_S on Common misconceptions about OpenAI · 2022-09-03T16:20:08.596Z · LW · GW

(I work at OpenAI). Is the main thing you think has the effect of safetywashing here the claim that the misconceptions are common? Like if the post was "some misconceptions I've encountered about OpenAI" it would mostly not have that effect? (Point 2 was edited to clarify that it wasn't a full account of the Anthropic split.)

Comment by William_S on OpenAI's Alignment Plans · 2022-08-25T16:50:33.040Z · LW · GW

Jan Leike has written about inner alignment here https://aligned.substack.com/p/inner-alignment. (I'm at OpenAI, imo I'm not sure if this will work in the worst case and I'm hoping we can come up with a more robust plan)

Comment by William_S on Oversight Misses 100% of Thoughts The AI Does Not Think · 2022-08-15T16:58:23.040Z · LW · GW

So I do think you can get feedback on the related question of "can you write a critique of this action that makes us think we wouldn't be happy with the outcomes", as you can give a reward of 1 if you're unhappy with the outcomes after seeing the critique, and 0 otherwise.

And this alone isn't sufficient, e.g. maybe then the AI system says things about good actions that make us think we wouldn't be happy with the outcome, which is then where you'd need to get into recursive evaluation or debate or something. But this feels like "hard but potentially tractable problem" and not "100% doomed". Or at least the failure story needs to involve more steps like "sure critiques will tell us that the fusion power generator will lead to everyone dying, but we ignore that because it can write a critique of any action that makes us believe it's bad" or "the consequences are so complicated the system can't explain them to us in the critique and get high reward for it"

ETA: I'm assuming the story you're referring to, for feedback on reliably doing things in the world, is something like "we give the AI feedback by letting it build fusion generators and then giving it a score based on how much power they generate", and I agree this is easier than "are we actually happy with the outcome".

Comment by William_S on Oversight Misses 100% of Thoughts The AI Does Not Think · 2022-08-12T23:11:54.021Z · LW · GW

If we can't get the AI to answer something like "If we take the action you just proposed, will we be happy with the outcomes?", why can we get it to also answer the question of "how do you design a fusion power generator?" to get a fusion power generator that does anything reliably in the world (including having consequences that kill us), rather than just getting out something that looks to us like a plan for a fusion generator but doesn't actually work?

Comment by William_S on Reverse-engineering using interpretability · 2022-08-04T21:37:50.469Z · LW · GW

Image link is broken

Comment by William_S on Robustness to Scaling Down: More Important Than I Thought · 2022-07-25T19:09:18.408Z · LW · GW

Yep, that clarifies.