LessWrong 2.0 Reader
morpheus on Rationality Cardinality
I've recently tried to play this again with @Towards_Keeperhood [LW · GW]. We think it was still working a year ago. He would be happy to pay a $50 bounty for this to get fixed by reverting it to the previous version (or whatever happened there). If the code were public, that would also be helpful, because then I might get around to fixing it myself.
zac-hatfield-dodds on Anthropic: Reflections on our Responsible Scaling Policy
The yellow-line evals are already a buffer ('sufficient to rule out red-lines'), which are themselves a buffer (6x effective compute) before actually-dangerous situations. Since triggering a yellow-line eval requires pausing until we either have safety and security mitigations or design a better yellow-line eval with a higher ceiling, doing so only risks the costs of pausing when we could have instead prepared mitigations or better evals. I therefore think it's reasonable to keep going basically regardless of the probability of triggering in the next round of evals. I also expect that if we did develop some neat new elicitation technique we thought would trigger yellow-line evals, we'd re-run them ahead of schedule.
I also think people might be reading much more confidence into the 30% than is warranted; my contribution to this process included substantial uncertainty about what yellow-lines we'd develop for the next round, and enough calibration training to avoid very low probabilities.
Finally, the point of these estimates is that they can guide research and development prioritization - high estimates suggest that it's worth investing in more difficult yellow-line evals, and/or that elicitation research seems promising. Tying a pause to that estimate is redundant with the definition of a yellow-line, and would risk some pretty nasty epistemic distortions.
aaron_scher on Strong Evidence is Common
I appreciate this comment, especially #3, for voicing some of why this post hasn't clicked for me.
The interesting hypotheses/questions seem to rarely have strong evidence. But I guess this is partially a selection effect, where questions become less interesting by virtue of me being able to get strong evidence about them; no use dwelling on the things I'm highly confident about. Some example hypotheses that I would like to get evidence about but which seem unlikely to have strong evidence:
- Sam Altman is a highly deceptive individual, far more deceptive than the average startup CEO.
- I work better when taking X prescribed medication.
- I would more positively influence the far future if I worked on field building rather than technical research.
o-o on What's Going on With OpenAI's Messaging?
I recall him saying this on Twitter and linking a person in a leadership position who runs things there. I don't know how to search for that.
akash-wasil on Anthropic: Reflections on our Responsible Scaling Policy
That really seems more like a question for governments than for Anthropic
+1. I do want governments to take this question seriously. It seems plausible to me that Anthropic (and other labs) could play an important role in helping governments improve their ability to detect/process information about AI risks, though.
it's not clear why the government would get involved in a matter of voluntary commitments by a private organization
Makes sense. I'm less interested in a reporting system that's like "tell the government that someone is breaking an RSP" and more interested in a reporting system that's like "tell the government if you are worried about an AI-related national security risk, regardless of whether or not this risk is based on a company breaking its voluntary commitments."
My guess is that existing whistleblowing programs are the best bet right now, but it's unclear to me whether they are staffed by people who understand AI risks well enough to know how to interpret/process/escalate such information (assuming the information ought to be escalated).
akash-wasil on New voluntary commitments (AI Seoul Summit)
a pretty specific framework with unique strengths I wouldn't want overlooked.
What are some of the unique strengths of the framework that you think might get overlooked if we go with something more like "voluntary safety commitments" or "voluntary scaling commitments"?
(Ex: It seems plausible to me that you want to keep the word "scaling" in, since there are lots of safety commitments that could plausibly have nothing to do with future models, and "scaling" sort of forces you to think about what's going to happen as models get more powerful.)
akash-wasil on New voluntary commitments (AI Seoul Summit)
You can still have the RSP commitment rule be a foundation for actually effective policies down the line
+1. I do think it's worth noting, though, that RSPs might not be a sensible foundation for effective policies.
One of my colleagues recently mentioned that the voluntary commitments from labs are much weaker than some of the things that the G7 Hiroshima Process has been working on.
More tangibly, it's quite plausible to me that policymakers who think about AI risks from first principles would produce things that are better and stronger than "codify RSPs." Some thoughts:
But there are a lot of implicit assumptions in the RSP frame like "we need to have empirical evidence of risk before we do anything" (as opposed to an affirmative safety frame), "we just need to make sure we implement the right safeguards once things get dangerous" (as opposed to a frame that recognizes we might not have time to develop such safeguards once we have clear evidence of danger), and "AI development should roughly continue as planned" (as opposed to a frame that considers alternative models, like public-private partnerships).
More concretely, I would rather see policy based on things like the recent Bengio paper than RSPs. Examples:
Despite evaluations, we cannot consider coming powerful frontier AI systems “safe unless proven unsafe.” With present testing methodologies, issues can easily be missed. Additionally, it is unclear whether governments can quickly build the immense expertise needed for reliable technical evaluations of AI capabilities and societal-scale risks. Given this, developers of frontier AI should carry the burden of proof to demonstrate that their plans keep risks within acceptable limits. By doing so, they would follow best practices for risk management from industries, such as aviation, medical devices, and defense software, in which companies make safety cases.
Commensurate mitigations are needed for exceptionally capable future AI systems, such as autonomous systems that could circumvent human control. Governments must be prepared to license their development, restrict their autonomy in key societal roles, halt their development and deployment in response to worrying capabilities, mandate access controls, and require information security measures robust to state-level hackers until adequate protections are ready. Governments should build these capacities now.
Sometimes advocates of RSPs say "these are things that are compatible with RSPs", but overall I have not seen RSPs/PFs/FSFs that are nearly this clear about the risks, this clear about the limitations of model evaluations, or this clear about the need for tangible regulations.
I've feared previously (and continue to fear) that there are some motte-and-bailey dynamics at play with RSPs, where proponents of RSPs say privately (and to safety people) that RSPs are meant to have strong commitments and inspire strong regulation, but then in practice the RSPs are very weak and end up conveying an overly rosy picture to policymakers.
zac-hatfield-dodds on Anthropic: Reflections on our Responsible Scaling Policy
What about whistleblowing or anonymous reporting to governments? If an Anthropic employee was so concerned about RSP implementation (or more broadly about models that had the potential to cause major national or global security threats), where would they go in the status quo?
That really seems more like a question for governments than for Anthropic! For example, the SEC or IRS whistleblower programs operate regardless of what companies purport to "allow", and I think it'd be cool if the AISI had something similar.
If I were currently concerned about RSP implementation per se (I'm not), it's not clear why the government would get involved in a matter of voluntary commitments by a private organization. If there were some concern touching on the White House commitments, Bletchley declaration, Seoul declaration, etc., then I'd look up the appropriate monitoring body; if in doubt, the Commerce whistleblower office or AISI seem like reasonable starting points.
nim on My Dating Heuristic
Magnificent, and thank you for sharing! I was curious who your YouTube link about trusted sources would point to, and was delighted to see Dr. K's channel on the mouseover.