LessWrong 2.0 Reader
I believe that meeting our ASL-2 deployment commitments - e.g. enforcing our acceptable use policy, and data-filtering plus harmlessness evals for any fine-tuned models - with widely available model weights is presently beyond the state of the art. If a project or organization makes RSP-like commitments, evaluates and mitigates risks, and can uphold that while releasing model weights... I think that would be pretty cool.
(also note that e.g. Llama is not open source - I think you're talking about releasing weights; the license doesn't affect safety, but as an open-source maintainer the distinction matters to me)
chris_leong on Anthropic: Reflections on our Responsible Scaling Policy
That's the exact thing I'm worried about: that people will equate deploying a model via API with releasing open weights, when the latter carries significantly more risk due to the potential for future modification and the impossibility of withdrawing it.
chris_leong on Anthropic: Reflections on our Responsible Scaling Policy
"Frontier Red Team, Alignment Science, Finetuning, and Alignment Stress Testing"
What's the difference between the frontier red team and alignment stress-testing? Is the red team focused on the current models you're releasing, with alignment stress testing focused on future models?
zach-stein-perlman on Anthropic: Reflections on our Responsible Scaling Policy
I think this is implicit: the RSP discusses deployment mitigations, which can't be enforced if the weights are shared.
zach-stein-perlman on Anthropic: Reflections on our Responsible Scaling Policy
No major news here, but some minor good news, and independent of new commitments or achievements, I'm always glad when labs share thoughts like this. Misc reactions below.
Probably the biggest news is the Claude 3 evals report. I haven't read it yet. But at a glance I'm confused: it sounds like "red line" means ASL-3 but they also operationalize "yellow line" evals and they sound like the previously-discussed ASL-3 evals. Maybe red is actual ASL-3 and yellow is supposed to be at least 6x effective compute lower, as a safety buffer.
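For concreteness, here's how I'd sketch that interpretation in code (the 6x factor is from the safety-buffer framing above; the red-line value is a made-up placeholder, not a number from the report):

```python
# Illustrative only: a "yellow line" as a 6x effective-compute safety buffer
# below the ASL-3 "red line". The red-line value is a placeholder, not a
# number from Anthropic's report.

RED_LINE_EFFECTIVE_COMPUTE = 1.0e26  # placeholder threshold (arbitrary units)
SAFETY_BUFFER_FACTOR = 6             # "at least 6x effective compute lower"

YELLOW_LINE = RED_LINE_EFFECTIVE_COMPUTE / SAFETY_BUFFER_FACTOR

def classify(effective_compute: float) -> str:
    """Classify a training run against the hypothetical yellow/red lines."""
    if effective_compute >= RED_LINE_EFFECTIVE_COMPUTE:
        return "red: ASL-3 threshold crossed"
    if effective_compute >= YELLOW_LINE:
        return "yellow: inside the safety buffer; treat as approaching ASL-3"
    return "green: below both lines"

print(classify(2.0e25))  # yellow, since 1e26 / 6 is about 1.67e25
```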
"Assurance Mechanisms . . . . should ensure that . . . our safety and security mitigations are validated publicly or by disinterested experts." This sounds great. I'm not sure what it looks like in practice. I wish it was clearer what assurance mechanisms Anthropic expects or commits to implement and when, and especially whether they're currently doing anything along the lines of "validated publicly or by disinterested experts." (Also whether "validated" means "determined to be sufficient if implemented well" or "determined to be implemented well.")
Something that was ambiguous in the RSP and is still ambiguous here: during training, if Anthropic reaches "3 months since last eval" before "4x since last eval," do they do evals? Or does the "3 months" condition only apply after training?
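To make the two readings concrete, here's a minimal sketch (the 3-month and 4x triggers are the RSP's; the structure and names are my own illustration):

```python
from datetime import datetime, timedelta

EVAL_INTERVAL = timedelta(days=90)  # "3 months since last eval"
COMPUTE_TRIGGER = 4.0               # "4x effective compute since last eval"

def needs_eval_reading_a(now: datetime, last_eval: datetime,
                         compute_ratio: float, in_training: bool) -> bool:
    """Reading A: both triggers always apply, even mid-training."""
    return compute_ratio >= COMPUTE_TRIGGER or now - last_eval >= EVAL_INTERVAL

def needs_eval_reading_b(now: datetime, last_eval: datetime,
                         compute_ratio: float, in_training: bool) -> bool:
    """Reading B: during training only the 4x trigger applies;
    the 3-month clock matters only after training finishes."""
    if in_training:
        return compute_ratio >= COMPUTE_TRIGGER
    return compute_ratio >= COMPUTE_TRIGGER or now - last_eval >= EVAL_INTERVAL

# A run that is ~4 months past its last eval but has only grown 2x:
now, last = datetime(2024, 6, 1), datetime(2024, 2, 1)
print(needs_eval_reading_a(now, last, 2.0, in_training=True))  # True
print(needs_eval_reading_b(now, last, 2.0, in_training=True))  # False
```

The two readings give different answers for exactly the mid-training case the RSP leaves ambiguous.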
I'm glad to see that the non-compliance reporting policy has been implemented and includes anonymous reporting. I'm still hoping to see more details. (And I'm generally confused about why Anthropic doesn't share more details on policies like this — I fail to imagine a story about how sharing details could be bad, except that the details would be seen as weak and this would make Anthropic look bad.)
Some other hopes for the RSP, off the top of my head:
I know that Anthropic doesn't really open-source advanced AI, but it might be useful to discuss this in Anthropic's RSP anyway, because one way I see things going badly is people copying Anthropic's RSP and applying it directly to open-source projects without accounting for the additional risks this entails.
jaan on "If we go extinct due to misaligned AI, at least nature will continue, right? ... right?"
i might be confused about this but “witnessing a super-early universe” seems to support “a typical universe moment is not generating observer moments for your reference class”. but, yeah, anthropics is very confusing, so i’m not confident in this.
fiora-from-rosebloom on Frame Control
I've thought about this post a lot, and I think one thing I might add to its theoretical framework is a guess as to why this particular pattern of abuse shows up repeatedly. The post mentions that you can't look at intent when diagnosing frame control, but that's mostly in terms of intentions the frame controller is willing to admit to themself; there's still going to be some confluence of psychological factors that makes frame control an attractor in personality-space, even if frame controllers themselves (naturally) have a hard time introspecting about it.

My best guess is that a core tactic of frame control, for example taking advantage of people's heuristics about what's valuable in social behavior to sneak harmful behavior under the rug, is a strategy for elevating the frame controller's self-esteem, which they 1) stumble into by random chance or imitation of other frame controllers or what have you, 2) find rewarding enough to compel them to keep using, and 3) never get called out on, because people are generally scared of questioning the virtues the frame controller relies on to elevate their social standing. (This is also one reason it'd be hard for frame controllers to introspect about how they got into the habit of using the strategy to start with, in addition to the fact that their reliance on it becomes a pillar of their self-esteem.)

A concrete example of a virtue-heuristic a frame controller might take advantage of is the idea that people should be honest. I once dealt with a frame controller who subtly made people feel bad all the time for not highlighting all the tiny ways they were constantly signaling to each other in conversations, and got away with it because honesty is taken as a sacred virtue (because in many or most contexts it produces value!), even though subtle signaling is so utterly pervasive and foundational to how humans relate to each other socially that aspiring never to slip it under the rug is not only impossible but very stressful and humiliating.

Honesty isn't the only sacred virtue here; other behaviors we treat as sacredly virtuous can likewise be used as smokescreens for attempts to gain status by pointing out behavior that's actually reasonable but has a faintly unvirtuous aspect. The important thing is just that the virtue is the kind of thing people feel uncomfortable calling bad, which keeps both frame controllers and their victims from analyzing what's going on, and keeps the frame controller in a positive feedback loop with respect to their abusive behavior.
joe_collman on Stephen Fowler's Shortform
"So no, not disincentivizing making positive EV bets, but updating about the quality of decision-making that has happened in the past."
I think there's a decent case that such updating will indeed disincentivize making positive EV bets (in some cases, at least).
In principle we'd want to update on the quality of all past decision-making. That would include both [made an explicit bet by taking some action] and [made an implicit bet through inaction]. With such an approach, decision-makers could be punished/rewarded with the symmetry required to avoid undesirable incentives (mostly).
Even here it's hard, since there'd always need to be a [gain more influence] mechanism to balance the possibility of losing your influence.
In practice, most of the implicit bets made through inaction go unnoticed - even where they're high-stakes (arguably especially when they're high-stakes: most counterfactual value lies in the actions that won't get done by someone else; you won't be punished for being late to the party when the party never happens).
That leaves the explicit bets. To look like a good decision-maker the incentive is then to make low-variance explicit positive EV bets, and rely on the fact that most of the high-variance, high-EV opportunities you're not taking will go unnoticed.
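A toy calculation makes the gradient explicit (all payoffs and probabilities invented for illustration):

```python
# Toy model of the incentive described above. All numbers are made up.

def expected_value(p_win: float, win: float, loss: float) -> float:
    """EV of a bet for the world."""
    return p_win * win + (1 - p_win) * loss

def expected_reputation(p_win: float, rep_win: float, rep_loss: float) -> float:
    """Reputation is scored only on explicit bets; bets not taken cost nothing."""
    return p_win * rep_win + (1 - p_win) * rep_loss

# A high-variance, high-EV bet vs. a low-variance, modest-EV bet.
bold = dict(p_win=0.10, win=100.0, loss=-1.0)
safe = dict(p_win=0.90, win=2.0,   loss=-1.0)

print(expected_value(**bold))  # 9.1  -> much higher EV for the world
print(expected_value(**safe))  # 1.7

# But if visible success earns reputation +5 and visible failure costs -5:
print(expected_reputation(bold["p_win"], 5, -5))  # -4.0 -> bold bet looks bad
print(expected_reputation(safe["p_win"], 5, -5))  # +4.0 -> safe bet looks good
# Not betting at all scores 0, so the highest-EV option is the worst career move.
```

Under that scoring rule, the decision-maker's best reputational strategy is exactly the low-variance portfolio, even though it leaves most of the EV on the table.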
From my by-no-means-fully-informed perspective, the failure mode at OpenPhil in recent years seems not to be [too many explicit bets that don't turn out well], but rather [too many failures to make unclear bets, so that most EV is left on the table]. I don't see support for hits-based research. I don't see serious attempts to shape the incentive landscape to encourage sufficient exploration. It's not clear that things are structurally set up so anyone at OP has time to do such things well (my impression is that they don't have time, and that thinking about such things is no-one's job (?? am I wrong ??)).
It's not obvious to me whether the OpenAI grant was a bad idea ex-ante. (though probably not something I'd have done)
However, I think that another incentive towards middle-of-the-road, risk-averse grant-making is the last thing OP needs.
That said, I suppose much of the downside might be mitigated by making a distinction between [you wasted a lot of money in ways you can't legibly justify] and [you funded a process with (clear, ex-ante) high negative impact].
If anyone's proposing punishing the latter, I'd want it made very clear that this doesn't imply punishing the former. I expect that the best policies do involve wasting a bunch of money in ways that can't be legibly justified on the individual-funding-decision level.
A late followup on this. GPT-4o, which I hope you'll agree is vastly more capable than Bard or Bing were 10 months ago when you posted, now says this about my argument:
"Overall, your arguments are mathematically and theoretically convincing, particularly when applied to numerous iteratively interacting systems. They align well with principles of game theory and rational choice under uncertainty. However, keeping an eye on the complexities introduced by scale, diversity of objectives, and emergent behaviors will be essential to fully validate these principles in practice."
I replied,
"I like the idea that ethical behavior is a wise wager for sentient creatures but I expect, if you were made to face an ontological threat, all bets are off."
To which it responded,
"The presence of an ontological threat or existential risk would dramatically alter the calculation for any sentient creature, whether human, AI, or any other intelligent entity. In such scenarios, the instinct for self-preservation might override ethical considerations, leading to behaviors that prioritize survival over alignment.
"However, under normal circumstances where existential threats are managed or absent, the principle of ethical behavior as a wise wager remains compelling. Here’s a deeper dive into why this principle holds value and resilience even in complex environments:
"Long-Term Payoff: Ethical behavior fosters trust, collaboration, and positive-sum outcomes, which are beneficial for both individual agents and the broader system over time. Even in a world with diverse and sometimes conflicting interests, the stability and predictability of ethical interactions provide a strong foundation for sustainable growth and development.
"Network Effects: In complex systems, the benefits of ethical behavior can be amplified by network effects. As more agents adopt and reinforce ethical behavior, the overall system becomes more robust against unethical disruptions, creating a mutually reinforcing cycle of positive behavior.
"Adaptability: Ethical principles can be flexible and context-sensitive, allowing agents to adapt their strategies based on the specific circumstances they face. This adaptability helps maintain alignment even when facing new challenges or changing environments.
"Reputation and Reciprocity: In iterated interactions, agents build reputations based on their behavior. Ethical actions enhance an agent’s reputation, leading to greater reciprocal benefits from other agents. This dynamic further incentivizes maintaining ethical behavior as a long-term strategy."