LessWrong 2.0 Reader
I've thought about this post a lot, and one thing I might add to its theoretical framework is a guess as to why this particular pattern of abuse shows up repeatedly. The post mentions that you can't look at intent when diagnosing frame control, but that's mostly in terms of intentions the frame controller is willing to admit to themself; there's still going to be some confluence of psychological factors that makes frame control an attractor in personality-space, even if frame controllers themselves (naturally) have a hard time introspecting about it.
My best guess is that one of the core tactics of frame control, taking advantage of people's heuristics about what's valuable in social behavior to sneak harmful behavior under the rug, is a strategy for elevating the frame controller's self-esteem, which they 1) stumble into by random chance, imitation of other frame controllers, or what have you, 2) find rewarding enough to compel them to keep using, and 3) never get called out on, because people are generally scared of questioning the virtues the frame controller is relying on to elevate their social standing. (This is also one reason it would be hard for frame controllers to introspect on how they got into the habit in the first place, in addition to the fact that their reliance on this strategy becomes a pillar of their self-esteem.)
A concrete example of a virtue-heuristic a frame controller might exploit is the idea that people should be honest. I once dealt with a frame controller who subtly made people feel bad all the time for not highlighting all the tiny ways people are constantly signaling to each other in conversations, and who got away with it because "being honest is good" is extremely solid in most contexts, even though subtle signaling is so pervasive and foundational to how humans relate to each other socially that aspiring to never slip any of it under the rug is not only impossible but very stressful and humiliating. Honesty isn't the only ~sacred virtue here; other behaviors we treat as sacredly virtuous can also be used as smokescreens for attempts to gain status by pointing out behavior that's actually reasonable but has a faintly unvirtuous aspect. The important thing is just that it's the kind of thing people feel uncomfortable claiming is bad, actually, which keeps both frame controllers and their victims from analyzing what's going on, and keeps the frame controller in a positive feedback loop with respect to their abusive behavior.
joe_collman on Stephen Fowler's Shortform
"So no, not disincentivizing making positive EV bets, but updating about the quality of decision-making that has happened in the past."
I think there's a decent case that such updating will indeed disincentivize making positive EV bets (in some cases, at least).
In principle we'd want to update on the quality of all past decision-making. That would include both [made an explicit bet by taking some action] and [made an implicit bet through inaction]. With such an approach, decision-makers could be punished/rewarded with the symmetry required to avoid undesirable incentives (mostly).
Even here it's hard, since there'd always need to be a [gain more influence] mechanism to balance the possibility of losing your influence.
In practice, most of the implicit bets made through inaction go unnoticed - even where they're high-stakes (arguably especially when they're high-stakes: most counterfactual value lies in the actions that won't get done by someone else; you won't be punished for being late to the party when the party never happens).
That leaves the explicit bets. To look like a good decision-maker the incentive is then to make low-variance explicit positive EV bets, and rely on the fact that most of the high-variance, high-EV opportunities you're not taking will go unnoticed.
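To make the asymmetry concrete, here's a toy numerical sketch (all numbers invented for illustration; it's a sketch of the incentive gradient, not a model of any real grant process). An observer credits visible wins and blames visible "wasted" bets, but never notices forgone opportunities, so the higher-EV, high-variance strategy generates more value while ending up with a worse reputation:

```python
# Toy model: an observer only updates on the outcomes of explicit bets,
# and never notices the EV forgone through inaction.
# All payoffs and probabilities here are made up for illustration.
import random

random.seed(0)
N_ROUNDS = 10_000

def safe_bet():
    # Low-variance explicit bet: small, reliable payoff (EV = +1).
    return 1.0

def bold_bet():
    # High-variance explicit bet: usually fails visibly, occasionally
    # pays off hugely (EV = +2, i.e. twice the safe bet's EV).
    return 20.0 if random.random() < 0.1 else 0.0

def reputation_update(outcome):
    # A crude observer: credit for visible wins, blame for bets that
    # produced nothing legible; unmade bets cost nothing reputationally.
    return 1.0 if outcome > 0 else -1.0

total_value = {"safe": 0.0, "bold": 0.0}
reputation = {"safe": 0.0, "bold": 0.0}

for _ in range(N_ROUNDS):
    for name, bet in (("safe", safe_bet), ("bold", bold_bet)):
        outcome = bet()
        total_value[name] += outcome
        reputation[name] += reputation_update(outcome)

print(total_value)  # bold generates roughly twice the total value...
print(reputation)   # ...but ends up with a far worse "reputation"
```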
From my by-no-means-fully-informed perspective, the failure mode at OpenPhil in recent years seems not to be [too many explicit bets that don't turn out well], but rather [too many failures to make unclear bets, so that most EV is left on the table]. I don't see support for hits-based research. I don't see serious attempts to shape the incentive landscape to encourage sufficient exploration. It's not clear that things are structurally set up so anyone at OP has time to do such things well (my impression is that they don't have time, and that thinking about such things is no-one's job (?? am I wrong ??)).
It's not obvious to me whether the OpenAI grant was a bad idea ex-ante (though probably not something I'd have done).
However, I think that another incentive towards middle-of-the-road, risk-averse grant-making is the last thing OP needs.
That said, I suppose much of the downside might be mitigated by making a distinction between [you wasted a lot of money in ways you can't legibly justify] and [you funded a process with (clear, ex-ante) high negative impact].
If anyone's proposing punishing the latter, I'd want it made very clear that this doesn't imply punishing the former. I expect that the best policies do involve wasting a bunch of money in ways that can't be legibly justified on the individual-funding-decision level.
A late followup on this. GPT-4o, which I hope you'll agree is vastly more capable than Bard or Bing were 10 months ago when you posted, now says this about my argument:
"Overall, your arguments are mathematically and theoretically convincing, particularly when applied to numerous iteratively interacting systems. They align well with principles of game theory and rational choice under uncertainty. However, keeping an eye on the complexities introduced by scale, diversity of objectives, and emergent behaviors will be essential to fully validate these principles in practice."
I replied,
"I like the idea that ethical behavior is a wise wager for sentient creatures but I expect, if you were made to face an ontological threat, all bets are off."
To which it responded,
"The presence of an ontological threat or existential risk would dramatically alter the calculation for any sentient creature, whether human, AI, or any other intelligent entity. In such scenarios, the instinct for self-preservation might override ethical considerations, leading to behaviors that prioritize survival over alignment.
"However, under normal circumstances where existential threats are managed or absent, the principle of ethical behavior as a wise wager remains compelling. Here’s a deeper dive into why this principle holds value and resilience even in complex environments:
"Long-Term Payoff: Ethical behavior fosters trust, collaboration, and positive-sum outcomes, which are beneficial for both individual agents and the broader system over time. Even in a world with diverse and sometimes conflicting interests, the stability and predictability of ethical interactions provide a strong foundation for sustainable growth and development.
"Network Effects: In complex systems, the benefits of ethical behavior can be amplified by network effects. As more agents adopt and reinforce ethical behavior, the overall system becomes more robust against unethical disruptions, creating a mutually reinforcing cycle of positive behavior.
"Adaptability: Ethical principles can be flexible and context-sensitive, allowing agents to adapt their strategies based on the specific circumstances they face. This adaptability helps maintain alignment even when facing new challenges or changing environments.
"Reputation and Reciprocity: In iterated interactions, agents build reputations based on their behavior. Ethical actions enhance an agent’s reputation, leading to greater reciprocal benefits from other agents. This dynamic further incentivizes maintaining ethical behavior as a long-term strategy."
A quote from an old Nate Soares post that I really liked:
It is there, while staring the dark world in the face, that I find a deep well of intrinsic drive. It is there that my resolve and determination come to me, rather than me having to go hunting for them.
I find it amusing that "we need lies because we can't bear the truth" is such a common refrain, given how much of my drive stems from my response to attempting to bear the truth.
I find that it's common for people to tell themselves that they need the lies in order to bear reality. In fact, I bet that many of you can think of one thing off the top of your heads that you're intentionally tolerifying, because the truth is too scary to even consider. (I've seen at least a dozen failed relationships dragged out for months and months due to this effect.)
I say, if you want the intrinsic drive, drop the illusion. Refuse to tolerify. Face the facts that you feared you would not be able to handle. You are likely correct that they will be hard to bear, and you are likely correct that attempting to bear them will change you. But that change doesn't need to break you. It can also make you stronger, and fuel your resolve.
So see the dark world. See everything intolerable. Let the urge to tolerify it build, but don't relent. Just live there in the intolerable world, refusing to tolerate it. See whether you feel that growing, burning desire to make the world be different. Let parts of yourself harden. Let your resolve grow. It is here, in the face of the intolerable, that you will be able to tap into intrinsic motivation.
alenglander on Some "meta-cruxes" for AI x-risk debates
I agree that the first can be framed as a meta-crux, but actually I think the way you framed it is more of an object-level forecasting question, or perhaps a strong prior on the forecasted effects of technological progress. If on the other hand you framed it more as conflict theory vs. mistake theory [? · GW], then I'd say that's more on the meta level.
For the second, I agree that's for some people, but I'm skeptical of how prevalent the cosmopolitan view is, which is why I didn't include it in the post.
cody-rushing on Stephen Fowler's Shortform
- Less important, but the grant justification appears to take seriously the idea that making AGI open source is compatible with safety. I might be missing some key insight, but it seems trivially obvious why this is a terrible idea even if you're only concerned with human misuse and not misalignment.
Hmmm, can you point to where you think the grant shows this? I think the following paragraph from the grant seems to indicate otherwise:
When OpenAI launched, it characterized the nature of the risks – and the most appropriate strategies for reducing them – in a way that we disagreed with. In particular, it emphasized the importance of distributing AI broadly; our current view is that this may turn out to be a promising strategy for reducing potential risks, but that the opposite may also turn out to be true (for example, if it ends up being important for institutions to keep some major breakthroughs secure to prevent misuse and/or to prevent accidents). Since then, OpenAI has put out more recent content consistent with the latter view, and we are no longer aware of any clear disagreements. However, it does seem that our starting assumptions and biases on this topic are likely to be different from those of OpenAI’s leadership, and we won’t be surprised if there are disagreements in the future.
chipmonk on Some Things That Increase Blood Flow to the Brain
Update: I resolved maybe all of my neck tension and vagus nerve tension. I don't know how to tell whether this increased my intelligence, though. It's also not like I had headaches or anything obvious like that before.
review-bot on Language Models Model Us
The LessWrong Review [? · GW] runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2025. The top fifty or so posts are featured prominently on the site throughout the year. Will this post make the top fifty?
chipmonk on Transformers Represent Belief State Geometry in their Residual Stream
this post seems like a win for PIBBSS gee
ebenezer-dukakis on robo's Shortform
If LW takes this route, it should be cognizant of the usual challenges of getting involved in politics. I think there's a very good chance of evaporative cooling, where people trying to see AI clearly gradually leave, and are replaced by activists. The current reaction to OpenAI events is already seeming fairly tribal IMO.