LessWrong 2.0 Reader
I hope it would, but I actually think it would depend on who or what killed whom, how, and whether it was really an accident at all.
If an American-made AI hacked the DOD and nuked Milan because someone asked it to find a way to get the 2026 Olympics moved, then I agree, we would probably get pushback against race incentives.
If a Chinese-made AI killed millions in Taiwan in an effort to create an opportunity for China to seize control, that could possibly *accelerate* race dynamics.
johnswentworth on johnswentworth's Shortform: Just because the number of almost-orthogonal vectors in d dimensions scales exponentially with d, doesn't mean one can choose all those signals independently. We can still only choose d real-valued signals at a time (assuming away the sort of tricks by which one encodes two real numbers in a single real number, which seems unlikely to happen naturally in the body). So "more intended behaviors than input-vector components" just isn't an option, unless you're exploiting some kind of low-information-density in the desired behaviors (like e.g. very "sparse activation" of the desired behaviors, or discreteness of the desired behaviors to a limited extent).
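A rough numerical sketch of this point, using random unit vectors as a stand-in for the almost-orthogonal directions (the particular d and n below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1000   # dimension of the signal space
n = 20000  # far more candidate directions than dimensions

# Random unit vectors in d dimensions are pairwise almost orthogonal:
# typical |cosine similarity| is on the order of 1/sqrt(d), even with n >> d of them.
V = rng.standard_normal((n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)
cosines = V[:500] @ V[500:1000].T                  # sample of pairwise cosines
print("typical |cos|:", np.abs(cosines).mean())    # roughly 0.025 for d = 1000

# But any signal actually sent is a single point in R^d, so at most d of its
# coordinates can be chosen independently. Reading it out against all n
# directions gives n numbers, yet the readout map still has rank only d.
x = rng.standard_normal(d)       # d real-valued degrees of freedom
readouts = V @ x                 # n correlated readouts, not n free choices
print("rank of readout map:", np.linalg.matrix_rank(V[:2000]))  # == d
```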
johnswentworth on Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals: TBC, I don't particularly expect hard constraints to show up, that was more a way of illustrating the underlying concept. The same underlying concept in the market-style picture would be: across many different top-level goals, there are convergent ways of carving up "property rights". So, a system can be generally corrigible by "respecting the convergent property rights", so to speak.
milan-w on Jakub Halmeš's Shortform: It may just use l33tc0d3 or nonexistent words interpolated from neighboring languages.
max-harms on Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals: This seems mostly right. I think there still might be problems where identifying and charging for relevant externalities is computationally harder than routing around them. For instance, if you're dealing with a civilization (such as humanity) that responds to your actions in complex and chaotic ways, it may be intractable to find a way to efficiently price "reputation damage", and instead you might want to be overly cautious (i.e. "impose constraints") and think through deviations from that cautious baseline on a case-by-case basis (i.e. "forward-check"). Again, I think your point is mostly right, and a useful frame -- it makes me less likely to expect the kinds of hard constraints that Wentworth and Lorell propose to show up in practice.
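A toy sketch of the contrast being drawn here, with hypothetical names standing in for whatever a real system would use (nothing below is from the post itself, just an illustration of the two decision structures):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Action:
    name: str
    direct_value: float

def act_by_pricing(action: Action,
                   externality_cost: Callable[[Action], float]) -> bool:
    # "Charge for the externality": only works if the externality
    # (e.g. reputation damage) is tractable to price in the first place.
    return action.direct_value - externality_cost(action) > 0

def act_by_constraint(action: Action,
                      within_cautious_baseline: Callable[[Action], bool],
                      forward_check: Callable[[Action], bool]) -> bool:
    # "Impose constraints": default to a conservative baseline, and deviate
    # only after an explicit case-by-case check of consequences ("forward-check").
    if within_cautious_baseline(action):
        return True
    return forward_check(action)
```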
buck on Ten people on the inside: Yep, I think that at least some of the 10 would have to have some serious hustle and political savvy that is atypical (but not totally absent) among AI safety people.
What laws are you imagining making it harder to deploy stuff? Notably I'm imagining these people mostly doing stuff with internal deployments.
I think you're overfixating on the experience of Google, which has more complicated production systems than most.
cbiddulph on Why not train reasoning models with RLHF? Thanks! Apparently I should go read the r1 paper :)
sharmake-farah on Should you publish solutions to corrigibility? The answer depends on your values, and thus there isn't really a single answer to give here.
seth-herd on Should you publish solutions to corrigibility? I see. I think about 99% of humanity at the very least are not so sadistic as to create a future with less than zero utility. Sociopaths are something like ten percent of the population, but like everything else it's on a spectrum. Sociopaths usually also have some measure of empathy and desire for approval. In a world where they've won, I think most of them would rather be hailed as a hero than be an eternal sadistic despot. Sociopathy doesn't automatically include a lot of sadism, just desire for revenge against perceived enemies.
So I'd take my chances with a human overlord far before accepting extinction.
Note that our light cone with zero value might also eclipse other light cones that might've had value if we didn't let our AGI go rogue to avoid s-risk.
milan-w on Kajus's Shortform: The incentives for early-career researchers are to blame for this mindset imo. Having legible output is a very good signal of competence for employers/grantors. I think it probably makes sense for a researcher's first project or two to be more of a cool demo than clear steps towards a solution.
Unfortunately, some mid-career and sometimes even senior researchers keep this habit of forward-chaining from what looks cool instead of backward-chaining from good futures. Ok, the previous sentence was a bit too strong. No reasoning is pure backward-chaining or pure forward-chaining. But I think that a common failure mode is not thinking enough about theories of change.