LessWrong 2.0 Reader
Just because the number of almost-orthogonal vectors in d dimensions scales exponentially with d, doesn't mean one can choose all those signals independently. We can still only choose d real-valued signals at a time (assuming away the sort of tricks by which one encodes two real numbers in a single real number, which seems unlikely to happen naturally in the body). So "more intended behaviors than input-vector components" just isn't an option, unless you're exploiting some kind of low information density in the desired behaviors (e.g. very "sparse activation" of the desired behaviors, or discreteness of the desired behaviors to a limited extent).
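A quick numerical sketch of both halves of that claim (illustrative dimensions, numpy assumed; not from the original comment): random directions in d dimensions are pairwise almost orthogonal in bulk, yet the matrix they form still has rank at most d, so only d independent real-valued signals can be chosen at a time.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 1000, 2000  # dimension and number of random directions (illustrative values)

V = rng.standard_normal((n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)  # rows are unit vectors

# Bulk near-orthogonality: off-diagonal cosines concentrate around ~1/sqrt(d).
G = V @ V.T
off_diag = np.abs(G[~np.eye(n, dtype=bool)])
print(f"largest |cosine| among {n} random directions: {off_diag.max():.3f}")

# But only d signals at a time: the rank of any such collection is capped at d.
print(f"rank of the {n} x {d} matrix: {np.linalg.matrix_rank(V)}")
```

Here the pairwise cosines stay small even across thousands of directions, while the rank check confirms the hard cap at d.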
johnswentworth on Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals
TBC, I don't particularly expect hard constraints to show up; that was more a way of illustrating the underlying concept. The same underlying concept in the market-style picture would be: across many different top-level goals, there are convergent ways of carving up "property rights". So, a system can be generally corrigible by "respecting the convergent property rights", so to speak.
milan-w on Jakub Halmeš's Shortform
It may just use l33tc0d3 or nonexistent words interpolated from nearby languages.
max-harms on Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals
This seems mostly right. I think there still might be problems where identifying and charging for relevant externalities is computationally harder than routing around them. For instance, if you're dealing with a civilization (such as humanity) that is responding to your actions in complex and chaotic ways, it may be intractable to price "reputation damage" efficiently, and instead you might want to be overly cautious (i.e. "impose constraints") and think through deviations from that cautious baseline on a case-by-case basis (i.e. "forward-check"). Again, I think your point is mostly right, and a useful frame -- it makes me less likely to expect the kinds of hard constraints that Wentworth and Lorell propose to show up in practice.
buck on Ten people on the inside
Yep, I think that at least some of the 10 would have to have some serious hustle and political savvy that is atypical (but not totally absent) among AI safety people.
What laws are you imagining making it harder to deploy stuff? Notably, I'm imagining these people mostly doing stuff with internal deployments.
I think you're overfixating on the experience of Google, which has more complicated production systems than most.
cbiddulph on Why not train reasoning models with RLHF?
Thanks! Apparently I should go read the r1 paper :)
sharmake-farah on Should you publish solutions to corrigibility?
The answer depends on your values, so there isn't really a single answer to give here.
seth-herd on Should you publish solutions to corrigibility?
I see. I think at least about 99% of humanity are not so sadistic as to create a future with less than zero utility. Sociopaths are something like ten percent of the population, but like everything else it's on a spectrum. Sociopaths usually also have some measure of empathy and desire for approval. In a world where they've won, I think most of them would rather be hailed as a hero than be an eternal sadistic despot. Sociopathy doesn't automatically include a lot of sadism, just desire for revenge against perceived enemies.
So I'd take my chances with a human overlord far before accepting extinction.
Note that if we let our AGI go rogue to avoid s-risk, our zero-value light cone might also eclipse other light cones that would otherwise have had value.
milan-w on Kajus's Shortform
The incentives for early-career researchers are to blame for this mindset imo. Having legible output is a very good signal of competence for employers/grantors. I think it probably makes sense for a researcher's first project or two to be more of a cool demo than clear steps towards a solution.
Unfortunately, some mid-career and sometimes even senior researchers keep this habit of forward-chaining from what looks cool instead of backward-chaining from good futures. Ok, the previous sentence was a bit too strong. No reasoning is pure backward-chaining or pure forward-chaining. But I think that a common failure mode is not thinking enough about theories of change.
I think it's more a matter of Not Enough Dakka plus making it illegal to do those things in what should be reasonable ways. I agree there are economic (and regulatory) interventions that could make an enormous difference, but for various reasons I don't think any government is currently willing and able to implement them at scale. A crisis needs to be a lot more acute to motivate that scale of change.