Comments
I agree with your points about avoiding political polarisation and allowing people with different ideological positions to collaborate on alignment. I'm not sure about the idea that aligning to a single group's values (or to a coherent ideology) is technically easier than a more vague 'align to humanity's values' goal.
Groups rarely have clearly articulated ideologies - more like vibes that everyone generally gets behind. An alignment approach that starts from clearly spelling out what you consider valuable doesn't seem likely to work. Looking at existing models that have been aligned to some degree through safety testing, the work doesn't take the form of injecting a clear, structured value set. Instead, large numbers of people with differing opinions and world views continually correct the system until it generally behaves itself. This seems far more pluralistic than 'alignment to one group' suggests.
This comes with the caveat that these systems are created and safety-tested by people with highly unusual attitudes compared to the rest of their species. But sourcing viewpoints from outside seems to be an organisational issue rather than a technical one.
One method of keeping humans in key industrial processes might be expanding credentialism. Individuals retaining control even when the majority of the thinking isn't done by them has always been a key part of any hierarchical organisation.
Legally speaking, certain key tasks can only be performed by qualified accountants, auditors, lawyers, doctors, elected officials and so on.
It would not be good for short-term economic growth. However, legally requiring that certain tasks be performed by people with credentials that machines are not eligible for might be a good (though absolutely not perfect) way of keeping humans in the loop.
Broadly agree, in that most safety research expands our control over systems and our understanding of them, both of which can be abused by a bad actor.
This problem is also encountered by for-profit companies, where profit rather than catastrophe is on the line. They too have R&D departments and research directions with the potential for misuse. However, this research is done inside a social environment (the company) in which it is only ever put to uses that make money.
To give a more concrete example, improving self-driving capabilities also allows the companies making the cars to intentionally run people down if they so wished. The more advanced the capabilities, the more precisely they could deploy their pedestrian-killing machines onto the roads. However, we would never expect this to happen, as it would clearly demolish the company's profitability and result in the cessation of these activities.
AI safety research is not currently done in this kind of environment at all. However, it does seem to me that institutions of this kind, which carefully vet research and products and only release them when they remain beneficial, are possible.
Really fascinating stuff! I have a (possibly already answered) question about how using experts' updates on other experts' predictions might be valuable.
You discuss the negative impacts of allowing experts to aggregate themselves, or to view one another's forecasts before initially submitting their own. Might there be value in allowing experts to submit multiple times, each time seeing the submitted predictions of the previous round? The final aggregation scheme would then be able not only to assign a credence to each expert, but also to gain a proxy for how much credence the experts give to one another. In a more practical scenario where experts will talk, if not collude, this might give better insight into how expert predictions are being created. A toy sketch of what I mean is below.
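To make the suggestion concrete, here is a minimal, purely illustrative Python sketch - not anything from the thesis. Each expert resubmits after seeing the previous round's forecasts, and the aggregator treats how far each expert moved toward the previous round's mean as a rough proxy for the credence they give to the others. The 'stubbornness' update rule, the proxy, and all function names are assumptions of mine.

```python
import numpy as np

def run_rounds(initial_forecasts, update_fns, n_rounds=3):
    """Collect forecasts over several rounds; in each round every expert
    sees the full set of previous-round submissions before resubmitting."""
    rounds = [np.asarray(initial_forecasts, dtype=float)]
    for _ in range(n_rounds - 1):
        prev = rounds[-1]
        rounds.append(np.array([f(prev[i], prev) for i, f in enumerate(update_fns)]))
    return rounds  # list of per-round forecast arrays

def mutual_credence_proxy(rounds):
    """Rough proxy for how much credence each expert gives the others:
    the fraction of the gap between their initial forecast and the
    previous round's mean that they closed by the final round."""
    first, last, prev_mean = rounds[0], rounds[-1], rounds[-2].mean()
    gap = np.abs(first - prev_mean) + 1e-9
    return np.clip(np.abs(last - first) / gap, 0.0, 1.0)

# Toy example: three experts forecasting a probability. Each (hypothetical)
# update rule mixes the expert's own previous forecast with the previous
# round's mean, weighted by how "stubborn" they are.
stubbornness = [0.9, 0.5, 0.2]
update_fns = [
    (lambda own, prev, s=s: s * own + (1 - s) * prev.mean())
    for s in stubbornness
]
rounds = run_rounds([0.2, 0.6, 0.9], update_fns)

# One possible use of the proxy: downweight experts who simply follow the herd.
weights = 1.0 - mutual_credence_proxy(rounds) + 1e-9
aggregate = float(np.average(rounds[-1], weights=weights))
print(rounds[-1], weights, aggregate)
```

The specific downweighting at the end is just one possibility; the movement between rounds could equally feed into whatever credence-assignment scheme the thesis already uses. The point is only that it is observable data the aggregator could exploit.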
Thanks for taking the time to distill this work into a more approachable format - it certainly made the thesis more manageable!