Comments
I'm really excited about this, but not because of the distinction drawn between the shoggoth and the face. Applying a paraphraser such that the model's internal states are repeatedly swapped for states which we view as largely equivalent could be a large step towards interpretability.
This reminds me of the observation that CNNs work well because they are equivariant under translation. Such models can also be made (approximately) rotationally equivariant by applying all possible rotations, at a given resolution, to the training data. In doing this, we create a model which does not rely on absolute position or orientation, and that turns out to be an excellent prior for generalisable image categorisation, among many other tasks.
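For concreteness, here is a minimal sketch of that kind of augmentation using PyTorch/torchvision; the dataset, rotation range and other settings are illustrative assumptions of mine, not anything from the post.

```python
# Minimal sketch: random rotations as training-time augmentation, so that
# absolute orientation stops being a reliable feature for the model.
import torchvision.transforms as T
from torchvision.datasets import CIFAR10  # stand-in dataset for illustration

train_transform = T.Compose([
    T.RandomRotation(degrees=180),  # sample an angle in [-180, 180] per image
    T.ToTensor(),
])

train_set = CIFAR10(root="./data", train=True, download=True,
                    transform=train_transform)
```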
We decide upon transformations which our model should be invariant under and apply them to the training data. Doing the same to internal states, by periodically applying such transformations to the model's activations, could force the internal messaging not only to be interpretable to us, but to share our treatment of irrelevant details.
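As a very rough sketch of what this could look like, assuming a PyTorch model and a hypothetical `paraphrase_states` function that maps a hidden-state tensor to one we would regard as equivalent (everything below is a placeholder of my own, not a worked-out method):

```python
import torch
import torch.nn as nn

def paraphrase_states(hidden: torch.Tensor) -> torch.Tensor:
    # Placeholder transformation: in practice this would swap the activations
    # for a representation we consider semantically equivalent; here it just
    # adds small noise so the model cannot rely on the exact values of
    # irrelevant details.
    return hidden + 0.01 * torch.randn_like(hidden)

def add_paraphraser_hooks(model: nn.Module, layer_names: set[str]):
    # Register forward hooks on the chosen layers (assumed to return plain
    # tensors) so that, during training, their outputs are replaced by
    # paraphrased versions before being passed on.
    handles = []
    for name, module in model.named_modules():
        if name in layer_names:
            handles.append(module.register_forward_hook(
                lambda mod, inp, out: paraphrase_states(out)))
    return handles
```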
There would likely be a negative alignment tax associated with this: as with data augmentation, removing reliance on irrelevant details could improve generalisation. In any case, this seems to be a broadly applicable approach to improving interpretability in other contexts.
Consider an image generation model which is configured to output progressively higher-resolution, more fleshed-out generations. If asked to generate a dog, perhaps our 'paraphraser' could swap out parts of the background or change the absolute position of the dog in the image, making fewer changes as the image is filled in. If our model works well, this should give us a greater diversity of outputs. If it is failing, a change of scenery from a grassy field to a city block could cause the model to diverge entirely from the prompt, generating a telephone box rather than a dog. This could expose erroneous associations the model has drawn and reveal details of its functioning which impede proper generalisation.
I agree with your points about avoiding political polarisation and allowing people with different ideological positions to collaborate on alignment. I'm not sure about the idea that aligning to a single group's values (or to a coherent ideology) is technically easier than a more vague 'align to humanity's values' goal.
Groups rarely have clearly articulated ideologies; they run more on shared vibes which everyone broadly gets behind. An alignment approach that starts from clearly spelling out what you consider valuable therefore doesn't seem likely to work. Looking at existing models which have been aligned to some degree through safety testing, the work doesn't take the form of injecting a clear, structured value set. Instead, large numbers of people with differing opinions and worldviews continually correct the system until it generally behaves itself. That seems far more pluralistic than 'alignment to one group' suggests.
This comes with the caveat that these systems are built and safety-tested by people with highly atypical attitudes compared to the rest of their species. But sourcing viewpoints from a wider pool seems to be an organisational issue rather than a technical one.
One method of keeping humans in key industrial processes might be expanding credentialism. Individuals retaining control even when the majority of the thinking isn't done by them has always been a key feature of hierarchical organisations.
Legally speaking, certain key tasks can only be performed by qualified accountants, auditors, lawyers, doctors, elected officials and so on.
It would not be good for short-term economic growth. However, legally requiring that certain tasks be performed by people holding credentials which machines are not eligible for might be a good (though absolutely not perfect) way of keeping humans in the loop.
Broadly agree, in that most safety research expands our control over systems and our understanding of them, both of which can be abused by a bad actor.
This problem is also encountered by for-profit companies, with profit on the line instead of catastrophe. They too have R&D departments and research directions with the potential for misuse. However, that research is done inside a social environment (the company) where, in practice, it is only put to uses intended to make money.
To give a more concrete example, improving self-driving capabilities also allows the companies making the cars to intentionally make them run people down, if they so wished. The more advanced the capabilities, the more precisely they could deploy their pedestrian-killing machines onto the roads. However, we would never expect this to happen, as it would clearly demolish the company's profitability and bring these activities to an end.
AI safety research is not currently done in anything like this kind of environment. However, it does seem to me that institutions of this kind, which carefully vet research and products and release them only while they remain beneficial, are possible.
Really fascinating stuff! I have a (possibly already answered) question about how expert updates on other experts' predictions might be valuable.
You discuss the negative impacts of allowing experts to aggregate themselves, or to view one another's forecasts before initially submitting their own. Might there be value in allowing experts to submit multiple times, each time seeing the predictions submitted in a previous round? The final aggregation scheme would then be able not only to assign a credence to each expert, but also to gain a proxy for the credence the experts give to one another. In the more realistic scenario where experts will talk, if not collude, this might give better insight into how their predictions are being formed.
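To make the suggestion concrete, here is a toy sketch of my own (not anything from the thesis), assuming two rounds of probability reports on a binary question and a hypothetical per-expert `credence` weight supplied by the aggregator:

```python
import numpy as np

def aggregate_two_rounds(round1, round2, credence):
    """round1, round2: per-expert probabilities; credence: aggregator's weights."""
    round1, round2 = np.asarray(round1, float), np.asarray(round2, float)
    w = np.asarray(credence, float) / np.sum(credence)

    # Round-1 consensus, weighted by the aggregator's credence in each expert.
    consensus1 = np.average(round1, weights=w)

    # How far each expert moved towards the round-1 consensus between rounds:
    # a crude proxy for the credence they place in the other experts.
    gap = consensus1 - round1
    deference = np.divide(round2 - round1, gap,
                          out=np.zeros_like(gap), where=gap != 0)

    # The final aggregate uses the revised (round-2) forecasts.
    final = np.average(round2, weights=w)
    return final, deference

# Example with made-up numbers: the first expert moves roughly halfway
# towards the round-1 consensus, the others barely move.
final, deference = aggregate_two_rounds(
    round1=[0.20, 0.60, 0.70],
    round2=[0.35, 0.60, 0.65],
    credence=[1.0, 2.0, 1.0],
)
print(final, deference)
```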
Thanks for taking the time to distill this work into a more approachable format - it certainly made the thesis more manageable!