Introduction and current research agenda
post by quila · 2023-11-20T12:42:48.594Z · LW · GW
I'm quila, and I have been studying alignment for the past year.
After first reading the sequences [? · GW] as others advised, I have been poring over alignment literature every day since late 2022. I've also been discussing subjects and ideas with other alignment researchers via discord, but so far I have not shared my theoretical work with the broader alignment community.
I think I'm ready to start doing that, so here's a post contextualizing my agenda.
First, I think superintelligence will probably arrive soon. In that case, we may not have enough time to solve alignment from within the 'old framework' [LW · GW] of highly optimized agents. Instead, my focus is on a different (but still pivotal) goal: enabling the safe use of unaligned systems to steer reality.
I hope for this to bring Earth to a point where things are roughly okay, and where we have more time to solve the hard problems of aligning powerful agents.
Without this frame, my future posts may at first appear to focus on problems outside that scope, such as myopia [? · GW], performative prediction [LW · GW], and other concepts yet to be named. I hope that, read with the above focus in mind, their connection to this longer-term plan will be clear.
Second, I expect superintelligent predictive models to be creatable in the future. Although current predictive models have promising properties [LW · GW], catastrophic failure modes are likely to arise at higher capability levels (e.g. as detailed in 'conditioning predictive models' [? · GW]). My hope here is to develop methods which bridge the safety gap between current and superintelligent models, leaving no free variables whose optimization would affect the world in unexpected ways.
Lastly, a note on why I care to begin with.
I suffered a lot as a human, and came to feel it is dire to minimize suffering in other beings (human, animal, artificial). Solving alignment seems to be the best way to do this in our lightcone and beyond.
There has been some discussion [LW(p) · GW(p)] about how future value should be distributed. Although I do have some ideals for what a good universe would look like, they are minor in comparison to my opposition to suffering.
Therefore, I have few worries about who the eventual ASI is aligned to, or whether they 'follow through on the LDT handshake [LW(p) · GW(p)]'. As long as the resulting world minimizes the occurrence of devastating forms of suffering, I will be mostly satisfied.
If you're interested in working together, please reach out to me on discord (username: