Appendices to the live agendas
post by technicalities, Stag · 2023-11-27T11:10:32.187Z · LW · GW · 4 comments
Lists cut from our main post [LW · GW], in a token gesture toward readability.
We list past reviews of alignment work, ideas which seem to be dead, the cool but neglected neuroscience / biology approach, various orgs which don't seem to have any agenda, and a bunch of things which don't fit elsewhere.
Appendix: Prior enumerations
- Everitt et al (2018)
- Ji (2023)
- Soares (2023) [LW · GW]
- Gleave and MacLean (2023) [LW · GW]
- Krakovna and Shah on DeepMind (2023) [AF · GW]
- AI Plans (2023), mostly irrelevant
- Macdermott (2023) [LW · GW]
- Kirchner et al (2022): unsupervised analysis [AF · GW]
- Krakovna (2022), Paradigms of AI alignment
- Hubinger (2020)
- Perret (2020) [AF · GW]
- Nanda (2021) [LW · GW]
- Larsen (2022) [LW · GW]
- Zoellner (2022) [LW · GW]
- McDougall (2022)
- Sharkey et al (2022) [AF · GW] on interp
- Koch (2020) [AF · GW]
- Critch (2020) [LW · GW]
- Hubinger on types of interpretability [LW · GW] (2022)
- TAI safety bibliography (2021)
- Akash and Larsen (2022) [LW · GW]
- Karnofsky prosaic plan [LW · GW] (2022)
- Steinhardt (2019)
- Russell (2016)
- FLI (2017)
- Shah (2020) [AF · GW]
- things which claim to be agendas [? · GW]
- Tekofsky listing indies (2023)
- This thing called boundaries (2023) [LW · GW]
- Sharkey et al (2022), mech interp [AF · GW]
- FLI Governance scorecard
- The money
Appendix: Graveyard
- Ambitious value learning?
- MIRI youngbloods (see Hebbar)
- John Wentworth’s selection theorems [? · GW]??
- Provably Beneficial Artificial Intelligence (but see Open Agency and Omohundro)
- HCH [? · GW] (see QACI)
- IDA → Critiques and recursive reward modelling
- Debate is now called Critiques and ERO
- Market-making [AF · GW] (Hubinger)
- Logical inductors
- Conditioning Predictive Models: Risks and Strategies?
- Impact measures [? · GW], conservative agency, side effects → “power aversion”
- Acceptability Verification [LW · GW]
- Quantilizers
- Redwood [LW · GW] interp?
- AI Safety Hub
- Enabling Robots to Communicate their Objectives (early stage interp?)
- Aligning narrowly superhuman models [LW · GW] (Cotra idea; tiny followup [AF · GW]; lives on as scalable oversight?)
- automation of semantic interpretability
- i.e. automatically proposing hypotheses instead of just automatically verifying them
- Oracle AI [? · GW] is not dead so much as ~everything LLM falls under its purview
- EleutherAI: #accelerating-alignment [LW · GW] - AI alignment assistants are live, but it doesn’t seem like EleutherAI is currently working on this
- Algorithm Distillation Interpretability (Levoso)
Appendix: Biology for AI alignment
Lots of agendas, but not clear if anyone besides Byrnes and Thiergart is actively turning the crank. Seems like it would need a billion dollars.
Human enhancement
- One-sentence summary: maybe we can give people new sensory modalities, or much higher bandwidth for conceptual information, or much better idea generation, or direct interface with DL systems, or direct interface with sensors, or transfer learning, and maybe this would help. The old superbaby dream goes here, I suppose.
- Theory of change: maybe this makes us better at alignment research
Merging
- One-sentence summary: maybe we can form networked societies of DL systems and brains
- Theory of change: maybe this lets us preserve some human values through bargaining or voting or weird politics.
- Cyborgism [LW · GW], Millidge [LW · GW], Dupuis [LW · GW]
As alignment aid
- One-sentence summary: maybe we can get really high-quality alignment labels from brain data, maybe we can steer models by training humans to do activation engineering fast and intuitively, maybe we can crack the true human reward function / social instincts and maybe adapt some of them for AGI.
- Theory of change: as you’d guess
- Some names: Byrnes [LW · GW], Cvitkovic, Foresight’s BCI, Also (list from Byrnes): Eli Sennesh, Adam Safron, Seth Herd, Nathan Helm-Burger, Jon Garcia, Patrick Butlin
Appendix: Research support orgs
One slightly confusing class of org is exemplified by the sample {CAIF, FLI}: often run by active researchers with serious alignment experience, but usually not following an obvious agenda, instead delegating a basket of strategies to grantees and doing field-building work like NeurIPS workshops and summer schools.
CAIF
- One-sentence summary: support researchers making differential progress in cooperative AI (eg precommitment mechanisms that can’t be used to make threats)
- Some names: Lewis Hammond
- Estimated # FTEs: 3
- Some outputs in 2023: NeurIPS contest, summer school
- Funded by: Polaris Ventures
- Critiques:
- Funded by: ?
- Trustworthy command, closure, opsec, common good, alignment mindset: ?
- Resources: £2,423,943
AI Safety Camp
- One-sentence summary: entrypoint for new researchers to test fit and meet collaborators. More recently focussed on a capabilities pause. Still going!
- Some names: Remmelt Ellen, Linda Linsefors
- Estimated # FTEs: 2
- Some outputs in 2023: tag [? · GW]
- Funded by: ?
- Critiques: ?
- Funded by: ?
- Trustworthy command, closure, opsec, common good, alignment mindset: ?
- Resources: ~$200,000
See also:
- FLI
- AISS
- PIBBSS
- Gladstone AI
- Apart
- Catalyze
- EffiSciences
- Students (Oxford, Harvard, Groningen, Bristol, Cambridge, Stanford, Delft, MIT)
Appendix: Meta, mysteries, more
- Metaphilosophy [LW · GW] (2023 [LW · GW]) – Wei Dai, probably <1 FTE
- Encultured - making a game
- Bricman: https://compphil.github.io/truth/
- Algorithmic Alignment
- McIlraith: side effects and human-aware AI
- how hard is alignment? [LW(p) · GW(p)]
- Intriguing
- The UK government now has an evals/alignment lab of sorts (Xander Davies, Nitarshan Rajkumar, Krueger, Rumman Chowdhury)
- The US government will soon have an evals/alignment lab of sorts, USAISI. (Elham Tabassi?)
- A Roadmap for Robust End-to-End Alignment (2018)
- Not theory but cool
- Greene Lab Cognitive Oversight? The closest they’ve got is AI governance, and they only got $13,800
- GATO Framework seems to have a galaxy-brained global coordination and/or self-aligning AI angle
- Wentworth 2020 [? · GW] - a theory of abstraction suitable for embedded agency.
- https://www.aintelope.net/
- ARAAC - Make RL agents that can justify themselves (or other agents with their code as input) in human language at various levels of abstraction (AI apology, explainable RL for broad XAI)
- Modeling Cooperation
- Sandwiching (ex [LW · GW])
- Tripwires [? · GW]
- Meeseeks AI [LW · GW]
- Shutdown-seeking AI [LW · GW]
- Decision theory (Levinstein)
- CAIS: Machine ethics [? · GW] - Model intrinsic goods and normative factors, vs mere task preferences, as these will be relevant even under extreme world changes; helps us avoid proxy misspecification as well as value lock-in. Present state of research unknown (some relevant bits in RepEng).
- CAIS: Power Aversion [? · GW] - incentivise models to avoid gaining more power than necessary. Related to mild optimisation and power-seeking. Present state of research unknown.
4 comments
comment by Alex_Altair · 2023-11-28T18:09:57.659Z · LW(p) · GW(p)
Honestly this isn't that long, I might say to re-merge it with the main post. Normally I'm a huge proponent of breaking posts up smaller, but yours is literally trying to be an index, so breaking a piece off makes it harder to use.
↑ comment by technicalities · 2023-11-29T09:31:29.689Z · LW(p) · GW(p)
yeah you're right
comment by Steven Byrnes (steve2152) · 2023-11-27T13:34:24.808Z · LW(p) · GW(p)
For what it’s worth, I am not doing (and have never done) any research remotely similar to your text “maybe we can get really high-quality alignment labels from brain data, maybe we can steer models by training humans to do activation engineering fast and intuitively”.
I have a concise and self-contained summary of my main research project here (Section 2) [LW · GW].
↑ comment by technicalities · 2023-11-27T13:37:13.115Z · LW(p) · GW(p)
I care a lot! Will probably make a section for this in the main post under "Getting the model to learn what we want", thanks for the correction.