Jack O'Brien's Shortform
post by Jack O'Brien (jack-o-brien) · 2022-12-01T08:58:32.177Z · LW · GW
comment by Jack O'Brien (jack-o-brien) · 2024-08-25T03:01:15.019Z · LW(p) · GW(p)
**Progress Report: AI Safety Fundamentals Project**

This is a public space for me to keep updated on my AI Safety Fundamentals project. The project will take 4 weeks. My goal is to stay lean and limit my scope so I can actually finish on time. I aim to update this post at least once per week, maybe more often.
Overall, I want to work on agent foundations and the theory behind AI alignment agendas. One stepping stone for this is selection theorems: a research program looking for justifications that a given training process will produce an agent with a given property.
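As a rough schematic (my own framing and notation, not John's), a selection theorem has the shape:

$$M \text{ survives selection pressure } S \text{ over environment class } \mathcal{E} \;\implies\; M \text{ satisfies property } P.$$

Coherence theorems are the canonical instance: take S = "not exploitable via dominated strategies" and P = "behaves like an expected utility maximizer".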
My plan for the AGISF project: a literature review on selection theorems. Take a whole load of concepts / blog posts, read them, and riff on them if I feel like it. At minimum, write a one-paragraph summary of each post I'm interested in. List of posts:
- John Wentworth's original posts on selection theorems, and Adam Khoja's distillation of them.
- Scott Garrabrant's posts on geometric rationality.
- Coherence theorems for utility theory.
- A shallow dive into evolutionary biology, with an explanation of the Price and Fisher equations (sketched just after this list).
- Maybe some stuff by Thane Ruthenis.
- Some content from Jaynes' Probability Theory on Bayesianism vs. frequentism.
- Power-seeking is instrumentally convergent in MDPs (see the toy sketch at the end of this comment).
- More examples to come once I read John's original post.
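For reference, here is a sketch of the two equations from the evolutionary biology item above, in their standard textbook forms (worth double-checking before I rely on them). The Price equation decomposes the change in a population-mean trait $\bar{z}$ into a selection term and a transmission term:

$$\bar{w}\,\Delta \bar{z} = \operatorname{Cov}(w_i, z_i) + \operatorname{E}\!\left[w_i \,\Delta z_i\right],$$

where $w_i$ is the fitness of individual $i$, $z_i$ its trait value, and $\bar{w}$ the mean fitness. Fisher's fundamental theorem then says the selection-driven increase in mean fitness per generation equals the additive genetic variance in fitness scaled by mean fitness:

$$\Delta \bar{w} = \frac{\operatorname{Var}_A(w)}{\bar{w}}.$$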
TODO:
- Make an initial LessWrong progress report.
- Make a list of things to read.
- Make a git repo on my PC with Markdown and MathJax support. Populate the initial document with the list of things to read. For each thing I read, remove it from the TODO list and put its summary in the main body of the blog post. When I am done, any posts still left on the TODO list will be formatted and added as an 'additional reading' section.
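Since power-seeking in MDPs is on the reading list, here's a minimal runnable sketch of the underlying idea, using a made-up 4-state deterministic MDP (the states, transitions, and constants are all my own toy assumptions, not Turner et al.'s setup). It averages the optimal state value over many randomly drawn reward functions, a crude proxy for their POWER measure; the 'hub' state that keeps options open should come out ahead of the dead-end states.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9  # discount factor (arbitrary choice)

# Hypothetical toy deterministic MDP: transitions[s] lists the states
# reachable in one step from s. State 0 is a "hub" with three options;
# states 1-3 are dead ends that only loop back to themselves.
transitions = {0: [1, 2, 3], 1: [1], 2: [2], 3: [3]}
n_states = len(transitions)

def optimal_values(reward):
    """Value iteration for a deterministic MDP with state-based rewards."""
    v = np.zeros(n_states)
    for _ in range(500):  # plenty of sweeps for convergence at gamma = 0.9
        v = np.array([reward[s] + gamma * max(v[t] for t in transitions[s])
                      for s in range(n_states)])
    return v

# Average optimal value over many uniformly random reward functions --
# a crude stand-in for POWER. The hub should score highest, because
# whatever the reward turns out to be, it can steer toward the best option.
samples = [optimal_values(rng.uniform(size=n_states)) for _ in range(200)]
avg_value = np.mean(samples, axis=0)
print({s: round(float(avg_value[s]), 2) for s in range(n_states)})
```

The point of the toy: instrumental convergence falls out of option value, since the hub's average optimal value beats the dead ends' for almost any distribution over rewards.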
↑ comment by Jack O'Brien (jack-o-brien) · 2024-09-08T09:17:42.595Z · LW(p) · GW(p)
Well, I haven't got much done in the last two weeks. Life has gotten in the way, and in the times when I thought I actually had the time and headspace to work on the project, things kept happening: I injured my shoulder playing sport, and my laptop mysteriously died.
But I have managed to create a GitHub repo and read the original posts on selection theorems. My list of selection theorems to summarize has grown. Check out the repo: https://github.com/jack-obrien/selection-theorems-review
Tonight I will try to do at least an hour of solid work on it. I want to summarize the idea of selection theorems, summarize the good regulator theorem, and start reading the next post (probably Turner's post on power-seeking).
↑ comment by Jack O'Brien (jack-o-brien) · 2024-09-15T05:47:41.112Z · LW(p) · GW(p)
Hmm, what have I done so far? I didn't really get any solid work done this week either. I have decided, along with the other two people involved, to extend the project by another two weeks; we have all been pretty preoccupied with life. Last Sunday night I didn't manage a solid hour of work, but I did summarize the concept of selection theorems and think about the agent type signature, a super fundamental concept I will be referring to throughout the post. Tonight I will hopefully actually meet with my group; I want to do about half an hour of work before, and a little bit after too. This week I want to summarize the good regulator theorem as well as Turner's post on power-seeking.
comment by Jack O'Brien (jack-o-brien) · 2022-12-01T08:58:32.415Z · LW(p) · GW(p)
Let's be optimistic and try to prove that an agentic AI will be beneficial for the long-term future of humanity. We probably need to prove these three premises:
Premise 1: Training story X will create an AI model which approximates agent formalism A
Premise 2: Agent formalism A is computable and has a set of alignment properties P
Premise 3: An AI with a set of alignment properties P will be beneficial for the long-term future.
Aaand so far I'm not happy with our answers to any of these.
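One way to write down the intended chain (my own formalization, with made-up notation):

$$\begin{aligned}
&\text{(P1)}\quad \text{Train}(X) \rightsquigarrow M \approx A\\
&\text{(P2)}\quad A \text{ is computable, and } A \models P\\
&\text{(P3)}\quad (M \models P) \implies \text{Beneficial}(M)\\
&\;\therefore\quad \text{Train}(X) \rightsquigarrow \text{Beneficial}(M)
\end{aligned}$$

Note that even granting all three, there is a hidden fourth step: the properties P need to be robust under the approximation in Premise 1, i.e. $M \approx A$ has to actually imply $M \models P$, which is a nontrivial claim in its own right.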
↑ comment by Isabella Barber (isabella-barber) · 2022-12-01T10:02:33.298Z · LW(p) · GW(p)
Maybe there is no set of properties P that can produce alignment, hmm.