Jack O'Brien's Shortform

post by Jack O'Brien (jack-o-brien) · 2022-12-01T08:58:32.177Z



comment by Jack O'Brien (jack-o-brien) · 2024-08-25T03:01:15.019Z

**Progress Report: AI Safety Fundamentals Project**

This is a public space where I'll post updates on my AI Safety Fundamentals project. The project will take 4 weeks. My goal is to stay lean and limit my scope so I can actually finish on time. I aim to update this post at least once per week, possibly more often.

Overall, I want to work on agent foundations and the theory behind AI alignment agendas. One stepping stone toward this is selection theorems: a research program to find justifications that a given training process will result in a given agent property.
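
As a rough schematic (my own paraphrase of Wentworth's framing, so treat it as a sketch rather than an official definition), a selection theorem has the shape:

$$\text{agent is selected for under pressure } S \implies \text{agent has type signature / property } P$$

Here $S$ might be a training process, evolution, or market competition, and $P$ might be "behaves like an expected utility maximizer" or "contains a model of its environment". The coherence theorems are the usual canonical instance: on the standard gloss, an agent that cannot be money-pumped behaves as an expected utility maximizer.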

My plan for the AGISF project: a literature review on selection theorems. Take a whole load of concepts / blog posts, read them, and riff on them if I feel like it. At the very least, write a one-paragraph summary of each post I'm interested in. List of posts:

  • John's original posts on selection theorems, and Adam Khoja's distillation of them.
  • Scott Garrabrant's posts on geometric rationality.
  • Coherence theorems for utility theory.
  • A shallow dive into evolutionary biology, with an explanation of the Price and Fisher equations (standard forms are sketched just after this list).
  • Maybe some stuff by Thane Ruthenis.
  • Some content from Jaynes' Probability Theory about Bayesianism vs. frequentism.
  • Power-seeking is instrumentally convergent in MDPs (a toy illustration follows the list).
  • ??? More examples to come once I read John's original post.
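
For reference on the evolutionary biology item, the standard forms of the two equations (textbook statements, not taken from any of the posts above): the Price equation decomposes the change in a population's mean trait value into a selection term and a transmission term,

$$\bar{w}\,\Delta\bar{z} \;=\; \operatorname{Cov}(w_i, z_i) \;+\; \operatorname{E}\!\left[\,w_i\,\Delta z_i\,\right]$$

where $z_i$ is individual $i$'s trait value, $w_i$ its fitness, and $\bar{z}$, $\bar{w}$ the population means. Fisher's fundamental theorem then says that the part of the change in mean fitness attributable to selection equals the additive genetic variance in fitness divided by mean fitness.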
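
And for the power-seeking item, a toy illustration (this is my own minimal sketch, not code from Turner et al.; the MDP and the Monte Carlo proxy for POWER are invented for illustration): estimate a state's POWER as its average optimal value over randomly sampled reward functions, and note that states with more reachable options come out higher.

```python
import random

# Toy deterministic MDP (hypothetical, purely for illustration): from each
# state the agent chooses one of the listed successor states.
GRAPH = {
    "hub": ["a", "b", "c"],              # a state with many options
    "a": ["a"], "b": ["b"], "c": ["c"],  # absorbing loops
    "dead_end": ["dead_end"],            # a state with a single option
}
GAMMA = 0.9        # discount factor
N_SAMPLES = 1000   # number of sampled reward functions
N_SWEEPS = 100     # value-iteration sweeps (plenty at gamma = 0.9)

def optimal_values(reward):
    """Value iteration for the deterministic MDP above."""
    values = {s: 0.0 for s in GRAPH}
    for _ in range(N_SWEEPS):
        values = {
            s: reward[s] + GAMMA * max(values[t] for t in GRAPH[s])
            for s in GRAPH
        }
    return values

def estimate_power(state):
    """Average optimal value under i.i.d. uniform rewards: a crude
    Monte Carlo proxy for POWER in Turner et al.'s sense."""
    total = 0.0
    for _ in range(N_SAMPLES):
        reward = {s: random.random() for s in GRAPH}
        total += optimal_values(reward)[state]
    return total / N_SAMPLES

print("hub:     ", round(estimate_power("hub"), 2))
print("dead_end:", round(estimate_power("dead_end"), 2))
```

"hub" should score noticeably higher than "dead_end": whichever absorbing loop happens to draw a high reward, "hub" can still steer into it. That is the instrumental-convergence intuition in miniature: keeping options open is valuable for most reward functions.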

TODO:

  • Make an initial LessWrong progress report.
  • Make a list of things to read.
  • Make a git repo on my PC with markdown and MathJax support. In the initial document, populate it with the list of things to read. For each thing I read, remove it from the TODO list and put its summary in the main body of the blog post. When I'm done, any posts still left on the TODO list will get formatted and added as an 'additional reading' section. (A hypothetical sketch of this step follows the list.)
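
As a sketch of that last step (the file name, section layout, and helper function are all hypothetical, just to pin the workflow down):

```python
# Hypothetical helper for the workflow above: drop a finished item from the
# TODO list and append its summary to the body of the markdown post.
from pathlib import Path

POST = Path("selection-theorems-review.md")  # assumed file name

def move_to_body(title: str, summary: str) -> None:
    lines = POST.read_text().splitlines()
    # Remove the item's bullet from the TODO list...
    lines = [line for line in lines if line.strip() != f"- {title}"]
    # ...and append a summary section at the end of the post.
    lines += ["", f"## {title}", "", summary]
    POST.write_text("\n".join(lines) + "\n")

# Demo: create a starter post, then move one finished reading into the body.
POST.write_text("# Selection Theorems Review\n\n## TODO\n- Good regulator theorem\n")
move_to_body("Good regulator theorem", "One-paragraph summary goes here.")
print(POST.read_text())
```
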
comment by Jack O'Brien (jack-o-brien) · 2024-09-08T09:17:42.595Z

Well, I haven't got much done in the last two weeks. Life has gotten in the way, and in the windows where I thought I actually had the time and headspace to work on the project, things kept happening: I injured my shoulder playing sport, and my laptop mysteriously died.

But I have managed to create a GitHub repo and read the original posts on selection theorems. My list of selection theorems to summarize has grown. Check out the GitHub page: https://github.com/jack-obrien/selection-theorems-review

Tonight I will try to do at least an hour of solid work on it. I want to summarize the idea of selection theorems, summarize the good regulator theorem, and start reading the next post (probably Turner's post on power-seeking). For reference, a rough statement of the good regulator theorem is below.
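
The good regulator theorem (Conant & Ashby, 1970) is the claim that "every good regulator of a system must be a model of that system". One common formal gloss (roughly stated; the hypotheses need care, which is what the follow-up posts dig into):

$$Z = \psi(S, R), \quad R \text{ chosen to minimize } H(Z) \implies R = h(S) \text{ for some deterministic } h$$

That is, when the outcome $Z$ is determined by the system state $S$ and the regulator's output $R$, the simplest entropy-minimizing regulator's output is a deterministic function of the system state: the regulator "models" the system.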

comment by Jack O'Brien (jack-o-brien) · 2024-09-15T05:47:41.112Z

Ummm, yeah. What have I done so far? I didn't really get any solid work done this week either. The other two people involved and I have decided to extend the project by another two weeks; we have all been pretty preoccupied with life. Last Sunday night I didn't manage a solid hour of work, but I did summarise the concept of selection theorems and think about the agent type signature, a concept I will refer to throughout the post because it is so fundamental. Tonight I will hopefully actually meet with my group; I want to do about half an hour of work beforehand, and a little afterwards too. This week I want to summarise the good regulator theorem as well as Turner's post on power-seeking.

comment by Jack O'Brien (jack-o-brien) · 2022-12-01T08:58:32.415Z

Let's be optimistic and prove that an agentic AI will be beneficial for the long-term future of humanity. We probably need to prove these three premises:

Premise 1: Training story X will create an AI model which approximates agent formalism A.
Premise 2: Agent formalism A is computable and has a set of alignment properties P.
Premise 3: An AI with the set of alignment properties P will be beneficial for the long-term future.

Aaand so far I'm not happy with our answers to any of these.
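
Schematically (my own notation, just to make the chain explicit):

$$P_1:\ \mathrm{Train}(X) \Rightarrow M \approx A \qquad P_2:\ A \text{ computable} \,\wedge\, A \models P \qquad P_3:\ M \models P \Rightarrow \mathrm{Beneficial}(M)$$

Note that the chain only closes if the approximation in Premise 1 preserves the properties $P$: we additionally need something like $M \approx A \,\wedge\, A \models P \Rightarrow M \models P$, and robustness of alignment properties under approximation is itself a nontrivial assumption.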

comment by Isabella Barber (isabella-barber) · 2022-12-01T10:02:33.298Z

Maybe there is no set of properties P that can produce alignment... hmm.