Posts

Inner Alignment via Superpowers 2022-08-30T20:01:52.129Z
Finding Goals in the World Model 2022-08-22T18:06:48.213Z
The Core of the Alignment Problem is... 2022-08-17T20:07:35.157Z
Project proposal: Testing the IBP definition of agent 2022-08-09T01:09:37.687Z
Translating between Latent Spaces 2022-07-30T03:25:06.935Z
Formalizing Deception 2022-06-26T17:39:01.390Z

Comments

Comment by JamesH (AtlasOfCharts) on Project proposal: Testing the IBP definition of agent · 2022-08-09T18:27:36.488Z · LW · GW

Concretely, we see this project going something like this:

First things first, we want to build up a solid theoretical background in IBP. This will ultimately result in something like a distillation of IBP, which we will use as a reference and which we hope others will get a lot of use out of as well.

In this process, most of our testing will be theoretical. That is to say, we will construct model agents and see how Infra-Bayesian Physicalism actually handles them in theory, whether it breaks down at any stage (as judged by us), and if so whether we can fix or avoid those problems somehow.

What comes after this, as we see it at the moment, is trying to implement the principles of Infra-Bayesian Physicalism in a real-life, honest-to-god Inverse Reinforcement Learning proposal. We think IBP stands a good chance of patching some of the largest problems in IRL, which should ultimately be demonstrable by actually making an IRL proposal that works robustly. (When this inevitably fails the first few times, we will probably return to step 1, having gained useful insights, and iterate.)
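To make the IRL side of this more concrete: the comment doesn't specify what the eventual proposal would look like, and IBP itself has no reference implementation, so the following is only a minimal toy sketch of ordinary Bayesian IRL (inferring a reward function from demonstrated behaviour), not IBP and not the project's actual design. The chain MDP, candidate rewards, and Boltzmann-rationality assumption are all illustrative choices of ours.

```python
# Toy Bayesian IRL sketch: infer which candidate reward function best explains
# observed behaviour in a 3-state chain MDP. Everything here is an illustrative
# assumption, not part of the project's actual proposal.
import numpy as np

n_states, n_actions = 3, 2          # actions: 0 = stay, 1 = move right
candidate_rewards = [
    np.array([1.0, 0.0, 0.0]),      # hypothesis A: leftmost state is rewarding
    np.array([0.0, 0.0, 1.0]),      # hypothesis B: rightmost state is rewarding
]

def q_values(reward, gamma=0.9, iters=50):
    """Value iteration for the deterministic chain MDP."""
    q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        v = q.max(axis=1)
        for s in range(n_states):
            for a in range(n_actions):
                s_next = min(s + a, n_states - 1)
                q[s, a] = reward[s_next] + gamma * v[s_next]
    return q

def likelihood(trajectory, reward, beta=5.0):
    """Probability of the observed (state, action) pairs under a
    Boltzmann-rational policy for this candidate reward."""
    q = q_values(reward)
    p = 1.0
    for s, a in trajectory:
        policy = np.exp(beta * q[s]) / np.exp(beta * q[s]).sum()
        p *= policy[a]
    return p

# Observed demonstration: the agent keeps moving right.
trajectory = [(0, 1), (1, 1), (2, 0)]

prior = np.array([0.5, 0.5])
posterior = np.array([likelihood(trajectory, r) for r in candidate_rewards]) * prior
posterior /= posterior.sum()
print("posterior over reward hypotheses:", posterior)  # mass shifts toward hypothesis B
```

The well-known failure modes of this kind of setup (misspecified rationality assumptions, unidentifiable rewards) are the sorts of problems the comment hopes IBP could help patch.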