Posts

AI Alignment Research Engineer Accelerator (ARENA): Call for applicants v4.0 2024-07-06T11:34:57.227Z
Inner Alignment via Superpowers 2022-08-30T20:01:52.129Z
Finding Goals in the World Model 2022-08-22T18:06:48.213Z
The Core of the Alignment Problem is... 2022-08-17T20:07:35.157Z
Project proposal: Testing the IBP definition of agent 2022-08-09T01:09:37.687Z
Translating between Latent Spaces 2022-07-30T03:25:06.935Z
Formalizing Deception 2022-06-26T17:39:01.390Z

Comments

Comment by JamesH (AtlasOfCharts) on Singular learning theory: exercises · 2024-09-22T11:06:16.790Z · LW · GW

I think there's a mistake in 17: \sin(x) is not a diffeomorphism between (-\pi,\pi) and (-1,1) (since it is e.g. not bijective between these domains). Either you mean \sin(x/2), or the interval bounds should be (-\pi/2, \pi/2).
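
A quick check of this claim, written as a sketch rather than as part of the original exercise (the name f is introduced here only for illustration):

```latex
\begin{align*}
&\text{On } (-\pi,\pi):
  && \sin\tfrac{\pi}{6} = \sin\tfrac{5\pi}{6} = \tfrac{1}{2} \ \text{(not injective)},
  \quad \sin\tfrac{\pi}{2} = 1 \notin (-1,1),
  \quad \cos\!\left(\pm\tfrac{\pi}{2}\right) = 0. \\
&\text{Fix 1, } f(x) = \sin\tfrac{x}{2} \text{ on } (-\pi,\pi):
  && f'(x) = \tfrac{1}{2}\cos\tfrac{x}{2} > 0,
  \quad \lim_{x \to \pm\pi} f(x) = \pm 1,
  \quad f^{-1}(y) = 2\arcsin y. \\
&\text{Fix 2, } f(x) = \sin x \text{ on } \left(-\tfrac{\pi}{2},\tfrac{\pi}{2}\right):
  && f'(x) = \cos x > 0,
  \quad \lim_{x \to \pm\pi/2} f(x) = \pm 1,
  \quad f^{-1}(y) = \arcsin y.
\end{align*}
```

In both fixes the map is smooth and strictly increasing onto (-1,1), and its inverse is smooth because \arcsin is smooth on the open interval (-1,1).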

Comment by JamesH (AtlasOfCharts) on AI Alignment Research Engineer Accelerator (ARENA): Call for applicants v4.0 · 2024-07-10T09:08:58.997Z · LW · GW

ARENA might end up teaching this person some mech-interp methods they haven't seen before, although it sounds like they would be more than capable of self-teaching any mech-interp. The other potential value-add for your acquaintance would be if they wanted to improve their RL or Evals skills, and have a week to conduct a capstone project with advisors. If they were mostly aiming to improve their mech-interp ability by doing ARENA, there would probably be better ways to spend their time.

Comment by JamesH (AtlasOfCharts) on Project proposal: Testing the IBP definition of agent · 2022-08-09T18:27:36.488Z · LW · GW

The way we see this project going concretely looks something like:

First things first, we want to get a good enough theoretical grounding in IBP. This will ultimately result in something like a distillation of IBP that we will use as a reference, and that we hope others will get a lot of use from.

In this process, we will be doing most of our testing in a theoretical framework. That is to say, we will be constructing model agents and seeing how InfraBayesian Physicalism actually deals with these in theory, whether it breaks down at any stage (as judged by us), and if so whether we can fix or avoid those problems somehow.

What comes after this, as we see it at the moment, is trying to implement the principles of InfraBayesian Physicalism in a real-life, honest-to-god Inverse Reinforcement Learning proposal. We think IBP stands a good chance of being able to patch some of the largest problems in IRL, which should ultimately be demonstrable by actually making an IRL proposal that works robustly. (When this inevitably fails the first few times, we will probably return to step 1, having gained useful insights, and iterate.)