Jono's Shortform

post by Jono (lw-user0246) · 2025-01-10T11:07:05.462Z · LW · GW · 4 comments



comment by Jono (lw-user0246) · 2025-04-03T14:09:02.502Z · LW(p) · GW(p)

Does this help outer alignment?

Goal: tile the universe with niceness, without knowing what niceness is.

Method

We create:
- a bunch of formulations of what niceness is.
- a tiling AI that, given some description of niceness, tiles the universe with it.
- a forecasting AI that, given a formulation of niceness, a description of the tiling AI, a description of the universe, and some coordinates in the universe, predicts what the part of the universe at those coordinates looks like after the tiling AI has tiled it with that formulation of niceness.

We then feed each formulation of niceness into the forecasting AI, randomly sample some coordinates, and evaluate whether the resulting predictions look nice.
From this we infer which formulations of niceness are truly nice.
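
A minimal sketch of that selection loop in Python; `forecast`, `looks_nice`, and `sample_coordinates` are hypothetical stand-ins for the systems described above, not existing APIs, and the `universe["bounds"]` representation is assumed purely for illustration:

```python
import random

def sample_coordinates(universe, rng):
    # Assumed representation: universe["bounds"] is a list of (lo, hi)
    # intervals, one per dimension. Draw a uniform random point.
    return tuple(rng.uniform(lo, hi) for lo, hi in universe["bounds"])

def forecast(niceness, tiler, universe, coords):
    """Hypothetical forecasting AI: predict what the region at `coords`
    looks like after `tiler` has tiled `universe` with `niceness`."""
    raise NotImplementedError  # stands in for the forecasting AI

def looks_nice(prediction):
    """Hypothetical evaluation (human judgment) of one predicted region."""
    raise NotImplementedError  # stands in for human evaluation

def select_formulations(formulations, tiler, universe, n_samples=100, seed=0):
    """Keep only the formulations all of whose sampled regions look nice."""
    rng = random.Random(seed)
    survivors = []
    for niceness in formulations:
        predictions = (
            forecast(niceness, tiler, universe, sample_coordinates(universe, rng))
            for _ in range(n_samples)
        )
        if all(looks_nice(p) for p in predictions):
            survivors.append(niceness)
    return survivors
```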

Weaknesses:
- Can we recognize utopia from randomly sampled predictions about parts of it?
- Our forecasting AI is orders of magnitude weaker than the tiling AI. Can a formulation of niceness turn perverse when a smarter agent optimizes for it?

Strengths:
- Less need to solve the ELK problem.
- We have multiple tries at solving outer alignment.

Replies from: Viliam
comment by Viliam · 2025-04-04T07:21:29.904Z · LW(p) · GW(p)

I like the relative simplicity of this approach, but yeah, there is a risk that a tiling agent would produce (a more sophisticated version of) humans who have a permanent smile on their faces but feel horrible pain inside: something bad that would look convincingly good at first sight, good enough to fool the forecasting AI, or rather to fool the people who are programming and testing the forecasting AI.

Replies from: lw-user0246
comment by Jono (lw-user0246) · 2025-04-13T09:13:17.635Z · LW(p) · GW(p)

Thank you very much.
I imagined the forecasting AI as not being smart enough to simulate a tiler that knows it is being simulated by us.
Perhaps that constraint is so restrictive that the forecasts cannot be reliable.

comment by Jono (lw-user0246) · 2025-01-10T11:07:05.790Z · LW(p) · GW(p)

P(doom) can be approximately measured. 
If reality fluid describes the territory well, we should be able to see close worlds that already died off.

For nuclear war we have some examples.
We can estimate the odds that the Cuban missile crisis or Petrov's decision went badly. If we accept that luck was a huge factor in our surviving those events (or in not encountering more events like them), we can see how unlikely our current world is to still be alive.
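
As a toy calculation (the per-event odds below are assumed purely for illustration, not actual estimates): if near-miss i independently had probability p_i of ending the world, the chance that a world like ours survived all of them is the product of the (1 - p_i) terms.

```python
# Toy numbers, assumed purely for illustration -- not actual estimates.
near_miss_odds = {
    "Cuban missile crisis": 0.3,  # P(this event ended the world)
    "Petrov incident": 0.2,
}

p_survive = 1.0
for event, p_doom in near_miss_odds.items():
    p_survive *= 1.0 - p_doom

print(f"P(surviving all listed near-misses) = {p_survive:.2f}")
# 0.7 * 0.8 = 0.56, i.e. on these numbers roughly 44% of close
# worlds already died off at one of these events.
```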

A high P(doom) implies that we are about to encounter (or already encountered) some very unlikely events that worked out suspiciously well for our survival. I don't know how public a registry of such events should be, but it should exist.

However, our self-reporting murderers or murder-witnesses should be extraordinarily well protected from leaks, which in part seems like a software question.

Yes, this seems unlikely to happen, but again, if your P(doom) is high, then we only survive in unlikely worlds. Working on this seems dignified to me: a way to make those unlikely worlds a bit less unlikely.