jono

Posts

Jono's Shortform 2025-01-10T11:07:05.462Z

Is anyone developing optimisation-robust interpretability methods? 2024-06-11T13:14:51.018Z

Closed-Source Evaluations 2024-06-08T14:18:40.800Z

AI demands unprecedented reliability 2024-01-09T16:30:12.095Z

Comments

Comment by Jono (lw-user0246) on AI Safety Memes Wiki · 2025-04-19T11:03:15.822Z · LW · GW

Very cool, I'm not seeing a table of contents on aisafety.info however.

Comment by Jono (lw-user0246) on Jono's Shortform · 2025-04-13T09:13:17.635Z · LW · GW

Thank you very much.
I imagined the forecasting AI to not be smart enough to be able to simulate a tiler that knows it is being simulated by us.
Perhaps that constraint is so large that the forecasts cannot be reliable.

Comment by Jono (lw-user0246) on Jono's Shortform · 2025-04-03T14:09:02.502Z · LW · GW

Does this help outer alignment?

Goal: tile the universe with niceness, without knowing what niceness is.

Method

We create:
- a bunch of formulations of what niceness is.
- a tiling AI, that given some description of niceness, tiles the universe with it.
- a forecasting AI, that given a formulation of niceness, a description of the tiling AI, a description of the universe and some coordinates in the universe, generates a prediction of what the part of the universe at the coordinates looks like after the tiling AI has tiled it with the formulation of niceness.

Following that, we feed our formulations of niceness into the forecasting AI, randomly sample some coordinates and evaluate whether the resulting predictions look nice.
From this we infer which formulations of niceness are truly nice.
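
A minimal sketch of that evaluation loop, under heavy assumptions: `forecast` and `looks_nice` are hypothetical stand-ins for the forecasting AI and for our own judgement, the descriptions of the tiling AI and of the universe are taken to be baked into `forecast`, and coordinates are simplified to three random numbers. It only shows the sampling-and-scoring structure, not how any of the parts would actually be built.

```python
import random
from typing import Callable, Dict, List, Tuple

Coordinates = Tuple[float, float, float]


def score_formulations(
    formulations: List[str],
    forecast: Callable[[str, Coordinates], str],  # hypothetical forecasting AI
    looks_nice: Callable[[str], bool],            # hypothetical human evaluation
    n_samples: int = 100,
    seed: int = 0,
) -> Dict[str, float]:
    """Return, per formulation, the fraction of sampled forecasts judged nice."""
    rng = random.Random(seed)
    scores: Dict[str, float] = {}
    for formulation in formulations:
        nice_count = 0
        for _ in range(n_samples):
            # Randomly sample a location in the (toy) universe.
            coords = (rng.random(), rng.random(), rng.random())
            # Ask the forecaster what that location looks like after tiling.
            prediction = forecast(formulation, coords)
            if looks_nice(prediction):
                nice_count += 1
        scores[formulation] = nice_count / n_samples
    return scores


# Toy stand-ins, purely illustrative.
def toy_forecast(formulation: str, coords: Coordinates) -> str:
    return f"{formulation} at {coords}"


def toy_judge(prediction: str) -> bool:
    return prediction.startswith("kind")


scores = score_formulations(["kind", "paperclips"], toy_forecast, toy_judge, n_samples=10)
```

A formulation with a high score is only "nice as far as we could tell from the samples", which is exactly the first weakness listed below.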

Weaknesses:
- Can we recognize utopia by randomly sampled predictions about parts of it?
- Our forecasting AI is orders of magnitude weaker than the tiling AI. Can formulations of niceness turn perverse when a smarter agent optimizes for them?

Strengths:
- Less need to solve the ELK problem.
- We have multiple tries at solving outer alignment.

Comment by Jono (lw-user0246) on Finishing The SB-1047 Documentary In 6 Weeks · 2025-03-25T12:26:33.117Z · LW · GW

I have looked around a bit and have not seen any updates since November, when the estimate was that this would be finished in early February.
Could you give another update, or link a more recent one if it exists?

Comment by Jono (lw-user0246) on Jono's Shortform · 2025-01-10T11:07:05.790Z · LW · GW

P(doom) can be approximately measured.
If reality fluid describes the territory well, we should be able to see close worlds that already died off.

For nuclear war we have some examples.
We can estimate the odds that the Cuban missile crisis and Petrov's decision went badly. If we accept that luck was a huge factor in us surviving those events (or not encountering events like it), we can see how unlikely our current world is to still live.
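
As a rough sketch of the arithmetic, assuming the events are independent (a strong assumption) and that we can put a probability on each one having gone badly: the chance of still being alive is the product of the per-event survival chances. The function name is hypothetical and the example inputs are placeholders, not estimates of any real event.

```python
from math import prod
from typing import Iterable


def survival_probability(catastrophe_probs: Iterable[float]) -> float:
    """Probability of surviving every event, assuming the events are independent."""
    probs = list(catastrophe_probs)
    for p in probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"not a probability: {p}")
    # Surviving all events means surviving each one in turn.
    return prod(1.0 - p for p in probs)


# Placeholder inputs only: two events that each had a coin-flip chance of going badly.
example = survival_probability([0.5, 0.5])
```

The more such events we count, and the higher their odds of going badly, the smaller this product gets.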

A high P(doom) implies that we are about to (or already did) encounter some very unlikely events that worked out suspiciously well for our survival. I don't know how public a registry of events like this should be, but it should exist.

Our self-reporting murderers or murder-witnesses should be extraordinarily well protected from leaks however, which in part seems like a software question.

Yes, this seems unlikely to happen, but again, if your P(doom) is high, then we only survive in unlikely worlds. Working on this, to me, seems dignified: a way to make those unlikely worlds a bit less unlikely.

Comment by Jono (lw-user0246) on What should I do? (long term plan about starting an AI lab) · 2024-06-11T13:38:25.742Z · LW · GW

I don't know if you have already, but this might be the time to take a long and hard look at the problem and consider whether deep learning is the key to solving it.

What is the problem?

reckless unilateralism? -> go work for policy or chip manufacturing
inability to specify human values? -> that problem looks not DL at all to me
powerful hackers stealing all the proto-AGIs in the next 4 years? -> go cybersec
deception? -> (why focus there? why make an AI that might deceive you in the first place?) but that's pretty ML, though I'm not sure interp is the way to go there
corrigibility? -> might be ML, though I'm not sure all theoretical squiggles are ironed out yet
OOD behavior? -> probably ML
multi-agent dynamics? -> probably ML

At the very least you ought to have a clear output channel if you're going to work with hazardous technology. Do you have the safety mindset that prevents you from having your dual-use tech on the streets? You're probably familiar with the abysmal safety / capabilities ratio of people working in the field. Any tech that helps safety as much as capability will therefore in practice help capability more, if you don't distribute it carefully.

I personally would want some organisation to step up to become the keeper of secrets. I'd want them to just go all-out on cybersec, have a web of trust and basically be the solution to the unilateralist's curse. That's not ML though.

I think this problem has a large ML part to it, but the problem is being tackled almost solely by ML people. I think whatever part of the problem can be tackled with ML won't necessarily benefit from having more ML people on it.

Comment by Jono (lw-user0246) on Closed-Source Evaluations · 2024-06-08T19:46:39.309Z · LW · GW

agreed

Comment by Jono (lw-user0246) on Shallow review of live agendas in alignment & safety · 2023-11-29T09:51:09.523Z · LW · GW

ai-plans.com aims to collect research agendas and have people comment on their strengths and vulnerabilities. The discord also occasionally hosts a critique-a-ton, where people discuss specific agendas.

Comment by Jono (lw-user0246) on AI as a science, and three obstacles to alignment strategies · 2023-10-27T09:47:25.739Z · LW · GW

We do not know, that is the relevant problem.

Looking at the output of a black box is insufficient. You can only know by putting the black box in power, or by deeply understanding it.
Humans are born into a world with others in power, so we know that most humans care about each other without knowing why.
AI has no history of demonstrating friendliness in the only circumstances where that can be provably found. We can only know in advance by way of thorough understanding.

A strong theory about AI internals should come first. Refuting Yudkowsky's theory about how it might go wrong is irrelevant.

Comment by Jono (lw-user0246) on The Löbian Obstacle, And Why You Should Care · 2023-09-08T08:28:44.190Z · LW · GW

Layman here 👋
IIUC we cannot trust the proof of an unaligned simulacrum's suggestion because it is smarter than us.
Would that be a non-issue if verifying the proof is easier than making it?
If we can know how hard it is to verify a proof without verifying it, then we can find a safe protocol for communicating with this simulacrum. Is this possible?

Comment by lw-user0246 on [deleted post] 2022-09-29T15:10:35.648Z

it might be the case that any kind of meaningful values would be reasonably encodable as answers to the question "what next set of MPIs should be instantiated?"

What examples of (meaningless) values are not answers to "What next set of MPIs should be instantiated?"

Comment by Jono (lw-user0246) on All AGI safety questions welcome (especially basic ones) [Sept 2022] · 2022-09-08T16:19:58.160Z · LW · GW

What does our world look like (a decade) after the deployment of a successfully aligned AGI?

Comment by Jono (lw-user0246) on AGI Safety FAQ / all-dumb-questions-allowed thread · 2022-06-10T06:50:48.471Z · LW · GW

Thank you plex, I was not aware of this wiki.
The pitch is nice, I'll incorporate it.

Comment by Jono (lw-user0246) on AGI Safety FAQ / all-dumb-questions-allowed thread · 2022-06-10T06:40:22.981Z · LW · GW

Why do I care if the people around me care about AI risk?

1. when AI is going to rule we'd like the people to somehow have some power I reckon.
I mean creating any superintelligence is a powergrab. Making one in secret is quite hostile, shouldn't people get a say or at least insight in what their future holds?

2. Nobody still really knows what we'd like the superint to do. I think an ML researcher is as capable of voicing their desires for the future as an artist. The field surely can benefit from interdisciplinary approaches.

3. As with nuclear war, I'm sure politicians will care more when the people care more. AI governance is a big point. Convincing AI devs to not make the superint seems easier when a big percentage of humanity is pressuring them not to do it.

4. Maybe this also extends to international relations. Seeing that the people of a democratic country care about the safety, makes the ventures from that country seem more reliable.

5. I get bummed out when nobody knows what I'm talking about.

Comment by Jono (lw-user0246) on AGI Safety FAQ / all-dumb-questions-allowed thread · 2022-06-09T13:31:16.872Z · LW · GW

I'm concerned with the ethics.
Is it wrong to doom speak to strangers? Is that the most effective thing here? I'd be lying if I said I was fine, but would it be best to tell them I'm "mildly concerned"?

How do I convey these grave emotions I have while maximally getting the people around me to care about mitigating AI risk?

Should I compromise on truth and downplay my concerns if that will get someone to care more? Should I expect people to be more receptive to the message of AI risk if I'm mild about it?

Comment by Jono (lw-user0246) on AGI Safety FAQ / all-dumb-questions-allowed thread · 2022-06-09T07:05:28.624Z · LW · GW

What do I tell the people who I know but can't spend lots of time with?

Clarification: How do I get relative strangers who converse with me IRL to maximally care about the dangers of AI?

Do I downplay my concerns such that they don't think I'm crazy?
Do I mention it every time I see them to make sure they don't forget?
Do I tolerate third parties butting in and making wrong statements?
Do I tell them to read up on it and pester them on whether they read it already?
Do I never mention it to laymen to avoid them propagating wrong memes?
Do I seek out and approach people who could be useful to the field or should I just forward or mention them to someone who can give the topic a better first impression than me?
