Posts

Paul W's Shortform 2025-04-15T15:10:35.296Z
Why I find Davidad's plan interesting 2024-05-20T08:13:15.950Z

Comments

Comment by Paul W on Paul W's Shortform · 2025-04-15T15:10:35.296Z · LW · GW

The Von Neumann-Morgenstern paradigm allows for binary utility functions, i.e. functions that are equal to 1 on some event/(measurable) set of outcomes, and to 0 on the complement. Said event could be, for instance "no global catastrophe for humanity in time period X".
Of course, you can implement some form of deontology by multiplying such a binary utility function by something like exp(-number of bad actions you take).
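To spell that out (notation mine, nothing standard): writing A for the event in question and N_bad(ω) for the number of bad actions taken along outcome ω, the two utility functions above would be

```latex
% Binary utility: 1 on the "good" event A, 0 on its complement
U(\omega) = \mathbf{1}_{A}(\omega), \qquad A = \{\text{no global catastrophe for humanity in time period } X\}

% Deontological variant: damp the utility by the number of bad actions taken
\tilde{U}(\omega) = \mathbf{1}_{A}(\omega)\,\exp\!\bigl(-N_{\mathrm{bad}}(\omega)\bigr)
```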

Any thoughts on this observation?

Comment by Paul W on What convincing warning shot could help prevent extinction from AI? · 2025-04-05T15:53:59.966Z · LW · GW

When you say "maybe we should be assembling like minded and smart people [...]", do you mean "maybe"? Or do you mean "Yes, we should definitely do that ASAP"?

Comment by Paul W on Conceptual Rounding Errors · 2025-03-30T10:44:54.809Z · LW · GW

Have you noticed that you keep encountering the same ideas over and over? You read another post, and someone helpfully points out it's just old Paul's idea again. Or Eliezer's idea. Not much progress here, move along.

Or perhaps you've been on the other side: excitedly telling a friend about some fascinating new insight, only to hear back, "Ah, that's just another version of X." And something feels not quite right about that response, but you can't quite put your finger on it.


Some questions regarding these contexts:

-Is it true that you can deduce that "not much progress" is being made? In (pure) maths, it is sometimes very useful to be able to connect two points of view/notions (e.g. different (co)homology theories, to name the most obvious example that comes to mind).

-What is the goal of such interactions? Is it truly to point out relevant related work? To dismiss other people's ideas for {political/tribal/ego-related} motives? Other?

As for possible fixes:

-Maintain a collective {compendium/graph/whatever data structure is relevant} of important concepts, with precise enough definitions, and comparison information (examples and/or theoretical arguments) between similar, but not identical, ideas.

Or rather: acknowledging that the AI Safety community(ies) is/are terrible at coordination, devise a way of combining/merging such {compendia/graphs/whatever}, for it is unlikely that only one emerges... (a rough sketch of what such a structure could look like is below)
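A minimal sketch of what such a structure could look like (all names and design choices mine, purely illustrative; in particular, the merge below is naive and a real tool would need proper identity resolution between concepts):

```python
from dataclasses import dataclass, field

@dataclass
class Concept:
    name: str
    definition: str
    examples: list[str] = field(default_factory=list)

@dataclass
class ConceptGraph:
    concepts: dict[str, Concept] = field(default_factory=dict)
    # Edges map a pair of concept names to a short comparison note
    # ("similar but not identical because ...").
    comparisons: dict[tuple[str, str], str] = field(default_factory=dict)

    def add(self, concept: Concept) -> None:
        self.concepts[concept.name] = concept

    def relate(self, a: str, b: str, note: str) -> None:
        self.comparisons[(a, b)] = note

    def merge(self, other: "ConceptGraph") -> "ConceptGraph":
        """Naive merge of two compendia: keep everything, flag name collisions for human review."""
        merged = ConceptGraph(dict(self.concepts), dict(self.comparisons))
        for name, concept in other.concepts.items():
            if name in merged.concepts and merged.concepts[name].definition != concept.definition:
                # A real tool should surface this conflict instead of silently overwriting.
                merged.concepts[name + " (variant)"] = concept
            else:
                merged.concepts[name] = concept
        merged.comparisons.update(other.comparisons)
        return merged
```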

Comment by Paul W on Good Research Takes are Not Sufficient for Good Strategic Takes · 2025-03-28T10:04:33.823Z · LW · GW

Strong upvote. Slightly worried by the fact that this wasn't written, in some form, earlier (maybe I missed a similar older post?).

I think we[1] can, and should, go even further:


-Find a systematic/methodical way of identifying which people are really good at strategic thinking, and help them use their skills in relevant work; maybe try to hire from outside the usual recruitment pools. 

If deemed feasible (in a short enough amount of time): train some people mainly on strategy, so as to get a supply of better strategists.

-Encourage people to state their incompetence in some domains (except maybe in cases where it makes for bad PR) / embrace the idea of specialization and division of labour more: maybe high-level strategists don't need as much expertise on the technical details, only the ability to see which phenomena matter (assuming domain experts are able to communicate well enough).


  1. ^

    say, the people who care about preventing catastrophic events, in a broad sense

Comment by Paul W on Elicitation for Modeling Transformative AI Risks · 2025-03-28T09:33:25.250Z · LW · GW

Hi! 

Have you heard of the ModelCollab and CatColab projects? It seems that there is an interesting overlap with what you want to do!

More generally, people at the Topos Institute work on related questions of collaborative modelling and collective intelligence:


https://topos.institute/work/collective-intelligence/

https://topos.institute/work/collaborative-modelling/

https://www.localcharts.org/t/positive-impact-of-algebraicjulia/6643

There's a website for sharing world-modelling ideas, run by Owen Lynch (who works at Topos UK):

https://www.localcharts.org/t/localcharts-is-live/5714


For instance, they have a paper on task delegation.


Their work uses somewhat advanced maths, but I think it is justified by the ambition: to develop general tools for creating and combining models. They seem to make an effort to popularise these, so that non-mathematicians can get something out of their work.

Comment by Paul W on Emergence, The Blind Spot of GenAI Interpretability? · 2024-08-17T17:51:20.483Z · LW · GW

Are you saying that holistic/higher-level approaches can be useful because they are very likely to be more computationally efficient, actually fit inside human brains, and do not require as much data?
Is that the main point, or did I miss something?

Comment by Paul W on Natural Latents: The Concepts · 2024-08-02T19:03:09.250Z · LW · GW

Hello!
These ideas seem interesting, but there's something that bothers me: in the coin flip example, how is 3 fundamentally different from 1000? The way I see it, the only mathematical difference is that your "bounds" (whatever that means) are simply much worse in the case with 3 coins. Of course, I think I understand why humans/agents would want to say "the case with 3 flips is different from that with 1000", but the mathematics seem similar to me.
Am I missing something?
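To illustrate what I mean by "much worse bounds", here is a generic concentration inequality (Hoeffding) for estimating the bias p by the empirical frequency of n flips; this is not necessarily the bound used in the post, just the qualitative gap I have in mind:

```latex
P\bigl(|\hat{p}_n - p| \ge \varepsilon\bigr) \le 2\exp(-2 n \varepsilon^2)

% With \varepsilon = 0.1:
%   n = 1000:  2 e^{-20} \approx 4 \times 10^{-9}   (a real guarantee)
%   n = 3:     2 e^{-0.06} \approx 1.9              (vacuous, since probabilities are already \le 1)
```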

Comment by Paul W on Davidad's Bold Plan for Alignment: An In-Depth Explanation · 2024-05-18T16:04:12.025Z · LW · GW

"Is the field advanced enough that it would be feasible to have a guaranteed no-zero-day evaluation and deployment codebase that is competitive with a regular codebase?"

As far as I know (I'm not an expert), such absolute guarantees are too hard right now, especially if the AI you're trying to verify is arbitrarily complex. However, the training process ought to yield an AI with specific properties. I'm not entirely sure I got what you meant by "a guaranteed no-zero-day evaluation and deployment codebase". Would you mind explaining more?

"Or is the claim that it's feasible to build a conservative world model that tells you "maybe a zero-day" very quickly once you start doing things not explicitly within a dumb world model?" 

I think that's closer to the idea: you {reject and penalize, during training} as soon as the AI tries something that might be "exploiting a zero-day", in the sense that the world-model can't rule out this possibility with high confidence[1]. That way, the training process is expected to reward simpler, more easily verified actions.
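A cartoon version of what I have in mind (purely illustrative: the names `world_model.prob_no_exploit`, the threshold and the penalty are all mine, not anything from davidad's actual proposal):

```python
CONFIDENCE_THRESHOLD = 0.999  # how confident the world-model must be that no exploit is involved
PENALTY = 10.0                # illustrative penalty magnitude

def shaped_reward(action, world_model, task_reward: float) -> float:
    """Training-time reward: penalize any action the world-model cannot
    confidently certify as 'not exploiting a zero-day'."""
    p_no_exploit = world_model.prob_no_exploit(action)  # hypothetical query to the world-model
    if p_no_exploit < CONFIDENCE_THRESHOLD:
        # Reject/penalize: the exploit cannot be ruled out with high confidence,
        # even if the action would in fact have been harmless.
        return task_reward - PENALTY
    return task_reward
```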


Then, a key question is "what else do you want from your AI?": of course, it is supposed to perform critical tasks, not just "let you see what program is running"[2], so there is tension between the various specifications you enter. The question of how far you can actually go, how much you can actually ask for, is both crucial and wide open, as far as I can tell.

  1. ^

    Some of the uncertainty lies in how accurate and how conservative the world-model is; you won't get a "100% guarantee" anyway, especially since you're only aiming for probabilistic bounds within the model.

  2. ^

    Otherwise, a sponge would do.

Comment by Paul W on Davidad's Bold Plan for Alignment: An In-Depth Explanation · 2024-05-16T08:25:03.873Z · LW · GW

I believe that the current trends for formal verification, say, of traditional programs or small neural networks, are more about conservative overapproximations (as in abstract interpretation). You might want to have a look at this survey: https://caterinaurban.github.io/pdf/survey.pdf
To be more precise, it appears that so-called "incomplete formal methods" (section 3.1.1.2 in the survey I linked) are more computationally efficient, even though they can produce false negatives (i.e. fail to certify a property that does in fact hold).
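To give a toy example of the kind of conservative overapproximation involved (interval arithmetic, the simplest abstract domain; a hand-rolled illustration, not how the tools surveyed there are actually implemented):

```python
from dataclasses import dataclass

@dataclass
class Interval:
    lo: float
    hi: float

    def __add__(self, other: "Interval") -> "Interval":
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, other: "Interval") -> "Interval":
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))

# Concrete function f(x) = x*x + x, run abstractly on a whole interval of inputs at once.
def f_abstract(x: Interval) -> Interval:
    return x * x + x

print(f_abstract(Interval(-1.0, 1.0)))  # Interval(lo=-2.0, hi=2.0)
# The true range of f on [-1, 1] is [-0.25, 2]: the analysis is sound (no behaviour is missed)
# but imprecise, so it cannot prove e.g. "f(x) >= -0.5" even though that property actually holds.
```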
Does that answer your question?