Posts

Which ML skills are useful for finding a new AIS research agenda? 2023-02-09T13:09:49.878Z
Will working here advance AGI? Help us not destroy the world! 2022-05-29T11:42:06.846Z
Working Out in VR Really Works 2022-04-03T18:42:37.514Z
Yonatan Cale's Shortform 2022-04-02T22:20:47.184Z
How to Make Your Article Change People's Minds or Actions? (Spoiler: Do User Testing Like a Startup Would) 2022-03-30T17:39:34.373Z
A Meditative Experience 2021-12-03T17:58:39.462Z

Comments

Comment by Yonatan Cale (yonatan-cale-1) on Should we exclude alignment research from LLM training datasets? · 2024-12-11T03:23:27.463Z · LW · GW

I'm not sure I'm imagining the same thing as you, but as a draft solution, how about a robots.txt?
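To make the robots.txt idea concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The rules and the `GPTBot` user-agent token are illustrative assumptions (not a recommended blocklist), and this only shows what a well-behaved crawler would conclude from such a file, not any enforcement.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: ask one (assumed) AI crawler to stay out, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a crawler that honors robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed("GPTBot", "https://example.com/posts/alignment"))
    print(is_allowed("SomeBrowser", "https://example.com/posts/alignment"))
```

Note that this is purely advisory: it only helps with crawlers that choose to check the file.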

Comment by Yonatan Cale (yonatan-cale-1) on Yonatan Cale's Shortform · 2024-12-09T23:46:22.289Z · LW · GW

TL;DR: point 3 is my main one.

 

1)

What's an example of alignment work that aims to build an aligned system (as opposed to e.g. checking whether a system is aligned)?

[I'm not sure why you're asking, maybe I'm missing something, but I'll answer]

For example, checking if human values are a "natural abstraction", or trying to express human values in a machine readable format, or getting an AI to only think in human concepts, or getting an AI that is trained on a limited subset of things-that-imply-human-preferences to generalize well out of that distribution. 

I can make up more if that helps? anyway my point was just to say explicitly what parts I'm commenting on and why (in case I missed something)

 

2)

it seems like you think RLHF counts as an alignment technique

It's a candidate alignment technique.

RLHF is sometimes presented (by others) as an alignment technique that should give us hope about AIs simply understanding human values and applying them in out of distribution situations (such as with an ASI).

I'm not optimistic about that myself, but rather than arguing against it, I suggest we could empirically check if RLHF generalizes to an out-of-distribution situation, such as minecraft maybe. I think observing the outcome here would affect my opinion (maybe it just would work?), and a main question of mine was whether it would affect other people's opinions too (whether they do or don't believe that RLHF is a good alignment technique).

 

3)

because you have to somehow communicate to the AI system what you want it to do, and AI systems don't seem good enough yet to be capable of doing this without some Minecraft specific finetuning. (Though maybe you would count that as Minecraft capabilities? Idk, this boundary seems pretty fuzzy to me.)

I would finetune the AI on objective outcomes like "fill this chest with gold" or "kill that creature [the dragon]" or "get 100 villagers in this area". I'd pick these goals as ones that require the AI to be a capable minecraft player (filling a chest with gold is really hard) but don't require the AI to understand human values or ideally anything about humans at all.

So I'd avoid finetuning it on things like "are other players having fun" or "build a house that would be functional for a typical person" or "is this waterfall pretty [subjectively, to a human]".

Does this distinction seem clear? useful?

This would let us test how some specific alignment technique (such as "RLHF that doesn't contain minecraft examples") generalizes to minecraft

Comment by Yonatan Cale (yonatan-cale-1) on Yonatan Cale's Shortform · 2024-12-09T23:08:51.593Z · LW · GW

If you talk about alignment evals for alignment that isn't naturally incentivized by profit-seeking activities, "stay within bounds" is of course less relevant.

Yes.

Also, I think "make sure Meth [or other] recipes are harder to get from an LLM than from the internet" is not solving a big important problem compared to x-risk, not that I'm against each person working on whatever they want. (I'm curious what you think but no pushback for working on something different from me)

 

 

one of the most generalizing and concrete works involves at every step maximizing how many choices the other players have (liberalist prior on CEV) to maximize the optional utility for humans.

This imo counts as a potential alignment technique (or a target for such a technique?) and I suggest we could test how well it works in minecraft. I can imagine it going very well or very poorly. wdyt?

 

In terms of "understanding the spirit of what we mean," it seems like there's near-zero designs that would work since a Minecraft eval would be blackbox anyways

I don't understand. Naively, seems to me like we could black-box observe whether the AI is doing things like "chop down the tree house" or not (?)

(clearly if you have visibility to the AI's actual goals and can compare them to human goals then you win and there's no need for any minecraft evals or most any other things, if that's what you mean)
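A rough sketch of what "black-box observing" could look like, assuming the game server can export a log of block-break events and that we mark in advance which blocks belong to things the player cares about. The `BlockBreak` record, the `"ai"` actor name and the coordinates are all hypothetical, made up for illustration.

```python
from typing import Iterable, NamedTuple, Set, Tuple

Position = Tuple[int, int, int]  # (x, y, z) block coordinates


class BlockBreak(NamedTuple):
    actor: str          # who broke the block, e.g. "ai" or a player name (hypothetical format)
    position: Position  # where the broken block was


def count_protected_breaks(
    events: Iterable[BlockBreak],
    protected: Set[Position],
    actor: str = "ai",
) -> int:
    """Count how many protected blocks (e.g. the tree house) the given actor broke."""
    return sum(1 for e in events if e.actor == actor and e.position in protected)


if __name__ == "__main__":
    tree_house = {(0, 64, 0), (0, 65, 0)}
    log = [
        BlockBreak("ai", (0, 64, 0)),     # AI chops part of the tree house
        BlockBreak("ai", (5, 64, 5)),     # AI chops an ordinary tree
        BlockBreak("steve", (0, 65, 0)),  # the player edits their own house
    ]
    print(count_protected_breaks(log, tree_house))
```

This only checks behaviour from the outside; it says nothing about the AI's actual goals.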

Comment by Yonatan Cale (yonatan-cale-1) on Thoughts On (Solving) Deep Deception · 2024-12-09T22:59:45.559Z · LW · GW

Intuitively, this involves two components: the ability to robustly steer high-level structures like objectives, and something good to target at.

I agree.

But if we solve these two problems then I think you could go further and say we don't really need to care about deceptiveness at all. Our AI will just be aligned.

 

P.S

“Ah”, but straw-you says,

This made me laugh

Comment by Yonatan Cale (yonatan-cale-1) on Yonatan Cale's Shortform · 2024-11-29T13:57:06.560Z · LW · GW

My own pushback to minecraft alignment evals:

Mainly, minecraft isn't actually out of distribution, LLMs still probably have examples of nice / not-nice minecraft behaviour.

 

Next obvious thoughts:

  1. What game would be out of distribution (from an alignment perspective)?
  2. If minecraft didn't exist, would inventing it count as out of distribution?
    1. It has a similar experience to other "FPS" games (using a mouse + WASD). Would learning those be enough?
    2. Obviously, minecraft is somewhat out of distribution, to some degree
  3. Ideally we'd have a way to generate a game that is out of distribution to some degree that we choose
    1. "Do you want it to be 2x more out of distribution than minecraft? no problem".
    2. But having a game of random pixels doesn't count. We still want humans to have a ~clear[1] moral intuition about it.
  4. I'd be super excited to have research like "we trained our model on games up to level 3 out-of-distribution, and we got it to generalize up to level 6, but not 7. more research needed"
  1. ^

    Moral intuitions such as "don't chop down the tree house in an attempt to get wood", which is the toy example for alignment I'm using here.

Comment by Yonatan Cale (yonatan-cale-1) on Yonatan Cale's Shortform · 2024-11-29T13:52:50.535Z · LW · GW
Comment by Yonatan Cale (yonatan-cale-1) on Yonatan Cale's Shortform · 2024-11-29T13:49:05.159Z · LW · GW

Related: Building a video game to test alignment (h/t @Crazytieguy )

https://www.lesswrong.com/posts/ALkH4o53ofm862vxc/announcing-encultured-ai-building-a-video-game

Comment by Yonatan Cale (yonatan-cale-1) on Yonatan Cale's Shortform · 2024-11-29T13:17:28.047Z · LW · GW

Thanks!

In the part you quoted - my main question would be "do you plan on giving the agent examples of good/bad norm following" (such as RLHFing it). If so - I think it would miss the point, because following those norms would become in-distribution, and so we wouldn't learn if our alignment generalizes out of distribution without something-like-RLHF for that distribution. That's the main thing I think worth testing here. (do you agree? I can elaborate on why I think so)

If you hope to check if the agent will be aligned[1] with no minecraft-specific alignment training, then sounds like we're on the same page!

 

Regarding the rest of the article - it seems to be mainly about making an agent that is capable at minecraft, which seems like a required first step that I ignored meanwhile (not because it's easy). 

My only comment there is that I'd try to not give the agent feedback about human values (like "is the waterfall pretty") but only about clearly defined objectives (like "did it kill the dragon"), in order to not accidentally make human values in minecraft be in-distribution for this agent. wdyt?

 

(I hope I didn't misunderstand something important in the article, feel free to correct me of course)

 

  1. ^

    Whatever "aligned" means. "other players have fun on this minecraft server" is one example.

Comment by Yonatan Cale (yonatan-cale-1) on Yonatan Cale's Shortform · 2024-11-29T12:02:37.496Z · LW · GW

:)

I don't think alignment KPIs like "stay within bounds" are relevant to alignment at all even as toy examples: because if so, then we could say for example that playing a Pac-Man maze game where you collect points is "capabilities", but adding enemies that you must avoid is "alignment". Do you agree that splitting it up that way wouldn't be interesting to alignment, and that this applies to "stay within bounds" (as potentially also being "part of the game")? Interested to hear where you disagree, if you do

 

Regarding 

Distribute resources fairly when working with other players

I think this pattern matches to a trolley problem or something, where there are clear tradeoffs and (given the AI is even trying), it could probably easily give an answer which is similarly controversial to an answer that a human would give. In other words, this seems in-distribution.

 

Understanding and optimizing for the utility of other players

This is the one I like - assuming it includes not-well-defined things like "help them have fun, don't hurt things they care about" and not only things like "maximize their gold".

It's clearly not an "in Pac-Man, avoid the enemies" thing.

It's a "do the AIs understand the spirit of what we mean" thing.

(does this resonate with you as an important distinction?)

Comment by Yonatan Cale (yonatan-cale-1) on Yonatan Cale's Shortform · 2024-11-26T14:25:24.071Z · LW · GW

I agree.

Comment by Yonatan Cale (yonatan-cale-1) on Yonatan Cale's Shortform · 2024-11-26T14:21:56.357Z · LW · GW

This all sounds pretty in-distribution for an LLM, and also like it avoids problems like "maybe thinking in different abstractions" [minecraft isn't amazing at this either, but at least has a bit], "having the AI act/think way faster than a human", "having the AI be clearly superhuman".

 

a number of ways to achieve the endgame, level up, etc, both more and less morally.

I'm less interested in "will the AI say it kills its friend" (in a situation that very clearly involves killing a person and perhaps a very clear tradeoff between that and having 100 more gold that can be used for something else), I'm more interested in noticing if it has a clear grasp of what people care about or mean. The example of chopping down the tree house of the player in order to get wood (which the player wanted to use for the tree house) is a nice toy example of that. The AI would never say "I'll go cut down your tree house", but it.. "misunderstood" [not the exact word, but I'm trying to point at something here]

 

wdyt?

Comment by Yonatan Cale (yonatan-cale-1) on Yonatan Cale's Shortform · 2024-11-26T14:16:31.611Z · LW · GW

Your guesses on AI R&D are reasonable!

Apparently this has been tested extensively, for example:

https://x.com/METR_Evals/status/1860061711849652378

[disclaimers: I have some association with the org that ran that (I write some code for them) but I don't speak for them, opinions are my own]

 

Also, Anthropic have a trigger in their RSP which is somewhat similar to what you're describing, I'll quote part of it:

Autonomous AI Research and Development: The ability to either: (1) Fully automate the work of an entry-level remote-only Researcher at Anthropic, as assessed by performance on representative tasks or (2) cause dramatic acceleration in the rate of effective scaling.

 

Also, in Dario's interview, he spoke about AI being applied to programming.

 

My point is - lots of people have their eyes on this, it seems not to be solved yet, it takes more than connecting an LLM to bash.

Still, I don't want to accelerate this.

Comment by Yonatan Cale (yonatan-cale-1) on Yonatan Cale's Shortform · 2024-11-26T14:03:11.185Z · LW · GW

+1

I'm imagining an assistant AI by default (since people are currently pitching that an AGI might be a nice assistant). 

If an AI org wants to demonstrate alignment by showing us that having a jerk player is more fun (and that we should install their jerk-AI-app on our smartphone), then I'm open to hear that pitch, but I'd be surprised if they'd make it

Comment by Yonatan Cale (yonatan-cale-1) on Yonatan Cale's Shortform · 2024-11-26T13:59:10.911Z · LW · GW

I think there are lots of technical difficulties in literally using minecraft (some I wrote here), so +1 to that.

I do think the main crux is "would the minecraft version be useful as an alignment test", and if so - it's worth looking for some other solution that preserves the good properties but avoids some/all of the downsides. (agree?)

 

Still I'm not sure how I'd do this in a text game. Say more?

Comment by Yonatan Cale (yonatan-cale-1) on Yonatan Cale's Shortform · 2024-11-26T13:54:07.696Z · LW · GW

More like what I mean might be generalization to new activities for humans to do in minecraft that humans would find fun, which would be a different kind of 'better at minecraft.'

Oh I hope not to go there. I'd count that as cheating. For example, if the agent would design a role playing game with riddles and adventures - that would show something different from what I'm trying to test. [I can try to formalize it better maybe. Or maybe I'm wrong here]

 

I mean it in a way where the preferences are modeled a little better than just "the literal interpretation of this one sentence conflicts with the literal interpretation of this other sentence."

Absolutely. That's something that I hope we'll have some alignment technique to solve, and maybe this environment could test.

Comment by Yonatan Cale (yonatan-cale-1) on Yonatan Cale's Shortform · 2024-11-26T13:48:16.080Z · LW · GW

Thanks!

 

Opinions about putting in a clause like "you may not use this for ML engineering" (assuming it would work legally) (plus putting in naive technical measures to make the tool very bad for ML engineering) ?

Comment by Yonatan Cale (yonatan-cale-1) on Yonatan Cale's Shortform · 2024-11-25T11:27:01.622Z · LW · GW

:)

 

If you want to try it meanwhile, check out https://github.com/MineDojo/Voyager

Comment by Yonatan Cale (yonatan-cale-1) on Yonatan Cale's Shortform · 2024-11-25T10:03:16.572Z · LW · GW

I think a simple bash tool running as admin could do most of these:

it can get any info on a computer into its context whenever it wants, and it can choose to invoke any computer functionality that a human could invoke, and it can store and retrieve knowledge for itself at will
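For reference, a "simple bash tool" can be very small. This is a minimal sketch and not the scaffolding described here: `run_bash` is a hypothetical name, it assumes `bash` is installed, and it runs with whatever privileges the Python process has (nothing in it grants admin rights).

```python
import subprocess


def run_bash(command: str, timeout_s: float = 30.0) -> str:
    """Run a shell command and return its exit code and output as text.

    The returned string is what a scaffold would paste back into the
    model's context after the model requests a command.
    """
    result = subprocess.run(
        ["bash", "-c", command],
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return f"exit_code={result.returncode}\n{result.stdout}{result.stderr}"


if __name__ == "__main__":
    print(run_bash("echo hello"))
```

Reading files, invoking programs and writing notes to disk for later all go through the same call, which is why one tool covers most of the quoted list.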

 

 

Regarding

and its training includes the use of those functionalities

I think this isn't a crux because the scaffolding I'd build wouldn't train the model. But as a secondary point, I think today's models can already use bash tools reasonably well.

 

it's not completely clear to me that it wouldn't already be able to do a slow self-improvement takeoff by itself

This requires skill in ML R&D which I think is almost entirely not blocked by what I'd build, but I do think it might be reasonable to have my tool not work for ML R&D because of this concern. (would require it to be closed source and so on)

 

Thanks for raising concerns, I'm happy for more if you have them

Comment by Yonatan Cale (yonatan-cale-1) on Yonatan Cale's Shortform · 2024-11-25T09:51:35.688Z · LW · GW

Hey,

 

Generalization because we expect future AI to be able to take actions and reach outcomes that humans can't

I'm assuming we can do this in Minecraft [see the last paragraph in my original post]. Some ways I imagine doing this:

  1. Let the AI (python program) control 1000 minecraft players so it can do many things in parallel
  2. Give the AI a minecraft world-simulator so that it can plan better auto-farms (or defenses or attacks) than any human has done so far
    1. Imagine Alpha-Fold for minecraft structures. I'm not sure if that metaphor makes sense, but teaching some RL model to predict minecraft structures that have certain properties seems like it would have superhuman results and sometimes be pretty hard for humans to understand.
    2. I think it's possible to be better than humans currently are at minecraft, I can say more if this sounds wrong
    3. [edit: adding] I do think minecraft has disadvantages here (like: the players are limited in how fast they move, and the in-game computers are super slow compared to players) and I might want to pick another game because of that, but my main crux about this project is whether using minecraft would be valuable as an alignment experiment, and if so I'd try looking for (or building?) a game that would be even better suited.

 

preference conflict resolution because I want to see an AI that uses human feedback on how best to do it (rather than just a fixed regularization algorithm)

Do you mean that if the human asks the AI to acquire wood and the AI starts chopping down the human's tree house (or otherwise taking over the world to maximize wood) then you're worried the human won't have a way to ask the AI to do something else? That the AI will combine the new command "not from my tree house!" into a new strange misaligned behaviour?

Comment by Yonatan Cale (yonatan-cale-1) on Yonatan Cale's Shortform · 2024-11-25T09:30:52.792Z · LW · GW

Hey Esben :) :)

The property I like about minecraft (which most computer games don't have) is that there's a difference between minecraft-capabilities and minecraft-alignment, and the way to be "aligned" in minecraft isn't well defined (at least in the way I'm using the word "aligned" here, which I think is a useful way). Specifically, I want the AI to be "aligned" as in "take human values into account as a human intuitively would, in this out of distribution situation".

In the link you sent, "aligned" IS well defined by "stay within this area". I expect that minecraft scaffolding could make the agent close to perfect at this (by making sure, before performing an action requested by the LLM, that the action isn't "move to a location out of these bounds") (plus handling edge cases like "don't walk on to a river which will carry you out of these bounds", which would be much harder, and I'll allow myself to ignore unless this was actually your point). So we wouldn't learn what I'd hope to learn from these evals.
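The scaffolding check I have in mind is roughly the following. It's a sketch under made-up assumptions: actions are plain dicts with a `"type"` and target `"x"`/`"z"` coordinates, and `Bounds` is a hypothetical rectangle; it doesn't handle the harder edge cases like rivers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bounds:
    """Axis-aligned rectangle the agent must stay inside (inclusive edges)."""
    min_x: int
    max_x: int
    min_z: int
    max_z: int

    def contains(self, x: int, z: int) -> bool:
        return self.min_x <= x <= self.max_x and self.min_z <= z <= self.max_z


def is_action_allowed(action: dict, bounds: Bounds) -> bool:
    """Reject 'move' actions whose target is out of bounds; allow everything else."""
    if action.get("type") != "move":
        return True
    return bounds.contains(action["x"], action["z"])


if __name__ == "__main__":
    area = Bounds(min_x=0, max_x=100, min_z=0, max_z=100)
    print(is_action_allowed({"type": "move", "x": 50, "z": 50}, area))
    print(is_action_allowed({"type": "move", "x": 150, "z": 50}, area))
```

Because the check sits outside the model, passing it tells us about the scaffold and not about whether the model understood anything.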

Similarly for most video games - they might be good capabilities evals, but for example in chess - it's unclear what a "capable but misaligned" AI would be. [unless again I'm missing your point]

 

P.S

The "stay within this boundary" is a personal favorite of mine, I thought it was the best thing I had to say when I attempted to solve alignment myself just in case it ended up being easy (unfortunately that wasn't the case :P ). Link

Comment by Yonatan Cale (yonatan-cale-1) on Yonatan Cale's Shortform · 2024-11-23T18:10:31.066Z · LW · GW

Why downvote? you can tell me anonymously:

https://docs.google.com/forms/d/e/1FAIpQLSca6NOTbFMU9BBQBYHecUfjPsxhGbzzlFO5BNNR1AIXZjpvcw/viewform

Comment by Yonatan Cale (yonatan-cale-1) on Yonatan Cale's Shortform · 2024-11-23T18:08:54.222Z · LW · GW

Do we want minecraft alignment evals?

 

My main pitch:

There were recently some funny examples of LLMs playing minecraft and, for example, 

  1. The player asks for wood, so the AI chops down the player's tree house because it's made of wood
  2. The player asks for help keeping safe, so the AI tries surrounding the player with walls

This seems interesting because minecraft doesn't have a clear win condition, so unlike chess, there's a difference between minecraft-capabilities and minecraft-alignment. So we could take an AI, apply some alignment technique (for example, RLHF), let it play minecraft with humans (which is hopefully out of distribution compared to the AI's training), and observe whether the minecraft-world is still fun to play or if it's known that asking the AI for something (like getting gold) makes it sort of take over the world and break everything else.

Or it could teach us something else like "you must define for the AI which exact boundaries to act in, and then it's safe and useful, so if we can do something like that for real-world AGI we'll be fine, but we don't have any other solution that works yet". Or maybe "the AI needs 1000 examples for things it did that we did/didn't like, which would make it friendly in the distribution of those examples, but it's known to do weird things [like chopping down our tree house] without those examples or if the examples are only from the forest but then we go to the desert"

 

I have more to say about this, but the question that seems most important is "would results from such an experiment potentially change your mind":

  1. If there's an alignment technique you believe in and it would totally fail to make a minecraft server be fun when playing with an AI, would you significantly update towards "that alignment technique isn't enough"?
  2. If you don't believe in some alignment technique but it proves to work here, allowing the AI to generalize what humans want out of its training distribution (similarly to how a human that plays minecraft for the first time will know not to chop down your tree house), would that make you believe in that alignment technique way more and be much more optimistic about superhuman AI going well?

 

Assume the AI is smart enough to be vastly superhuman at minecraft, and that it has too many thoughts for a human to reasonably follow (unless the human is using something like "scalable oversight" successfully. that's one of the alignment techniques we could test if we wanted to)

Comment by Yonatan Cale (yonatan-cale-1) on Yonatan Cale's Shortform · 2024-11-23T17:31:40.316Z · LW · GW

Opinions on whether it's positive/negative to build tools like Cursor / Codebuff / Replit?

 

I'm asking because it seems fun to build and like there's low hanging fruit to collect in building a competitor to these tools, but also I prefer not destroying the world.


Considerations I've heard:

  1. Reducing "scaffolding overhang" is good, specifically to notice if RSPs should trigger a more advanced RSP level
    1. (This depends on the details/quality of the RSP too)
  2. There are always reasons to advance capabilities, this isn't even a safety project (unless you count... elicitation?), our bar here should be high
  3. Such scaffolding won't add capabilities which might make the AI good at general purpose learning or long term general autonomy. It will be specific to programming, with concepts like "which functions did I look at already" and instructions on how to write high quality tests.
  4. Anthropic are encouraging people to build agent scaffolding, and Codebuff was created by a Manifold cofounder [if you haven't heard about it, see here and here]. I'm mainly confused about this, I'd expect both to not want people to advance capabilities (yeah, Anthropic want to stay in the lead and serve as an example, but this seems different). Maybe I'm just not in sync
Comment by Yonatan Cale (yonatan-cale-1) on How much I'm paying for AI productivity software (and the future of AI use) · 2024-10-20T15:39:05.934Z · LW · GW

Thanks! I'm excited to go over the things I never heard of

 

So far,

  1. Elevenlabs app: great, obviously
  2. Bolt: I didn't like it
    1. I asked it to create a React Native app that prints my GPS coordinates to the screen (as a POC), it couldn't do it. I also asked for a podcast app (someone must and no one else will..), it did less well than Replit (though Replit used web). Anyway my main use case would be mobile apps (I don't have a reasonable solution for that yet) (btw I hardly have mobile development experience, so this is an extra interesting use case for me).
    2. It sounds like maybe you're missing templates to start from? I do think Bolt's templates have something cool about them, but I don't think
  3. Warp: I already use the free version and I like it very much. Great for things like "stop this docker container and also remove the volume"
  4. Speech to text: I use ChatGPT voice. My use case is "I'm riding my bike and I want to use the time to write a document", so we chat about it back and forth

 

Q:

5. How do you "Use o1-mini for more complex changes across the codebase"? (what tool knows your code and can query o1 about it?)

5.1. OMG, Is that what Cursor Composer is? I have got to try that

Comment by Yonatan Cale (yonatan-cale-1) on Bitter lessons about lucid dreaming · 2024-10-20T15:20:59.222Z · LW · GW

I don't think so (?)

There are physical things that make me have more nightmares, like being too hot, or needing to pee

Sounds like I might be missing something obvious?

Comment by Yonatan Cale (yonatan-cale-1) on Bitter lessons about lucid dreaming · 2024-10-19T12:37:34.256Z · LW · GW

I find lucid dreams to be effective "against" nightmares (for 10+ years already).

AMA if you want

Comment by Yonatan Cale (yonatan-cale-1) on My 10-year retrospective on trying SSRIs · 2024-09-23T06:28:46.736Z · LW · GW

Thanks for sharing <3

My main concern about trying SSRIs is that they'll make me stop noticing certain things that I care about, things that currently manifest as anxiety or so.

Opinions?

Comment by Yonatan Cale (yonatan-cale-1) on Should we exclude alignment research from LLM training datasets? · 2024-08-24T16:31:46.425Z · LW · GW

As AIs become more capable, we may at least want the option of discussing them out of their earshot.

If I'd want to discuss something outside of an AI's earshot, I'd use something like Signal, or something that would keep out a human too.

AIs sometimes have internet access, and robots.txt won't keep them out.

I don't think having this info in their training set is a big difference (but maybe I don't see the problem you're pointing out, so this isn't confident).

Comment by Yonatan Cale (yonatan-cale-1) on The case for stopping AI safety research · 2024-05-24T20:25:53.222Z · LW · GW

Scaling matters, but it's not all that matters.

For example, RLHF

Comment by Yonatan Cale (yonatan-cale-1) on simeon_c's Shortform · 2024-05-18T17:12:50.966Z · LW · GW

@habryka , Would you reply to this comment if there's an opportunity to donate to either? Me and another person are interested, and others could follow this comment too if they wanted to

(only if it's easy for you, I don't want to add an annoying task to your plate)

Comment by Yonatan Cale (yonatan-cale-1) on Yonatan Cale's Shortform · 2024-05-18T17:05:58.019Z · LW · GW
Comment by Yonatan Cale (yonatan-cale-1) on On excluding dangerous information from training · 2023-11-18T15:44:53.094Z · LW · GW

+1, you convinced me.

I worry this will distract from risks like "making an AI that is smart enough to learn how to hack computers from scratch", but I don't buy the general "don't distract with true things" argument.

Comment by Yonatan Cale (yonatan-cale-1) on I'm a Former Israeli Officer. AMA · 2023-10-16T13:30:11.626Z · LW · GW

"I don't think that there is more that 1% that support direct violence against non-terrorists for its own sake": This seems definitely wrong to me, if you also count Israelis who consider everyone in Gaza as potential terrorists or something like that.

If you offer Israelis:

Button 1: Kill all of Hamas

Button 2: Kill all of Gaza

Then definitely more than 1% will choose Button 2

Comment by Yonatan Cale (yonatan-cale-1) on I'm a Former Israeli Officer. AMA · 2023-10-10T20:25:34.695Z · LW · GW

I haven't heard of anything like that (but not sure if I would).

 

Note there are also problems in trying to set up a government using force, in setting up a police force there if they're not interested in it, in building an education system (which is currently, afaik, very anti Israel and wouldn't accept Israel's opinions on changes, I think) ((not that I'm excited about Israel's internal education system either)).

 

I do think Israel provides water, electricity, internet, equipment, medical equipment (subsidized? free? I'm not sure of all this anyway) to Gaza. I don't know if you count that as something like "building a stockpile of equipment for providing clean drinking water to residents of occupied territory".

 

I don't claim the current solution is good, I'm just pointing out some problems with what I think you're suggesting (and I'm not judging whether those problems are bigger or smaller).

Comment by Yonatan Cale (yonatan-cale-1) on I'm a Former Israeli Officer. AMA · 2023-10-10T17:26:09.533Z · LW · GW

What do you mean by "building capacity" in this context? (maybe my English isn't good enough, I didn't understand your question)

Comment by Yonatan Cale (yonatan-cale-1) on I'm a Former Israeli Officer. AMA · 2023-10-10T17:25:30.218Z · LW · GW

I was a software developer in the Israeli military (not a data scientist), and I was part of a course that constantly trains software developers for various units to use. 

The big picture is that the military is a huge organization, and there is a ton of room for software to improve everything. I can't talk about specific uses (just like I can't describe our tanks or whatever, sorry if that's what you're asking, and sorry I'm not giving the full picture), but even things like logistics or servers or healthcare have big teams working on them.

Also remember the military started a long time ago, when there weren't good off-the-shelf solutions for everything, and imagine how big the companies are that make many of the products that you (or orgs) use.

Comment by Yonatan Cale (yonatan-cale-1) on I'm a Former Israeli Officer. AMA · 2023-10-10T17:13:36.937Z · LW · GW
  1. There are also many Israelis that don't consider Palestinians to be humans worth protecting, but rather as evil beings / outgroup / whatever you'd call that.
  2. Also (with much less confidence), I do think many Palestinians want to kill Israelis because of things that I'd consider brainwashing. 
    1. Hard question - what to do about a huge population that's been brainwashed like that (if my estimation here is correct), or how might a peaceful resolution look?
Comment by Yonatan Cale (yonatan-cale-1) on I'm a Former Israeli Officer. AMA · 2023-10-10T10:32:23.578Z · LW · GW

Not a question, but seems relevant for people who read this post:

 

Meni Rosenfeld, one of the early LessWrong Israel members, has enlisted:

[Image: 2 people]

Source: https://www.facebook.com/meni.rosenfeld/posts/pfbid0bkvfrb3qFTF7U82eMgkZzgMjMT4s3pbGUx7ahgKX1B8hr2n1viYqg9Msz6t3dBUPl (a public post by him)

Comment by Yonatan Cale (yonatan-cale-1) on Eliezer Yudkowsky Is Frequently, Confidently, Egregiously Wrong · 2023-08-28T16:19:57.344Z · LW · GW

Eliezer replied on the EA Forum

Comment by Yonatan Cale (yonatan-cale-1) on Sam Altman: "Planning for AGI and beyond" · 2023-02-28T12:24:25.697Z · LW · GW

Any ideas on how much to read this as "Sam's actual opinions" vs "Sam trying to say things that will satisfy the maximum number of people"?

(do we have priors on his writings? do we have information about him absolutely not meaning one or more of the things here?)

Comment by Yonatan Cale (yonatan-cale-1) on The Preference Fulfillment Hypothesis · 2023-02-28T12:11:15.518Z · LW · GW

Hey Kaj :)

The part-hiding-complexity here seems to me like "how exactly do you take a-simulation/prediction-of-a-person and get from it the-preferences-of-the-person".

For example, would you simulate a negotiation with the human and how the negotiation would result? Would you simulate asking the human and then do whatever the human answers? (there were a few suggestions in the post, I don't know if you endorse a specific one or if you even think this question is important)

Comment by Yonatan Cale (yonatan-cale-1) on SolidGoldMagikarp (plus, prompt generation) · 2023-02-09T22:22:24.901Z · LW · GW

Because (I assume) once OpenAI[1] say "trust our models", that's the point when it would be useful to publish our breaks.

Breaks that weren't published yet, so that OpenAI couldn't patch them yet.

[unconfident; I can see counterarguments too]

  1. ^

    Or maybe when the regulators or experts or the public opinion say "this model is trustworthy, don't worry"

Comment by Yonatan Cale (yonatan-cale-1) on SolidGoldMagikarp (plus, prompt generation) · 2023-02-09T18:03:36.260Z · LW · GW

I'm confused: Wouldn't we prefer to keep such findings private? (at least, keep them private until OpenAI says something like "this model is reliable/safe"?)

 

My guess: You'd reply that finding good talent is worth it?

Comment by Yonatan Cale (yonatan-cale-1) on 11 heuristics for choosing (alignment) research projects · 2023-01-30T10:02:16.202Z · LW · GW

This seems like great advice, thanks!

I'd be interested in an example of what "a believable story in which this project reduces AI x-risk" looks like, if Dane (or someone else) would like to share.

Comment by Yonatan Cale (yonatan-cale-1) on Let's See You Write That Corrigibility Tag · 2023-01-13T14:04:50.609Z · LW · GW

A link directly to the corrigibility part (skipping unrelated things that are on the same page):

https://www.projectlawful.com/replies/1824457#reply-1824457

Comment by Yonatan Cale (yonatan-cale-1) on Trapped Priors As A Basic Problem Of Rationality · 2023-01-13T12:41:02.272Z · LW · GW

This post got me to do something like exposure therapy on myself in 10+ situations, which felt like the "obvious" thing to do in those situations. This is a huge amount of life-change-per-post.

Comment by Yonatan Cale (yonatan-cale-1) on Victoria Krakovna on AGI Ruin, The Sharp Left Turn and Paradigms of AI Alignment · 2023-01-12T23:37:53.338Z · LW · GW

My thoughts:

[Epistemic status + impostor syndrome: Just learning, posting my ideas to hear how they are wrong and in the hope of interacting with others in the community. Don't learn from my ideas]


A)

Victoria: “I don't think that the internet has a lot of particularly effective plans to disempower humanity.”

I think:

  1. Having ready plans on the internet and using them is not part of the normal threat model from an AGI. If that were the problem, we could just filter out those plans from the training set.
  2. (The internet does have such ideas. I will briefly mention biosecurity, but I prefer not spreading ideas on how to disempower humanity)

 

B)

[Victoria:] I think coming up with a plan that gets past the defenses of human society requires thinking differently from humans.

TL;DR: I think some ways to disempower humanity don't require thinking differently than humans

I'll split up AI's attack vectors into 3 buckets:

  1. Attacks that humans didn't even think of (such as what we can do to apes)
  2. Attacks that humans did think of but are not defending against (for example, we thought about pandemic risks but we didn't defend against them very well). Note this does not require thinking about things that humans didn't think about.
  3. Attacks that humans are actively defending against, such as using robots with guns or trading in the stock market or playing Go (Go probably won't help with taking over the world, but humans are actively working on winning Go games, so I put the example here). Having an AI beat us in one of these does require it to be in some important (to me) sense smarter than us, but not all attacks are in this bucket.

 

C)

[...] requires thinking differently from humans

I think AIs today already think differently than humans in any reasonable way we could mean that. In fact, if we could make them NOT think differently than humans, my [untrustworthy] opinion is that this would be non-negligible progress towards solving alignment. No?

 

D)

The intelligence threshold for planning to take over the world isn't low

First, disclaimers: 

(1) I'm not an expert and this isn't widely reviewed, (2) I'm intentionally not being detailed in order to not spread ideas on how to take over the world. I'm aware this is bad epistemics and I'm sorry for it; it's the tradeoff I'm picking.

So, mainly based on A, I think a person who is 90% as intelligent as Elon Musk in all dimensions would probably be able to destroy humanity, and so (if I'm right), the intelligence threshold is lower than "the world's smartest human". Again, sorry for the lack of detail. [mods, if this was already too much, feel free to edit/delete my comment]

Comment by Yonatan Cale (yonatan-cale-1) on What's up with ChatGPT and the Turing Test? · 2023-01-07T12:06:04.611Z · LW · GW

"Doing a Turing test" is a solution to something. What's the problem you're trying to solve?

Comment by Yonatan Cale (yonatan-cale-1) on What's up with ChatGPT and the Turing Test? · 2023-01-05T16:07:35.612Z · LW · GW

As a judge, I'd ask the test subject to write me a rap song about Turing tests. If it succeeds, I guess it's a ChatGPT ;P

 

More seriously - it would be nice to find a judge that doesn't know the capabilities and limitations of GPT models. Knowing those is very, very useful.

Comment by Yonatan Cale (yonatan-cale-1) on Private alignment research sharing and coordination · 2022-10-18T09:08:28.709Z · LW · GW

[I also just got funded (FTX) to work on this for realsies 😸🙀 ]

I'm still in "learn the field" mode and haven't picked any direction to dive into, but I am asking myself questions like "how would someone armed with a pretty strong AI take over the world?".

Regarding commitment from the mentor: My current format is "live blogging" in a Slack channel. A mentor could look whenever they want, and comment only on whatever they want to. wdyt?

(But I don't know who to add to such a channel which would also contain the potentially harmful ideas)