Posts

How to safely use an optimizer 2024-03-28T16:11:01.277Z
AISC team report: Soft-optimization, Bayes and Goodhart 2023-06-27T06:05:35.494Z
Pivotal acts using an unaligned AGI? 2022-08-21T17:13:55.830Z
The Hammer and the Mask - A call to action 2020-04-19T13:36:03.448Z

Comments

Comment by Simon Fischer (SimonF) on How to safely use an optimizer · 2024-04-11T15:51:14.104Z · LW · GW

Thanks for your comment! I didn't get around to answering earlier, but maybe it's still useful to try to clarify a few things.

If I understand the setup correctly, there's no guarantee that the optimal element would be good, right?

My threat model here is that we have access to an Oracle that's not trustworthy (as specified in the first paragraph), so that even if we were able to specify our preferences correctly, we would still have a problem. So in this context you could assume that we managed to specify our preferences correctly. If our problem is simply that we misspecified our preferences (this would roughly correspond to "fail at outer alignment" vs my threat model of "fail at inner alignment"), solving this by soft-maximization is much easier: Just "flatten" the top part of your utility function (i.e. make all outputs that satisfice have the same utility) and add some noise, then hand it to an optimizer.
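For concreteness, here's a minimal sketch of that flatten-and-add-noise recipe (in Python, with made-up names, assuming a finite candidate set and a known satisficing threshold):

```python
import random

def soft_maximize(candidates, utility, threshold, noise_scale=1.0, seed=None):
    """Flatten the top of the utility function (everything at or above the
    satisficing threshold gets the same capped score), add noise, then argmax."""
    rng = random.Random(seed)

    def flattened(x):
        return min(utility(x), threshold) + rng.gauss(0.0, noise_scale)

    return max(candidates, key=flattened)

# Toy usage: utilities 0..99, everything >= 90 counts as satisficing.
candidates = list(range(100))
print(soft_maximize(candidates, utility=lambda x: x, threshold=90, noise_scale=5.0, seed=0))
```

With continuous noise the argmax is essentially a uniform draw from the satisficing set, which is fine here precisely because in this scenario the optimizer itself is assumed to be trustworthy.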

So I guess the point I tried to make with my post could be stated as "Soft-optimization can also be used to help with inner alignment, not just outer alignment" (and I freely admit that I should have said so in the post).

It's just likely since the optimal element a priori shouldn't be unusually bad, and you're assuming most satisficing elements are fine.

I'm not just assuming that, I gave a bit of an argument for why I believe that to be the case: I assume that an output is not randomly dangerous, but only dangerous if it was specifically chosen to achieve a goal that differs from ours. This only holds if the goal is not too ambitious; e.g., if we ask the AI controlling the paperclip factory for 10^30 paperclips, that will not go well.

As I understand, gwern has a stronger opinion on that and believes that side-effects from less ambitious plans are still an important concern. But at least my proposal gets rid of the potentially adversarial selection by the untrusted optimizer, so I believe that getting safe outputs (e.g. by combining our goal with impact regularization) should be much easier.

I'm still unsure about how useful this problem setup is. For example, we'd probably want to train the weakest system that can give us satisficing outputs (rather than having an infinitely intelligent oracle).

My assumption about the black-box untrusted optimizer is a worst-case assumption. So I'm trying to understand if we could get something useful out of the system even in that case. If we can make stronger assumptions about what we know about the system, the problem of course gets much easier. Given the general state of progress in interpretability compared with progress in capabilities, I think it's worth thinking about this worst-case scenario a bit.

I like your suggestion of "train the weakest system that can give us satisficing outputs", but I think doing that might actually be quite difficult: How do we measure (or otherwise predict) how weak/strong our system is and how easy/difficult the satisficing task is? Can these things even be measured on a 1-d scale? Slowly increasing capability until the system solves the task runs into the problem mentioned in footnote 10 if the system deceptively pretends to be weaker than it is.

Comment by Simon Fischer (SimonF) on How to safely use an optimizer · 2024-03-29T10:21:37.208Z · LW · GW

I think that, if you are wanting a formally verified proof of some maths theorem out of the oracle, then this is getting towards actually likely to not kill you.

Yes, I believe that's within reach using this technique.

 

You can start with m huge, and slowly turn it down, so you get a long list of "no results", followed by a proof. (Where the optimizer only had a couple of bits of free optimization in choosing which proof.) 

This is quite dangerous though if the Oracle is deceptively withholding answers; I commented on this in the last paragraph of this section.

Comment by Simon Fischer (SimonF) on How to safely use an optimizer · 2024-03-28T23:49:13.989Z · LW · GW

I'm not sure I understand your question. What restriction do you have in mind? A safety restriction on what the generated code should be like? Something like requiring that the code be in some canonical form, to remove degrees of freedom for the (potentially malicious) code-generating AI?

Comment by Simon Fischer (SimonF) on How to safely use an optimizer · 2024-03-28T23:26:35.573Z · LW · GW

How do we prove the AI tools we used didn't insert the most gnarly backdoors possible? Things that require a particular code implementation that is valid, on top of a subtle compiler bug, on top of an exact timing hardware bug that can't be tested for and therefore won't be found in validation....

I believe this is exactly the kind of thing that my proposal would be good for: Gnarly backdoors that exploit a compiler bug etc. should be very rare in the set of all valid implementations!

Comment by Simon Fischer (SimonF) on How to safely use an optimizer · 2024-03-28T23:13:19.645Z · LW · GW

I think you mean  (two answers that satisfice and fulfill the safety constraint), but otherwise I agree. This is also an example of this whole "let's measure optimization in bits"-business being a lot more subtle than it appears at first sight.

Comment by Simon Fischer (SimonF) on How to safely use an optimizer · 2024-03-28T21:57:16.299Z · LW · GW

First thought: The oracle is going to choose to systematically answer or not answer the queries we give it. This represents a causal channel of one bit per query it can use to influence the outside world[1]. Can you conquer the world in one awkwardly delivered kilobyte or less? Maybe.

Agreed. I think it's potentially a good bit worse than one kilobyte if we let ourselves be tricked into asking many questions, asking different questions, or lowering the difficulty of the safety constraint too much.

As mentioned in footnote 10, this requires a kind of perfect coordination between different Oracles that we manage to build, where the Oracles have to be able to correctly predict for all kinds of questions whether humanity (with the help of other Oracles) has been able to figure out the answers (or the rough sizes of various sets). Otherwise, when caught in deception, we can (as you noted) scrap or retrain the Oracles.

Second thought: What kind of optimisation problems can we specify well enough for a formal proof checker to tell whether they've been satisficed? Are they the kind of problems where solving them can save the world?

You're correct in that this approach is only useful insofar as validation is easier than generation. Though for this technique, the validation doesn't have to be done by a formal proof checker; any program that you can run suffices. It might even be a very slow program (e.g. a big LLM) if you have an effective way to communicate your goal set to the Oracle (e.g. using a natural language prompt, as we already do today with current AIs).
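As a toy illustration of what I mean (names and the example task are made up), the wrapper is just: act on an output only after our own validator program has accepted it, where checking can be vastly cheaper than generating:

```python
def accept_output(oracle_output: str, validator) -> bool:
    """Accept the untrusted Oracle's output only if our own validator program says yes.
    The validator is anything we can run: a proof checker, a test suite, a slow LLM judge."""
    try:
        return bool(validator(oracle_output))
    except Exception:
        return False  # anything that crashes the validator counts as a rejection

# Hypothetical usage: we asked the Oracle for a nontrivial factor of n.
# Checking the answer (one modulo operation) is far cheaper than finding it.
n = 2**67 - 1  # = 193707721 * 761838257287

def is_nontrivial_factor(answer: str) -> bool:
    d = int(answer.strip())
    return 1 < d < n and n % d == 0

print(accept_output("193707721", is_nontrivial_factor))  # True
```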

Comment by Simon Fischer (SimonF) on How to safely use an optimizer · 2024-03-28T21:43:10.013Z · LW · GW

We probably would've been less enthusiastic about hooking up LLMs to the Internet too, but here we collectively are. We do face a superintelligent adversary: all of the incentives and factions of humanity. An Oracle which is simply neutral is still default dangerous.

I completely agree with that. My proposal does not address the global coordination problem that we face, but it might be a useful tool if we collectively get our act together or if the first party with access to superintelligence has enough slack to proceed extra carefully. Even more modestly, I was hoping this might contribute to our theoretical understanding of why soft-optimization can be useful.

Comment by Simon Fischer (SimonF) on How to safely use an optimizer · 2024-03-28T20:09:12.070Z · LW · GW

The threat model here seems basically wrong and focused on sins of commission when sins of omission are, if anything, an even larger space of threats and which apply to 'safe' solutions reported by the Oracle.

Sure, I mostly agree with the distinction you're making here between "sins of commission" and "sins of omissions". Contrary to you, though, I believe that getting rid of the threat of "sins of commission" is extremely useful. If the output of the Oracle is just optimized to fulfill your satisfaction goal and not for anything else, you've basically gotten rid of the superintelligent adversary in your threat model.

'Devising a plan to take over the world' for a misaligned Oracle is not difficult, it is easy, because the initial steps like 'unboxing the Oracle' are the default convergent outcome of almost all ordinary non-dangerous use which in no way mentions 'taking over the world' as the goal. ("Tool AIs want to be Agent AIs.") To be safe, an Oracle has to have a goal of not taking over the world.

I agree that for many ambitious goals, 'unboxing the Oracle' is an instrumental goal. It's overwhelmingly important that we use such an Oracle setup only for goals that are achievable without such instrumental goals being pursued as a consequence of a large fraction of the satisficing outputs. (I mentioned this in footnote 2, but probably should have highlighted it more.) I think this is a common limitation of all soft-optimization approaches.

There are many, many orders of magnitude more ways to be insecure than to be secure, and insecure is the wide target to hit.

This is talking about a different threat model than mine. You're talking here about security in a more ordinary sense, as in "secure from being hacked by humans" or "secure from accidentally leaking dangerous information". I feel like this type of security concern should be much easier to address, as you're defending yourself not against superintelligences but against humans and accidents.

The example you gave about the Oracle producing a complicated plan that leaks the source of the Oracle is an example of this: It's trivially defended against by not connecting the device the Oracle is running on to the internet and not using the same device to execute the great "cure all cancer" plan. (I don't believe that either you or I would have made that mistake!)

Comment by Simon Fischer (SimonF) on 'Empiricism!' as Anti-Epistemology · 2024-03-14T23:25:03.100Z · LW · GW

Ah, I think there was a misunderstanding. I (and maybe also quetzal_rainbow?) thought that in the inverted world no "apparently-very-lucrative deals" that turn out to be scams are known either, whereas you made a distinction between those kinds of deals and Ponzi schemes in particular.

I think my interpretation is more in the spirit of the inversion, otherwise the Epistemologist should really have answered as you suggested, and the whole premise of the discussion (people seem to have trouble understanding what the Spokesperson is doing) is broken.

Comment by Simon Fischer (SimonF) on 'Empiricism!' as Anti-Epistemology · 2024-03-14T21:13:32.950Z · LW · GW

I think this would be a good argument against Said Achmiz's suggested response, but I feel the text doesn't completely support it, e.g. the Epistemologist says "such schemes often go through two phases" and "many schemes like that start with a flawed person", suggesting that such schemes are known to him.

Comment by Simon Fischer (SimonF) on Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI · 2024-01-29T10:10:41.352Z · LW · GW

In Section 5 we discuss why expect oversight and control of powerful AIs to be difficult.

Another typo, probably missing a "we".

Comment by Simon Fischer (SimonF) on This might be the last AI Safety Camp · 2024-01-26T19:41:09.380Z · LW · GW

The soft optimization post took 24 person-weeks (assuming 4 people half-time for 12 weeks) plus some of Jeremy's time.

Team member here. I think this is a significant overestimate, I'd guess at 12-15 person-weeks. If it's relevant I can ask all former team members how much time they spent; it was around 10h per week for me. Given that we were beginners and spent a lot of time learning about the topic, I feel we were doing fine and learnt a lot. 

Working on this part-time was difficult for me and the fact that people are not working on these things full-time in the camp should be considered when judging research output.

Comment by Simon Fischer (SimonF) on Notice When People Are Directionally Correct · 2024-01-15T10:31:50.778Z · LW · GW

Missile attacks are not piracy, though, right?

It's good that you learned a few things from these incidents, but I'm sceptical of the (different) claim implied by the headline that Peter Zeihan was meaningfully correct here. If you interpret "directions" imprecisely enough, it's not hard to be sometimes directionally correct.

Comment by Simon Fischer (SimonF) on Decent plan prize announcement (1 paragraph, $1k) · 2024-01-12T10:23:04.080Z · LW · GW

I know this answer doesn't qualify, but very likely the best you can currently do is: Don't do it. Don't train the model.

Comment by Simon Fischer (SimonF) on Apologizing is a Core Rationalist Skill · 2024-01-04T12:23:16.378Z · LW · GW

(I downvoted your comment because it's just complaining about downvotes to unrelated comments/posts and not meaningfully engaging with the topic at hand)

Comment by Simon Fischer (SimonF) on Most People Don't Realize We Have No Idea How Our AIs Work · 2023-12-24T14:25:06.427Z · LW · GW

"Powerful AIs Are Black Boxes" seems like a message worth sending out

Everybody knows what (computer) scientists and engineers mean by "black box", of course.

Comment by Simon Fischer (SimonF) on Could Germany have won World War I with high probability given the benefit of hindsight? · 2023-11-29T19:48:49.939Z · LW · GW

I guess it's hard to keep "they are experimenting with / building huge amounts of tanks" and "they are conducting combined arms exercises" secret from France and Russia, so they would have a lot of advance warning and could then also develop tanks.

But if you have a lot more than a layman's understanding of tank design / combined arms doctrine, you could still come out ahead in this.

Comment by Simon Fischer (SimonF) on Deception Chess: Game #2 · 2023-11-29T11:15:28.870Z · LW · GW

 "6. f6" should be "6. h3".

Comment by Simon Fischer (SimonF) on Dialogue on the Claim: "OpenAI's Firing of Sam Altman (And Shortly-Subsequent Events) On Net Reduced Existential Risk From AGI" · 2023-11-21T22:06:20.971Z · LW · GW

Microsoft is the sort of corporate bureaucracy where dynamic orgs/founders/researchers go to die. My median expectation is that whatever former OpenAI group ends up there will be far less productive than they were at OpenAI.


I'm a bit sceptical of that. You gave some reasonable arguments, but all of this should be known to Sam Altman, and he still chose to accept Microsoft's offer instead of founding his own org (I'm assuming he would easily have been able to raise a lot of money). So, given that "how productive are the former OpenAI folks at Microsoft?" is the crux of the argument, it seems that recent events are good news iff Sam Altman made a big mistake with that decision.

Comment by Simon Fischer (SimonF) on Does davidad's uploading moonshot work? · 2023-11-15T12:22:55.370Z · LW · GW

I'm confused by this statement. Are you assuming that AGI will definitely be built after the research time is over, using the most-plausible-sounding solution?

Or do you believe that you understand NOW that a wide variety of approaches to alignment, including most of those that can be thought of by a community of non-upgraded alignment researchers (CNUAR) in a hundred years, will kill everyone and that in a hundred years the CNUAR will not understand this?

If so, is this because you think you personally know better or do you predict the CNUAR will predictably update in the wrong direction? Would it matter if you got to choose the composition of the CNUAR?

Comment by Simon Fischer (SimonF) on Does davidad's uploading moonshot work? · 2023-11-12T19:55:25.255Z · LW · GW

Another big source of potential volunteers: People who are going to be dead soon anyway. I'd probably volunteer if I knew that I was dying of cancer in a few weeks anyway.

Comment by Simon Fischer (SimonF) on Game Theory without Argmax [Part 1] · 2023-11-12T11:22:23.686Z · LW · GW

Typo: This should be .
 

Comment by Simon Fischer (SimonF) on Deception Chess: Game #1 · 2023-11-06T12:54:44.472Z · LW · GW

after 17... dxc6 or 17. c6

This should probably be "after 17... cxd6 or 17... c6".

Comment by Simon Fischer (SimonF) on The Good Life in the face of the apocalypse · 2023-10-17T13:30:41.450Z · LW · GW

I suspect Wave refers to this company: https://www.wave.com/en/ (they are connected to EA)

Planecrash is a glowfic co-written by Yudkowsky: https://glowficwiki.noblejury.com/books/planecrash

Comment by Simon Fischer (SimonF) on I'm consistently overwhelmed by basic obligations. Are there any paradigm shifts or other rationality-based tips that would be helpful? · 2023-07-23T12:45:26.696Z · LW · GW

Seconding the recommendation of the "rest in motion" post; it has helped me with a maybe-similar feeling.

Comment by Simon Fischer (SimonF) on Why Not Just Outsource Alignment Research To An AI? · 2023-03-10T02:04:21.984Z · LW · GW

I don't believe these "practical" problems ("can't try long enough") generalize enough to support your much more general initial statement. This doesn't feel like a true rejection to me, but maybe I'm misunderstanding your point.

Comment by Simon Fischer (SimonF) on Why Not Just Outsource Alignment Research To An AI? · 2023-03-10T01:50:55.693Z · LW · GW

I think I mostly agree with this, but from my perspective it hints that you're framing the problem slightly wrong. Roughly, the problem with the outsourcing approaches is our inability to specify/verify solutions to the alignment problem, not that specifying is not in general easier than solving yourself.

(Because of the difficulty of specifying the alignment problem, I restricted myself to speculating about pivotal acts in the post linked above.)

Comment by Simon Fischer (SimonF) on Why Not Just Outsource Alignment Research To An AI? · 2023-03-10T01:40:56.120Z · LW · GW

But you don't need to be able to code to recognize that a piece of software is slow and buggy!?

I agree a bit more about the terrible UI part, but even there one can think of relatively objective measures to check usability without being able to speak Python.

Comment by Simon Fischer (SimonF) on Why Not Just Outsource Alignment Research To An AI? · 2023-03-10T01:23:18.741Z · LW · GW

In cases where outsourcing succeeds (to various degrees), I think the primary load-bearing mechanism of success in practice is usually not "it is easier to be confident that work has been done correctly than to actually do the work", at least for non-experts.

I find this statement very surprising. Isn't almost all of software development like this?
E.g., the client asks the developer for a certain feature and then clicks around the UI to check if it's implemented / works as expected.

Comment by Simon Fischer (SimonF) on Why Not Just Outsource Alignment Research To An AI? · 2023-03-10T00:39:56.239Z · LW · GW

"This is what it looks like in practice, by default, when someone tries to outsource some cognitive labor which they could not themselves perform."
This proves way too much.

I agree, I think this even proves P=NP.

Maybe a more reasonable statement would be: You cannot outsource cognitive labor if you don't know how to verify the solution. But I think that's still not completely true, given that interactive proofs are a thing. (Plug: I wrote a post exploring the idea of applying interactive proofs to AI safety.)

Comment by Simon Fischer (SimonF) on Pivotal acts using an unaligned AGI? · 2022-08-22T13:12:46.411Z · LW · GW

No, that's not quite right. What you are describing is the NP-Oracle.

On the other hand, with the IP-Oracle we can (in principle, limited by the power of the prover/AI) solve all problems in the PSPACE complexity class.

Of course, PSPACE is again a class of decision problems, but using binary search it's straightforward to extract complete answers like the designs mentioned later in the article.
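A toy sketch of that extraction step (purely illustrative: the stand-in oracle below just hides an integer, whereas the real Oracle would be answering PSPACE decision questions about, say, prefixes of a design):

```python
def extract_answer(decision_oracle, num_bits: int) -> int:
    """Recover a num_bits-bit answer from an oracle that only answers yes/no questions.
    decision_oracle(lo, hi) is assumed to answer: 'is there a valid answer whose
    integer encoding lies in [lo, hi]?'  Binary search needs ~num_bits queries."""
    lo, hi = 0, 2**num_bits - 1
    if not decision_oracle(lo, hi):
        raise ValueError("oracle reports no valid answer in range")
    while lo < hi:
        mid = (lo + hi) // 2
        if decision_oracle(lo, mid):
            hi = mid
        else:
            lo = mid + 1
    return lo

# Toy stand-in: the unique valid answer is secretly 42.
SECRET = 42
print(extract_answer(lambda lo, hi: lo <= SECRET <= hi, num_bits=8))  # 42
```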

Comment by Simon Fischer (SimonF) on Two-year update on my personal AI timelines · 2022-08-05T13:10:54.137Z · LW · GW

Your reasoning here relies on the assumption that the learning mostly takes place during the individual organism's lifetime. But I think it's widely accepted that brains are not "blank slates" at birth, but contain a significant amount of information, akin to a pre-trained neural network. Thus, if we consider evolution as the training process, we might reach the opposite conclusion: data quantity and training compute are extremely high, while parameter count (~brain size) and brain compute are restricted and selected against.

Comment by Simon Fischer (SimonF) on Mandatory Post About Monkeypox · 2022-05-25T13:40:45.540Z · LW · GW

Thank you for writing about this! A minor point: I don't think aerosolizing monkeypox suspensions using a nebulizer can be counted as gain of function research, not even "at least kind of". (Or do I lack reading comprehension and misunderstood something?)

Comment by Simon Fischer (SimonF) on Project Intro: Selection Theorems for Modularity · 2022-04-07T20:11:13.946Z · LW · GW

Hypothesis: If a part of the computation that you want your trained system to compute "factorizes", it might be easier to evolve a modular system for this computation. By factorization I just mean that (part of) the computation can be performed using mostly independent parts / modules.

Reasoning: Training independent parts to each perform some specific sub-calculation should be easier than training the whole system at once. E.g. training n neural networks of size N/n should be easier (in terms of compute or data needed) than training one of size N, given the exponential size of the parameter space.
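To put a toy number on the "exponential size of the parameter space" intuition (counting raw configurations only, which is of course not the same thing as training difficulty), assume b distinguishable settings per parameter:

```latex
% One monolithic network with N parameters vs. n independent modules of N/n parameters each:
\underbrace{b^{N}}_{\text{monolithic}}
\quad\text{vs.}\quad
\underbrace{n \cdot b^{N/n}}_{n\text{ independent modules}},
\qquad\text{e.g. } b=2,\ N=100,\ n=10:\quad 2^{100}\approx 10^{30}
\quad\text{vs.}\quad 10\cdot 2^{10}\approx 10^{4}.
```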

This hypothesis might explain the appearance of modularity if the necessary initial conditions for this selective advantage to be used are regularly present.

(I've talked about this idea with Lblack already but wanted to spell it out a bit more and post it here for reference.)

Comment by Simon Fischer (SimonF) on The Hammer and the Mask - A call to action · 2020-04-20T22:00:30.468Z · LW · GW
Yes, this mask is more of a symbolic pic, perhaps Simon can briefly explain why he chose this one (copyright issues I think).

Yep, it's simply the first one in open domain that I found. I hope it's not too misleading; it should get the general idea across.

Comment by Simon Fischer (SimonF) on The Hammer and the Mask - A call to action · 2020-04-19T19:13:06.011Z · LW · GW

Exactly! :)

Comment by Simon Fischer (SimonF) on The Hammer and the Mask - A call to action · 2020-04-19T18:20:42.711Z · LW · GW

Well, if your chances of getting infected are drastically reduced, then the value of the "protect others" effect of wearing the mask is also reduced, so overall these masks are likely to be very useful.

That said, a slightly modified design that filters air both on the in- and the out-breath might be a good idea. This way, you keep your in-breath filters dry and have some "protect others" effect.

Comment by Simon Fischer (SimonF) on Hammer and Mask - Wide spread use of reusable particle filtering masks as a SARS-CoV-2 eradication strategy · 2020-04-15T10:40:32.818Z · LW · GW
[...] P3 masks, worn properly, with appropriate eye protection while maintaining basic hand hygiene are efficient in preventing SARS-CoV-2 infection regardless of setting.

If this is true, then this is a great idea and it's somewhat surprising that these masks are not in widespread use already.

I suspect the plan is a bit less practical than stated, as I expect there to be problems with compliance, in particular because the masks are mildly unpleasant to wear for prolonged periods.

Comment by Simon Fischer (SimonF) on LessWrong Help Desk - free paper downloads and more (2014) · 2015-07-23T22:20:08.832Z · LW · GW

They have a copy at our university library. I would need to investigate how to scan it efficiently, but I'm up for it if there isn't an easier way and no one else finds a digital copy.

Comment by Simon Fischer (SimonF) on Astronomy, space exploration and the Great Filter · 2015-05-18T21:27:10.890Z · LW · GW

Definitely Main, I found your post (including the many references) and the discussion very interesting.

Comment by Simon Fischer (SimonF) on Debunking Fallacies in the Theory of AI Motivation · 2015-05-05T22:11:05.781Z · LW · GW

I still agree with Eli and think you're "really failing to clarify the issue", and claiming that xyz is not the issue does not resolve anything. Disengaging.

Comment by Simon Fischer (SimonF) on Debunking Fallacies in the Theory of AI Motivation · 2015-05-05T21:12:50.696Z · LW · GW

The paper had nothing to do with what you talked about in your opening paragraph

What? Your post starts with:

My goal in this essay is to analyze some widely discussed scenarios that predict dire and almost unavoidable negative behavior from future artificial general intelligences, even if they are programmed to be friendly to humans.

Eli's opening paragraph explains the "basic UFAI doomsday scenario". How is this not what you talked about?

Comment by Simon Fischer (SimonF) on Deregulating Distraction, Moving Towards the Goal, and Level Hopping · 2014-01-16T13:59:02.989Z · LW · GW

What's Worm? Oh, wait..

Comment by Simon Fischer (SimonF) on Meetup : First Meetup in Cologne (Köln) · 2013-10-24T10:09:47.011Z · LW · GW

Awesome, a meetup in Cologne. I'll try to be there, too. :)

Comment by Simon Fischer (SimonF) on Does Checkers have simpler rules than Go? · 2013-08-20T16:04:52.654Z · LW · GW

It depends on the skill difference and the size of the board, on smaller boards the advantage is probably pretty large: Discussion on LittleGolem

Comment by Simon Fischer (SimonF) on MIRI's 2013 Summer Matching Challenge · 2013-08-15T11:43:21.507Z · LW · GW

$66, with some help from a friend.

Comment by Simon Fischer (SimonF) on The Robots, AI, and Unemployment Anti-FAQ · 2013-07-26T21:32:16.074Z · LW · GW

Regarding the drop of unemployment in Germany, I've heard it claimed that it is mainly due to changing the way the unemployment statistics are done, e.g. people who are in temporary, 1€/h jobs and still receiving benefits are counted as employed. If this point is still important, I can look for more details and translate.

EDIT: Some details are here:

It is possible to earn income from a job and receive Arbeitslosengeld II benefits at the same time. [...] There are criticisms that this defies competition and leads to a downward spiral in wages and the loss of full-time jobs. [...]

The Hartz IV reforms continue to attract criticism in Germany, despite a considerable reduction in short and long term unemployment. This reduction has led to some claims of success for the Hartz reforms. Others say the actual unemployment figures are not comparable because many people work part-time or are not included in the statistics for other reasons, such as the number of children that live in Hartz IV households, which has risen to record numbers.

Comment by Simon Fischer (SimonF) on 2012: Year in Review · 2013-01-03T14:13:25.616Z · LW · GW

Nope, it's still broken.

Comment by Simon Fischer (SimonF) on Checking Kurzweil's track record · 2012-11-05T14:10:46.478Z · LW · GW

I will do 20, too!

Comment by Simon Fischer (SimonF) on A cynical explanation for why rationalists worry about FAI · 2012-08-04T23:19:05.420Z · LW · GW

Isn't "exploring many unusual and controversial ideas" what scientists usually do? (Ok, maybe sometimes good scientist do it...) Don't you think that science could contribute to saving the world?