Posts

Infra-Bayesian Logic 2023-07-05T19:16:41.811Z
Yoshua Bengio: How Rogue AIs may Arise 2023-05-23T18:28:27.489Z
harfe's Shortform 2022-09-01T22:02:25.267Z

Comments

Comment by harfe on Yitz's Shortform · 2024-02-10T16:47:35.483Z · LW · GW

This sounds like https://www.super-linear.org/trumanprize. It seems like it is run by Nonlinear and not FTX.

Comment by harfe on Basic Inframeasure Theory · 2024-01-10T19:25:35.183Z · LW · GW

I think Proposition 1 is false as stated because the resulting functional is not always continuous (wrt the KR-metric). The function , with should be a counterexample. However, the non-continuous functional should still be continuous on the set of sa-measures.

Another thing: the space of measures is claimed to be a Banach space with the KR-norm (in the notation section). Afaik this is not true: while the space is a Banach space with the TV-norm, with the KR-metric/norm it is not complete and is merely a normed vector space. Also, the claim (in "Basic concepts") that is the dual space of is only true if equipped with the TV-norm, not with the KR-metric.

Another nitpick: in Theorem 5, the type of in the assumption is probably meant to be , instead of .

Comment by harfe on The Learning-Theoretic Agenda: Status 2023 · 2024-01-09T16:21:45.645Z · LW · GW

Regarding direction 17: There might be some potential drawbacks to ADAM. I think it's possible that some very agentic programs have a relatively low score, because explicit optimization algorithms have low complexity.

(Disclaimer: the following argument is not a proof and appeals to some heuristics/etc. We fix for these considerations too.) Consider a utility function . Further, consider a computable approximation of the optimal policy (AIXI that explicitly optimizes for ) with an approximation parameter n (this could be AIXI-tl, plus some approximation of ; higher n means a better approximation). We will call this approximation of the optimal policy . This approximation algorithm has complexity , where is a constant needed to describe the general algorithm (this should not be too large).

We can get a better approximation by using a quickly growing function, such as the Ackermann function with . Then we have .
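
To make the "quickly growing function" concrete, here is a minimal sketch of the two-argument Ackermann function (the Python rendering and the small test values are my own illustration, not part of the original argument):

```python
# Minimal sketch (my own illustration): the two-argument Ackermann function.
# A very short program, yet it grows extremely fast: A(4, 2) already has
# 19,729 decimal digits. This is why plugging it into the approximation
# parameter buys a huge amount of accuracy for very little description length.
import sys
sys.setrecursionlimit(100_000)

def ackermann(m: int, n: int) -> int:
    if m == 0:
        return n + 1
    if n == 0:
        return ackermann(m - 1, 1)
    return ackermann(m - 1, ackermann(m, n - 1))

print([ackermann(m, 2) for m in range(4)])  # [3, 4, 7, 29]
```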

What is the score of this policy? We have . Let be maximal in this expression. If , then .

For the other case, let us assume that if , the policy is at least as good at maximizing as . Then, we have .

I don't think that the assumption ( maximizes better than ) is true for all and , but plausibly we can select such that this is the case (exceptions, if they exist, would be a bit weird, and ADAM working well only because of these weird exceptions would feel a bit disappointing to me). One thing that is not captured by approximations such as AIXI-tl is programs that halt but have an insanely long runtime (longer than ). Again, it would feel weird to me if ADAM sort of works because of low-complexity extremely-long-running halting programs.

To summarize, maybe there exist policies which strongly optimize a non-trivial utility function with approximation parameter , but where is relatively small.

Comment by harfe on How Would an Utopia-Maximizer Look Like? · 2023-12-21T11:59:52.172Z · LW · GW

I think the claim that "deontological preferences are isomorphic to utility functions" is wrong as presented.

First, the formula has issues with dividing by zero and not summing probabilities to one (and re-using the variable as a local variable in the sum). So you probably meant something like . Even then, I don't think this describes any isomorphism of deontological preferences to utility functions.

  • Utility functions are invariant under multiplication by a positive constant. This is not reflected in the formula.

  • Utility maximizers usually take the action with the best utility with probability 1, rather than assigning different probabilities to actions with different utilities (see the sketch after this list).

  • Modelling deontological constraints as probability distributions doesn't seem right to me. Let's say I decide between drinking green tea and black tea, and neither choice violates any deontological constraint. Then assigning some values (which ones?) to P("I drink green tea") or P("I drink black tea") doesn't describe these deontological constraints well.

  • Any behavior can be encoded as a utility function, so finding some isomorphism to utility functions is usually possible, but not always meaningful.
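
To spell out the second bullet, here is a minimal sketch (my own construction, not the OP's formula) contrasting an argmax utility maximizer with a softmax-style policy that assigns different probabilities to actions with different utilities:

```python
# Minimal sketch (my own construction, not the OP's formula): an argmax utility
# maximizer takes the best action with probability 1, whereas a softmax-style
# policy assigns different probabilities to actions with different utilities.
import math

utilities = {"a1": 1.0, "a2": 2.0, "a3": 2.5}  # hypothetical action utilities

def argmax_policy(utilities):
    best = max(utilities, key=utilities.get)
    return {a: (1.0 if a == best else 0.0) for a in utilities}

def softmax_policy(utilities, temperature=1.0):
    weights = {a: math.exp(u / temperature) for a, u in utilities.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

print(argmax_policy(utilities))   # {'a1': 0.0, 'a2': 0.0, 'a3': 1.0}
print(softmax_policy(utilities))  # roughly {'a1': 0.12, 'a2': 0.33, 'a3': 0.55}
```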

Comment by harfe on [deleted post] 2023-11-13T13:12:42.771Z

Some of the downvotes were probably because of the unironic use of the term TESCREAL. This term mixes a bunch of different things together, which makes your writing less clear.

Comment by harfe on Buck's Shortform · 2023-10-16T15:46:12.775Z · LW · GW

Sure, I'd be happy to read a draft

Comment by harfe on Buck's Shortform · 2023-10-15T22:31:06.530Z · LW · GW

I am going to assume that in the code, when calculating p_alice_win_given_not_caught, we do not divide the term by two (since this is not that consistent with the description; I am also assuming that is a typo and is meant, which would also be more consistent with other stuff). So I am going to assume a symmetrical version.

Here, P(Alice wins) is . Wlog we can assume (otherwise Bob will run everything or nothing in shielded mode).

We claim that is a (pure) Nash equilibrium, where .

To verify, let's first show that Alice cannot make a better choice if Bob plays . We have . Since this only depends on the sum, we can make the substitution . Thus, we want to maximize . We have . Rearranging, we get . Taking logs, we get . Rearranging, we get . Thus, is the optimal choice. This means that if Bob sticks to his strategy, Alice cannot do better than .

Now, let's show that Bob cannot do better. We have . This does not depend on and anymore, so any choice of and is optimal if Alice plays .

(If I picked the wrong version of the question, and you actually want some symmetry: I suspect that the solution will have similarities, or that in some cases the solution can be obtained by rescaling the problem back into a more symmetric form.)

Comment by harfe on To open-source or to not open-source, that is (an oversimplification of) the question. · 2023-10-13T17:57:57.290Z · LW · GW

This article talks a lot about risks from AI. I wish the author were more specific about what kinds of risks they are thinking about. For example, it is unclear which parts are motivated by extinction risk and which are not. The same goes for the benefits of open-sourcing these models. (Note: I haven't read the reports this article is based on; those might be more specific.)

Comment by harfe on Provably Safe AI · 2023-10-05T23:06:13.946Z · LW · GW

Thank you for writing this review.

The strategy assumes we'll develop a good set of safety properties that we're demanding proof of.

I think this is very important. From skimming the paper, it seems that unfortunately the authors do not discuss it much. I imagine that formally specifying safety properties is actually a rather difficult step.

To go with the example of not helping terrorists spread harmful viruses: how would you even go about formulating this mathematically? This seems highly non-trivial to me. Do you need to mathematically formulate what exactly counts as a harmful virus?

The same holds for Asimov's three laws of robotics: turning these into actual math or code seems quite challenging.

There's likely some room for automated systems to figure out what safety humans want, and turn it into rigorous specifications.

Probably obvious to many, but I'd like to point out that these automated systems themselves need to be sufficiently aligned to humans, while also accomplishing tasks that are difficult for humans to do and probably involve a lot of moral considerations.

Comment by harfe on Five neglected work areas that could reduce AI risk · 2023-09-24T03:10:24.109Z · LW · GW

A common response is that “evaluation may be easier than generation”. However, this doesn't mean evaluation will be easy in absolute terms, or relative to one’s resources for doing it, or that it will depend on the same resources as generation.

I wonder to what degree this is true for the human-generated alignment ideas that are being submitted to LessWrong/the Alignment Forum.

For mathematical proofs, evaluation is (imo) usually easier than generation: a well-written proof can often be evaluated by reading it once, whereas the person who wrote it typically had to consider and discard many different approaches.

To what degree does this also hold for alignment research?

Comment by harfe on The Dick Kick'em Paradox · 2023-09-24T01:22:10.897Z · LW · GW

The setup violates a fairness condition that has been talked about previously.

From https://arxiv.org/pdf/1710.05060.pdf, section 9:

We grant that it is possible to punish agents for using a specific decision procedure, or to design one decision problem that punishes an agent for rational behavior in a different decision problem. In those cases, no decision theory is safe. CDT performs worse than FDT in the decision problem where agents are punished for using CDT, but that hardly tells us which theory is better for making decisions. [...]

Yet FDT does appear to be superior to CDT and EDT in all dilemmas where the agent’s beliefs are accurate and the outcome depends only on the agent’s behavior in the dilemma at hand. Informally, we call these sorts of problems “fair problems.” By this standard, Newcomb’s problem is fair; Newcomb’s predictor punishes and rewards agents only based on their actions. [...]

There is no perfect decision theory for all possible scenarios, but there may be a general-purpose decision theory that matches or outperforms all rivals in fair dilemmas, if a satisfactory notion of “fairness” can be formalized.

Comment by harfe on If we had known the atmosphere would ignite · 2023-08-17T12:25:18.682Z · LW · GW

Is the organization offering the prize supposed to define "alignment" and "AGI", or the person who claims the prize? This is unclear to me from reading your post.

Defining alignment (sufficiently rigorously that a formal proof of the (im)possibility of alignment is conceivable) is hard! Such formal definitions would be very valuable by themselves (without any proofs), especially if people widely agree that the definitions capture the important aspects of the problem.

Comment by harfe on The Learning-Theoretic Agenda: Status 2023 · 2023-07-10T15:08:04.140Z · LW · GW

I think the conjecture is also false in the case that utility functions map from to .

Let us consider the case of and . We use , where is the largest integer such that starts with (and ). As for , we use , where is the largest integer such that starts with (and ). Both and are computable, but they are not locally equivalent. Under reasonable assumptions on the Solomonoff prior, the policy that always picks action is the optimal policy for both and (see proof below).

Note that since the policy is computable and very simple, is not true, and we have instead. I suspect that the issues are still present even with an additional condition, but finding a concrete example with an uncomputable policy is challenging.

proof: Suppose that and are locally equivalent. Let be an open neighborhood of the point and , be such that for all .

Since , we have . Because is an open neighborhood of , there is an integer such that for all . For such , we have This implies . However, this is not possible for all . Thus, our assumption that and are locally equivalent was wrong.

Assumptions about the Solomonoff prior: For all , the sequence of actions that produces the sequence of with the highest probability is (recall that we start with observations in this setting). With this assumption, it can be seen that the policy that always picks action is among the best policies for both and .

I think this is actually a natural behaviour for a reasonable Solomonoff prior: It is natural to expect that is more likely than . It is natural to expect that the sequence of actions that leads to over has low complexity. Always picking is low complexity.

It is possible to construct an artificial UTM that ensures that "always take " is the best policy for , : a UTM can be constructed such that the corresponding Solomonoff prior assigns 3/4 probability to the program/environment "start with o_1. after action a_i, output o_i". The rest of the probability mass gets distributed according to some other, more natural UTM.

Then, for , in each situation with history the optimal policy has to pick (the actions outside of this history have no impact on the utility): with 3/4 probability it will get utility of at least , and with probability at least . Whereas, for the choice of , with probability it will have utility of , and with probability it can get at most . We calculate , i.e. taking action is the better choice.

Similarly, for , the optimal policy has to pick too in each situation with history . Here, the calculation looks like .

Comment by harfe on Infra-Bayesian Logic · 2023-07-05T20:07:09.634Z · LW · GW

"inclusion map" refers to the map , not the coproduct . The map is a coprojection (these are sometimes called "inclusions", see https://ncatlab.org/nlab/show/coproduct).

A simple example in sets: We have two sets , , and their disjoint union . Then the inclusion map is the map that maps (as an element of ) to (as an element of ).
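
Written out with generic sets A and B (the notation is mine, since I am only restating the standard definition):

```latex
% Generic version of the example above (notation mine): for sets A and B with
% disjoint union A \sqcup B, the inclusion maps are the coprojections
\[
\iota_A \colon A \to A \sqcup B, \quad a \mapsto a,
\qquad
\iota_B \colon B \to A \sqcup B, \quad b \mapsto b.
\]
% The coproduct is the object A \sqcup B (together with \iota_A and \iota_B),
% whereas "inclusion map" refers to the individual maps \iota_A and \iota_B.
```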

Comment by harfe on Daniel Kokotajlo's Shortform · 2023-06-19T01:06:20.783Z · LW · GW

What is an environmental subagent? An agent on a remote datacenter that the builders of the original agent don't know about?

Another thing that is not so clear to me in this description: Does the first agent consider the alignment problem of the environmental subagent? It sounds like the environmental subagent cares about paperclip-shaped molecules, but is this something the first agent would be OK with?

Comment by harfe on UK PM: $125M for AI safety · 2023-06-12T13:09:42.675Z · LW · GW

This does not sound very encouraging from the perspective of AI Notkilleveryoneism. When the announcement of the foundation model task force talks about safety, I cannot find hints that they mean existential safety. Rather, it seems to be about safety for commercial purposes.

A lot of the money might go into building a foundation model. If they are serious about existential safety, they should at least announce that they will not share the weights or the details of how to build it.

This might create an AI safety race to the top as a solution to the tragedy of the commons

This seems to be the opposite of that. The announcement talks a lot about establishing the UK as a world leader, e.g. "establish the UK as a world leader in foundation models".

Comment by harfe on Transformative AGI by 2043 is <1% likely · 2023-06-09T22:32:05.900Z · LW · GW

There is an additional problem where one of the two key principles for their estimates is

Avoid extreme confidence

This principle might lead you to pick probability estimates that keep some distance from 1 (e.g. picking at most 0.95).

If you build a fully conjunctive model, and you are not that great at extreme probabilities, then you will have a strong bias towards low overall estimates. And you can make your probability estimates even lower by introducing more (conjunctive) factors.
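
A toy illustration of this bias (the cap of 0.95 and the factor counts are my own numbers, not the authors'):

```python
# Toy illustration (my own numbers, not the authors'): if "avoid extreme
# confidence" caps every factor at 0.95, a fully conjunctive model's overall
# estimate shrinks quickly as more factors are multiplied together.
cap = 0.95
for n_factors in [5, 10, 20, 30]:
    print(f"{n_factors} factors capped at {cap}: overall estimate <= {cap ** n_factors:.2f}")
# 5 -> 0.77, 10 -> 0.60, 20 -> 0.36, 30 -> 0.21
```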

Comment by harfe on Article Summary: Current and Near-Term AI as a Potential Existential Risk Factor · 2023-06-07T23:57:28.778Z · LW · GW

Nitpick: The title the authors picked ("Current and Near-Term AI as a Potential Existential Risk Factor") seems to better represent the content of the article than the title you picked for this LW post ("The Existential Risks of Current and Near-Term AI").

Reading the title, I was expecting an argument that extinction could come extremely soon (e.g. by chaining GPT-4 instances together in some novel and clever way). The authors of the article talk about something very different, imo.

Comment by harfe on Improving Mathematical Reasoning with-Process Supervision · 2023-05-31T19:47:19.469Z · LW · GW

From just reading your excerpt (and not the whole paper), it is hard to determine how much alignment washing is going on here.

  • What is aligned chain-of-thought? What would unaligned chain-of-thought look like?
  • What exactly does alignment mean in the context of solving math problems?

But maybe these worries can be answered by reading the full paper...

Comment by harfe on Yoshua Bengio: How Rogue AIs may Arise · 2023-05-23T18:49:20.492Z · LW · GW

I think overall this is a well-written blogpost. His previous blogpost already indicated that he took the arguments seriously, so this is not too much of a surprise. That previous blogpost was discussed and partially criticized on LessWrong. As for the current blogpost, I also find it noteworthy that active LW user David Scott Krueger is in the acknowledgements.

This blogpost might even be a good introduction to AI xrisk for some people.

I hope he engages further with the issues. For example, I feel like inner misalignment is still sort of missing from the arguments.

Comment by harfe on TED talk by Eliezer Yudkowsky: Unleashing the Power of Artificial Intelligence · 2023-05-09T16:47:47.174Z · LW · GW

I googled "Zeitgeist Addendum" and it does not seem to be a thing that would be useful for AGI risk.

  • It is a follow-up to a 9/11 conspiracy movie.
  • It promotes some naive economic ideas (e.g. that abolishing money would fix a lot of issues).
  • The Venus Project appears not to be very successful.

Do you claim the movie had any great positive impact or presented any new, true, and important ideas?

Comment by harfe on Annotated reply to Bengio's "AI Scientists: Safe and Useful AI?" · 2023-05-09T02:03:03.781Z · LW · GW

There is also another linkpost for the same blogpost: https://www.lesswrong.com/posts/EP92JhDm8kqtfATk8/yoshua-bengio-argues-for-tool-ai-and-to-ban-executive-ai

Comment by harfe on Yoshua Bengio argues for tool-AI and to ban "executive-AI" · 2023-05-09T01:59:52.444Z · LW · GW

There is also some commentary here: https://www.lesswrong.com/posts/kGrwufqxfsyuaMREy/annotated-reply-to-bengio-s-ai-scientists-safe-and-useful-ai

Comment by harfe on Yoshua Bengio argues for tool-AI and to ban "executive-AI" · 2023-05-09T01:57:15.817Z · LW · GW

Overall this is still encouraging. It seems to take seriously that

  • value alignment is hard
  • executive-AI should be banned
  • banning executive-AI would be hard
  • alignment research and AI safety are worthwhile.

I feel like there are enough shared assumptions that collaboration or dialogue with AI notkilleveryoneists could be very useful.

That said, I wish there were more details about his Scientist AI idea:

  • How exactly will the Scientist AI be used?
  • Should we expect the Scientist AI to have situational awareness?
  • Would the Scientist AI be allowed to write large-scale software projects that are likely to get executed after a brief review of the code by a human?
  • Are there concerns about Mesa-optimization?

Also it is not clear to me whether the safety is supposed to come from:

  • the AI cannot really take actions in the world (and even if there is a superhuman AI that wants to do large-scale harm, it will not succeed, because it cannot take actions that achieve these goals)
  • the AI has no intrinsic motivation for large-scale harm (while its output bits could in principle create large-scale harm, such a string of bits is unlikely because there is no drive towards such strings of bits).
  • a combination of these two.
Comment by harfe on Yoshua Bengio argues for tool-AI and to ban "executive-AI" · 2023-05-09T00:29:05.984Z · LW · GW

Potentially relevant: Yoshua Bengio got funding from OpenPhil in 2017:

https://www.openphilanthropy.org/grants/montreal-institute-for-learning-algorithms-ai-safety-research/

Comment by harfe on TED talk by Eliezer Yudkowsky: Unleashing the Power of Artificial Intelligence · 2023-05-08T20:47:47.361Z · LW · GW

There is this documentary: https://en.wikipedia.org/wiki/Do_You_Trust_This_Computer%3F Probably not quite what you want. Maybe the existing videos of Robert Miles (on Mesa-Optimization and other things) would be better than a full documentary.

Comment by harfe on jacquesthibs's Shortform · 2023-05-05T20:10:19.672Z · LW · GW

Maybe something like this can be extracted from stampy.ai (I am not that familiar with stampy fyi, its aims seem to be broader than what you want.)

Comment by harfe on LessWrong exporting? · 2023-05-03T19:04:56.641Z · LW · GW

But why? And which user?

Comment by harfe on Forum Proposal: Karma Transfers · 2023-05-01T00:56:44.449Z · LW · GW

I fear that making karma more like a currency is not good for the culture on LW.

I think money would be preferable to karma bounties in most situations. An alternative for bounties could be a transfer of Mana on Manifold: Mana is already (kind of) a currency.

Comment by harfe on GPT-4 is bad at strategic thinking · 2023-03-27T15:31:59.828Z · LW · GW

Certain kinds of "thinking ahead" are difficult to do within one forward pass. Not impossible, and GPT-4 likely does a lot of thinking ahead within one forward pass.

If you have lots of training data on a game, you can often do well without thinking ahead much. But for a novel game, you have to mentally simulate a lot of options for how the game could continue. For example, in Connect 4, if you consider all of your moves and all possible responses, that is 7 × 7 = 49 possible game states you need to consider. But with experience in this game, you learn to only consider a few of these 49 options.
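
A minimal sketch of that counting argument (my own toy code; it ignores full columns and positions where the game has already ended):

```python
# Minimal sketch of the counting argument above (my own toy code; it ignores
# full columns and finished games): looking just one move ahead for both
# players in Connect 4 already means considering 7 * 7 = 49 continuations.
NUM_COLUMNS = 7

def count_two_ply_continuations(num_columns: int = NUM_COLUMNS) -> int:
    count = 0
    for my_move in range(num_columns):             # my 7 possible moves
        for opponent_reply in range(num_columns):  # 7 possible replies to each
            count += 1
    return count

print(count_two_ply_continuations())  # 49
```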

Maybe this is a reason why GPT-4 is not so good when playing mostly novel strategy games.

Comment by harfe on There are no coherence theorems · 2023-02-21T03:20:06.252Z · LW · GW

The title "There are no coherence theorems" seems click-baity to me, when the claim relies on a very particular definition of "coherence theorem". My thought upon reading the title (before reading the post) was something like "surely, VNM would count as a coherence theorem". I am also a bit bothered by the confident assertions in the Conclusion and Bottom-lines that there are no coherence theorems, for similar reasons.

Comment by harfe on There are no coherence theorems · 2023-02-21T03:13:57.109Z · LW · GW

What is the function evaluateAction supposed to do when human values contain non-consequentialist components? I assume ExpectedValue is a real number. Maybe there could be a way to build a utility function that corresponds to the code, but that is hard to judge since you have left the details out.

Comment by harfe on There are no coherence theorems · 2023-02-21T03:08:33.582Z · LW · GW

The post argues a lot against completeness. I have a hard time imagining an advanced AGI (which has the ability to self-reflect a lot) that has a lot of preferences, but no complete preferences.

Your argument seems to be something like "There can be outcomes A and B where neither nor . This property can be preserved if we sweeten A a little bit: then we have but neither nor . If faced with a decision between A and B (or faced with a choice between ), the AGI can do something arbitrary, eg just flip a coin."

I expect advanced AGI systems capable of self-reflection to think about whether A or B seems more valuable (unless the system thinks the situation is so low-stakes that it is not worth thinking about; but computation is cheap, and in AI safety we typically care about high-stakes situations anyway). To use your example: if A is a lottery that gives the agent a Fabergé egg for sure, and B is a lottery that returns to the agent their long-lost wedding album, then I would expect an advanced agent to invest a bit into figuring out which of those it deems more valuable.

Also, somewhere in the weights/code of the AGI there has to be some decision procedure that specifies what the AGI should do if faced with the choice between A and B. It would be possible to hardcode that the AGI should flip a coin when faced with a certain choice. But by default, I expect the choice between A and B to depend on some learned heuristics (+reflection) rather than being hardcoded. A plausible candidate here would be a mesa-optimizer, which might have a preference between A and B even when the outer training rules don't encourage one.

A priori, the following outputs of an advanced AGI seem unlikely and unnatural to me:

  • If faced with a choice between and , the AGI chooses each with
  • If faced with a choice between and , the AGI chooses each with
  • If faced with a choice between and , the AGI chooses with .
Comment by harfe on DragonGod's Shortform · 2023-02-07T22:36:58.622Z · LW · GW

It's ok, you don't have to republish it just for me. Looking forward to your finished post; it's an interesting and non-obvious topic.

Comment by harfe on DragonGod's Shortform · 2023-02-07T22:00:01.362Z · LW · GW

As a commenter on that post, I wish you hadn't unpublished it. From what I remember, you had stated that it was written quickly and for that reason I am fine with it not being finished/polished. If you want to keep working on the post, maybe you can make a new post once you feel you are done with the long version.

Comment by harfe on [deleted post] 2023-02-06T20:14:34.194Z

Nice post! Here are some thoughts:

  • We do not necessarily need fully formally proven theorems; other forms of high confidence in safety could be sufficient. For example, I would be comfortable with turning on an AI that is safe iff the Goldbach conjecture is true, even if we have no proof of the Goldbach conjecture.
  • We currently don't have any idea what kind of theorems we want to prove. Formulating the right theorem is likely more difficult than proving it.
  • Theorems can rely on imperfect assumptions (ones that do not hold exactly in the real world). In such a case, it is not clear that they give us the degree of high confidence that we would like to have.
  • Theorems that rely on imperfect assumptions could nonetheless be very valuable and increase overall safety. For example, if we could prove something like "this software is corrigible, assuming we are in a world run by Newtonian physics", then this could (depending on the details) be strong evidence for the software being corrigible in a quantum world.
Comment by harfe on What's the simplest concrete unsolved problem in AI alignment? · 2023-01-28T21:40:41.700Z · LW · GW

yes, sorry, I meant to say the opposite. I changed it now.

Comment by harfe on What's the simplest concrete unsolved problem in AI alignment? · 2023-01-27T23:16:19.052Z · LW · GW

I have the impression that Neel Nanda means something different by the word "concrete" than agg does, since agg does not consider problems of the type "explain something in a good way" to be concrete problems.

For example, I would think that "Hunt through Neuroscope for the toy models and look for interesting neurons to focus on." would not match agg's bar for concreteness. But maybe other problems from Neel Nanda might.

Comment by harfe on Thoughts on hardware / compute requirements for AGI · 2023-01-24T18:17:56.922Z · LW · GW

Nanotech industry-rebuilding comes earlier than von Neumann level? I doubt that. A lot of existing people are close to von Neumann level.

Maybe your argument is that there will be so many AGIs that they can do nanotech industry rebuilding while individually being very dumb. But I would then argue that the collective already exceeds von Neumann, or large groups of humans, in intelligence.

Comment by harfe on AGI and the EMH: markets are not expecting aligned or unaligned AI in the next 30 years · 2023-01-10T22:56:30.881Z · LW · GW

if you believe that financial markets are wrong, then you have the opportunity to (1) borrow cheaply today and use that money to e.g. fund AI safety work

How exactly would I go about doing that? A priori this seems difficult: if there were opportunities to cheaply borrow money for, e.g., 10 years, lots of people with strong time discounting would take that option.

Comment by harfe on I believe some AI doomers are overconfident · 2022-12-20T19:05:50.637Z · LW · GW

I wonder if someone could create a similar structured argument for the opposite viewpoint. (Disclaimer: I do not endorse a mirrored argument of OP's argument)

You could start with "People who believe there is a >50% possibility of humanity's survival in the next 50 years or so strike me as overconfident.", and then point out that for every plan for humanity's survival, there are a lot of things that could potentially go wrong.

The analogy is not perfect, but to a first approximation, we should expect that things can go wrong in both directions.

Comment by harfe on AI can exploit safety plans posted on the Internet · 2022-12-04T14:38:21.942Z · LW · GW

I dislike the framing of this post. Reading it gave me the impression that

  • You wrote a post with a big prediction ("AI will know about safety plans posted on the internet")
  • Your post was controversial and did not receive a lot of net-upvotes
  • Comments that disagree with you received a lot of upvotes. Here you make it sound as if these upvoted comments disagreed with the above prediction.

But actually reading the original post and the comments reveals a different picture:

  • The "prediction" was not a prominent part of your post.
  • Comments such as this imo excellent comment did not disagree with the "prediction", but with other aspects of your post.

Overall, I think it's highly likely that the downvotes were not because people did not believe that future AI systems will know about safety plans posted on LW/EAF, but because of other reasons. I think people were well aware that AI systems will get to know about plans for AI safety, just as I think it is very likely that this comment itself will be found in the training data of future AI systems.

Comment by harfe on On Kelly and altruism · 2022-11-25T12:41:13.286Z · LW · GW

Yes, good catch. I edited. I made two mistakes in the above:

  • confused personal money with "altruistic money": In the beginning of the comment I assumed that all money would be donated, and none kept. By the end of my comment, my mental model had apparently shifted to also include personal money/"selfish money" (which would be justified for people to keep).

  • I included a range of numbers for the possible bet size, and thought that lower bet amounts would be justified due to diminishing returns. Checking the numbers again, the diminishing returns are not that significant (at the scale of $1B likely far below 10x), and my opinion is now that you should bet everything.

Comment by harfe on On Kelly and altruism · 2022-11-25T07:03:48.481Z · LW · GW

An assumption that seems to be present in the betting framework here is that you frequently encounter bets which have positive EV.

I think in real life, that assumption is not particularly realistic. Most people do not encounter a lot of opportunities whose EV (in money) is significantly above ordinary things such as investing in the stock market.

Suppose you have $100k and are in the situation where you only win 10% of the time, but if you do, you get paid out 10,000x your bet size. But after the bet you do not expect to find similar opportunities again, and you also plan to donate everything to GiveDirectly. If you were to rank optimize, which, iiuc, means maximizing the probability of being "the richest person in the room", then you should bet nothing, because then you have a 90% probability of being richer than the counterfactual-you who bets a fraction of the wealth. But if you care a lot about the value your donations provide to the world, then you should probably bet $40k-$100k (depending on the diminishing returns of money to GiveDirectly, or maybe valuing having a bit of money for selfish reasons). Edit: But if you care a lot about the value your donations provide to the world, then you should probably bet all $100k (there are likely diminishing returns of money given to GiveDirectly, but I think the high upside of the bet outweighs the diminishment by a big margin. Also, by assumption of this thought experiment, you were not planning to keep any money for selfish purposes.).
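
A minimal sketch of the expected-donation calculation (the numbers are from the thought experiment above; treating the donation's value as linear in dollars is my own simplifying assumption):

```python
# Minimal sketch of the bet above (numbers from the thought experiment; treating
# the donation's value as linear in dollars is a simplifying assumption).
wealth = 100_000
p_win, payout_multiple = 0.10, 10_000

for fraction in [0.0, 0.4, 1.0]:
    bet = fraction * wealth
    expected_donation = (1 - p_win) * (wealth - bet) + p_win * (wealth - bet + payout_multiple * bet)
    print(f"bet {fraction:.0%} of wealth -> expected donation ${expected_donation:,.0f}")

# The expected donation is roughly $100k + 999 * bet, i.e. monotonically
# increasing in the bet size, so without significant diminishing returns
# the whole $100k should be bet.
```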

Comment by harfe on Dumb and ill-posed question: Is conceptual research like this MIRI paper on the shutdown problem/Corrigibility "real" · 2022-11-24T13:11:39.455Z · LW · GW

This just feels like pretend, made-up research that they put math equations in to seem like it's formal and rigorous.

Can you elaborate on which parts feel made up to you? E.g.:

  • modelling a superintelligent agent as a utility maximizer
  • considering a 3-step toy model with , ,
  • assuming that a specification of exists

At the end of all those questions, I feel no closer to knowing if a machine would stop you from pressing a button to shut it off.

The authors do not claim to have solved the problem and instead state that this is an open problem. So it is not surprising that there is no satisfying answer.

I would also like to note that the paper has many more caveats.

Do you think it would still feel fake to you if the paper had a more positive answer to the problem described (e.g. a description of how to modify the utility function of an agent in a toy model such that it does not incentivize the agent to prevent or cause the pressing of the shutdown button)?

Comment by harfe on [deleted post] 2022-11-16T04:01:39.625Z

I fail to see how this changes the answer to the St Petersburg paradox. We have the option of 2 utility with 51% probability and 0 utility with 49% probability, and a second option of utility 1 with 100%. Removing the worst 0.5% of the probability distribution gives us a probability of 48.5% for utility 0, and removing the best 0.5% of the probability distribution gives us a probability of 50.5% for utility 2. Renormalizing so that the probabilities sum to 1 gives us probabilities of 48.5/99 for utility 0 and 50.5/99 for utility 2. The expected value is then still greater than 1. Thus we should choose the option where we have a chance at doubling utility.
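
Spelling out the arithmetic:

```latex
% The renormalized probabilities and the resulting expected utility:
\[
P(u = 0) = \frac{0.485}{0.99} \approx 0.49, \qquad
P(u = 2) = \frac{0.505}{0.99} \approx 0.51,
\]
\[
\mathbb{E}[u] = 2 \cdot \frac{0.505}{0.99} = \frac{1.01}{0.99} \approx 1.02 > 1.
\]
```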

Comment by harfe on Learning societal values from law as part of an AGI alignment strategy · 2022-10-21T14:46:58.197Z · LW · GW

P(misalignment x-risk | AGI that understands democratic law) < P(misalignment x-risk | AGI)

I don't think this is particularly compelling. While technically true, the difference between those probabilities is tiny. Any AGI is highly likely to understand democratic laws.

Comment by harfe on Is GPT-N bounded by human capabilities? No. · 2022-10-18T00:05:08.095Z · LW · GW

Summary of your argument: The training data can contain outputs of processes that have superhuman abilities (e.g. chess engines); therefore LLMs can exceed human performance.

More speculatively, there might be another source of (slight?) superhuman abilities: GPT-N could generalize/extrapolate from human abilities to superhuman abilities, if it was plausible that at some point in the future these superhuman abilities would be shown on the internet. For example, it is conceivable that GPT-N prompted with "Here is a proof of the Riemann hypothesis that has been verified extensively:" would actually output a valid proof, even if a proof of the Riemann hypothesis were beyond the ability of humans in the training data.

But perhaps this is an assumption people often make about LLMs.

I think people often claim something along the lines of "GPT-8 cannot exceed human capacity" (which is technically false) to argue that a (naively) upscaled version of GPT-3 cannot reach AGI. I think we should expect that there are at least some limits to the intelligence we can obtain from GPT-8 if it is just trained to predict text (without any amplification steps or RL).

Comment by harfe on [deleted post] 2022-10-17T03:06:28.603Z

Because it was not trained using reinforcement learning and doesn't have a utility function, which means that it won't face problems like mesa-optimisation

I think this is at least a non-obvious claim. In principle, it is conceivable that mesa-optimisation can occur outside of RL. There could be an agent/optimizer in (highly advanced, future) predictive models, even if the system does not really have a base objective. In this case, it might be better to think in terms of training stories rather than inner+outer alignment. Furthermore, there could still be issues with gradient hacking.

Comment by harfe on Lessons learned from talking to >100 academics about AI safety · 2022-10-10T15:44:42.911Z · LW · GW

Great post! I agree that academia is a resource that could be very useful for AI safety.

There are a lot of misunderstandings around AI safety and I think the AIS community has failed to properly explain the core ideas to academics until fairly recently. Therefore, I often encountered confusions like that AI safety is about fairness, self-driving cars and medical ML.

I think these misunderstandings are understandable based on the term "AI safety". Maybe it would be better to call the field AGI safety or AGI alignment? This seems to me like a more honest description of the field.

You also write that you find it easier to not talk about xrisk. If we avoid talking about xrisk while presenting AI safety, then some misunderstandings about AI safety will likely persist in the future.