Coordination and well-scaling projects 2023-09-28T08:01:52.080Z
Advanced AI can beat humanity 2023-04-02T10:59:34.174Z
Loppukilpailija's Shortform 2023-03-23T10:59:28.310Z
Language models are not inherently safe 2023-03-07T21:15:08.595Z
Takeaways from calibration training 2023-01-29T19:09:30.815Z


Comment by Loppukilpailija (jarviniemi) on Loppukilpailija's Shortform · 2023-09-26T09:57:19.199Z · LW · GW

Devices and time to fall asleep: a small self-experiment

I did a small self-experiment on the question "Does the use of devices (phone, laptop) in the evening affect the time taken to fall asleep?".


On each day during the experiment I went to sleep at 23:00. 

At 21:30 I randomized what I'll do at 21:30-22:45. Each of the following three options was equally likely:

  • Read a physical book
  • Read a book on my phone
  • Read a book on my laptop

At 22:45-23:00 I brushed my teeth etc. and did not use devices at this time.

Time taken to fall asleep was measured by a smart watch. (I did not choose the watch for its sleep-tracking accuracy, though.) I had blue light filters on my phone and laptop.


I ran the experiment for n = 17 days (the days were not consecutive, but all fell within a span of about a month).

I ended up having 6 days for "phys. book", 6 days for "book on phone" and 5 days for "book on laptop".

On one experiment day (when I read a physical book), my watch reported me as falling asleep at 21:31. I discarded this as a measuring error.

For the resulting 16 days, average times to fall asleep were 5.4 minutes, 21 minutes and 22 minutes, for phys. book, phone and laptop, respectively.

[Raw data:

Phys. book: 0, 0, 2, 5, 22

Phone: 2, 14, 21, 24, 32, 33

Laptop: 0, 6, 10, 27, 66.]
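For readers who want to re-derive the summary numbers, here is a minimal sketch that recomputes per-condition means and medians from the raw data as listed above. The dictionary name and layout are illustrative assumptions. Note that the listed values give a mean of 5.8 minutes for the physical book rather than the quoted 5.4, so the raw values may have been rounded or transcribed slightly differently.

```python
from statistics import mean, median

# Raw data copied from the list above (minutes taken to fall asleep).
minutes_to_fall_asleep = {
    "phys. book": [0, 0, 2, 5, 22],
    "phone": [2, 14, 21, 24, 32, 33],
    "laptop": [0, 6, 10, 27, 66],
}

# (number of nights, mean, median) for each condition.
summary = {
    condition: (len(values), mean(values), median(values))
    for condition, values in minutes_to_fall_asleep.items()
}

for condition, (n, avg, med) in summary.items():
    print(f"{condition}: n={n}, mean={avg:.1f}, median={med}")
```

The medians are less sensitive to single outliers such as the 66-minute laptop night, which matters with only five or six nights per condition.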


The sample size was small (I unfortunately lost the motivation to continue). Nevertheless it gave me quite strong evidence that being on devices indeed does affect sleep.

Comment by Loppukilpailija (jarviniemi) on Loppukilpailija's Shortform · 2023-09-22T07:21:24.613Z · LW · GW

Iteration as an intuition pump

I feel like many game/decision theoretic claims are most easily grasped when looking at the iterated setup:

Example 1. When one first sees the prisoner's dilemma, the argument that "you should defect because, whatever the other person does, you are better off defecting" feels compelling. The counterargument goes "the other person can predict what you'll do, and this can affect what they'll play".

This has some force, but I have had a hard time really feeling the leap from "you are a person who does X in the dilemma" to "the other person models you as doing X in the dilemma". (One thing that makes this difficult is that usually in PD it is not specified whether the players can communicate beforehand or what information they have of each other.) And indeed, humans' models of other humans are limited - this is not something you should just dismiss.

However, the point "the Nash equilibrium is not necessarily what you should play" does hold, as is illustrated by the iterated Prisoner's dilemma. It feels intuitively obvious that in a 100-round dilemma there ought to be something better than always defecting.

This is among the strongest intuitions I have for "Nash equilibria do not generally describe optimal solutions".
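To make the 100-round intuition concrete, here is a minimal sketch comparing always-defect with tit-for-tat. The payoff values (3 each for mutual cooperation, 1 each for mutual defection, 5 and 0 for a lone defector and a lone cooperator) are the commonly used textbook numbers and are an assumption here, as are the function names.

```python
# Assumed payoffs: (my move, their move) -> (my payoff, their payoff).
PAYOFF = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

def always_defect(own_history, other_history):
    return "D"

def tit_for_tat(own_history, other_history):
    # Cooperate first, then copy the opponent's previous move.
    return other_history[-1] if other_history else "C"

def play(strategy_1, strategy_2, rounds=100):
    """Total payoffs of two strategies over an iterated prisoner's dilemma."""
    history_1, history_2 = [], []
    total_1 = total_2 = 0
    for _ in range(rounds):
        move_1 = strategy_1(history_1, history_2)
        move_2 = strategy_2(history_2, history_1)
        payoff_1, payoff_2 = PAYOFF[(move_1, move_2)]
        total_1 += payoff_1
        total_2 += payoff_2
        history_1.append(move_1)
        history_2.append(move_2)
    return total_1, total_2

print(play(always_defect, always_defect))  # (100, 100)
print(play(tit_for_tat, tit_for_tat))      # (300, 300)
print(play(always_defect, tit_for_tat))    # (104, 99)
```

Under these assumed payoffs, two tit-for-tat players end up with three times the payoff of two always-defectors, even though always-defect wins the head-to-head matchup by a few points.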


Example 2. When presented with lotteries, i.e. opportunities such as "X% chance you win A dollars, (100-X)% chance of winning B dollars", it's not immediately obvious that one should maximize expected value (or, at least, humans generally exhibit loss aversion, bias towards certain outcomes, sensitivity to framing etc.).

This feels much clearer when given the option to choose between lotteries repeatedly. For example, if you are presented with two buttons, one giving you 1 dollar for sure and the other giving you a 40% chance of winning 3 dollars (expected value 1.2 dollars), and you are allowed to press the buttons a total of 100 times, it feels much clearer that you should always pick the one with the highest expected value. Indeed, as you are given more button presses, the probability of you getting (a lot) more money that way tends to 1 (by the law of large numbers).

This gives me a strong intuition that expected values are the way to go.
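The two-button example can be checked exactly with the binomial distribution. The sketch below computes the probability that always pressing the risky button yields strictly more money than always pressing the safe one; the function and parameter names are illustrative assumptions.

```python
from math import comb

def prob_risky_beats_safe(n_presses, p_win=0.4, prize=3, safe=1):
    """P(always pressing the risky button pays strictly more than always pressing the safe one)."""
    total = 0.0
    for wins in range(n_presses + 1):
        if prize * wins > safe * n_presses:
            # Binomial probability of exactly `wins` successes.
            total += comb(n_presses, wins) * p_win**wins * (1 - p_win) ** (n_presses - wins)
    return total

for n in (10, 100, 1000):
    print(n, prob_risky_beats_safe(n))
```

With 100 presses the risky button comes out ahead roughly nine times out of ten, and with 1000 presses it is nearly certain to.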

Example 3. I find Newcomb's problem a bit confusing to think about (and I don't seem to be alone in this). This is, however, more or less the same problem as prisoner's dilemma, so I'll be brief here.

The basic argument "the contents of the boxes have already been decided, so you should two-box" feels compelling, but then you realize that in an iterated Newcomb's problem you will, by backward induction, always two-box.

This, in turn, sounds intuitively wrong, in which case the original argument proves too much. 

One thing I like about iteration is that it makes the concept of "it really is possible to make predictions about your actions" feel more plausible: there's clear-cut information about what kind of plays you'll make, namely the previous rounds. I sometimes feel like rejecting the premise, or thinking that "sure, if the premise holds, I should one-box, but it doesn't really work that way in real life, this feels like one of those absurd thought experiments that don't actually teach you anything". Iteration solves this issue.

Another pump I like is "how many iterations do there need to be before you Cooperate/maximize-expected-value/one-box?". There (I think) is some number of iterations for this to happen, and, given that, it feels like "1" is often the best answer.

All that said, I don't think iterations provide the Real Argument for/against the position presented. There's always some wiggle room for "but what if you are not in an iterated scenario, what if this truly is a Unique Once-In-A-Lifetime Opportunity?". I think the Real Arguments are something else - e.g. in example 2 I think coherence theorems give a stronger case (even if I still don't feel them as strongly on an intuitive level). I don't think I know the Real Argument for example 1/3.

Comment by Loppukilpailija (jarviniemi) on Rational Agents Cooperate in the Prisoner's Dilemma · 2023-09-02T12:44:52.573Z · LW · GW

Well written! I think this is the best exposition of non-causal decision theory I've seen. I particularly found the modified Newcomb's problem and the point it illustrates in the "But causality!" section to be enlightening.

Comment by Loppukilpailija (jarviniemi) on How to decide under low-stakes uncertainty · 2023-08-12T00:04:50.437Z · LW · GW

How I generate random numbers without any tools: come up with a sequence of ~5 digits, take their sum and look at its parity/remainder. (Alternatively, take ~5 words and do the same with their lengths.) I think I'd pretty quickly notice a bias in using just a single digit/word, but taking many of them gives me something closer to a uniform distribution.

Also, note that your "More than two options" method is non-uniform when the number of sets is not a power of two. E.g. with three sets the probabilities are 1/2, 1/4 and 1/4.
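A rough way to see why the digit-sum trick helps: if each digit is odd with probability p, and the digits are independent, the parity of a sum of k digits has P(odd) = (1 - (1 - 2p)^k) / 2, which approaches 1/2 quickly as k grows. The independence assumption is a simplification (digits a person makes up are unlikely to be independent), and the function name is hypothetical.

```python
def prob_sum_is_odd(p_odd, k):
    """P(sum of k independent digits is odd), when each digit is odd with probability p_odd."""
    return (1 - (1 - 2 * p_odd) ** k) / 2

# A heavily biased source: 70% of the digits you come up with are odd.
for k in (1, 2, 3, 5):
    print(k, prob_sum_is_odd(0.7, k))
```

Even with a 70/30 bias in single digits, the parity of a five-digit sum is within about one percentage point of a fair coin under this model.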

Comment by Loppukilpailija (jarviniemi) on Loppukilpailija's Shortform · 2023-08-07T12:38:58.687Z · LW · GW

Epistemic responsibility

"You are responsible for you having accurate beliefs."

Epistemic responsibility refers to the idea that it is on you to have true beliefs. The concept is motivated by the following two applications.


In discussions

Sometimes in discussions people are in a combative "1v1 mode", where they try to convince the other person of their position and defend their own position, in contrast to a cooperative "2v0 mode" where they share their beliefs and try to figure out what's true. See the soldier mindset vs. the scout mindset.

This may be framed in terms of epistemic responsibility: If you accept that "It is (solely) my responsibility that I have accurate beliefs", the conversation naturally becomes less about winning and more about having better beliefs afterwards. That is, a shift from "darn, my conversation partner is so wrong, how do I explain it to them" to "let me see if the other person has valuable points, or if they can explain how I could be wrong about this".

In particular, from this viewpoint it sounds a bit odd if one says the phrase "that doesn't convince me" when presented with an argument, as it's not on the other person to convince you of something. 


Note: This doesn't mean that you have to be especially cooperative in the conversation. It is your responsibility that you have true beliefs, not that you both have. If you end up being less wrong, success. If the other person doesn't, that's on them :-)


Trusting experts

There's a question Alice wants to know the answer to. Unfortunately, the question is too difficult for Alice to find out the answer herself. Hence she defers to experts, and ultimately believes what Bob-the-expert says.

Later, it turns out that Bob was wrong. How does Alice react?

A bad reaction is to be angry at Bob and throw rotten tomatoes at him.

Under the epistemic responsibility frame, the proper reaction is "Huh, I trusted the wrong expert. Oops. What went wrong, and how do I better defer to experts next time?"


When (not) to use the frame

I find the concept to be useful when revising your own beliefs, as in the above examples of discussions and expert-deferring.

One limitation is that belief-revising often happens via interpersonal communication, whereas epistemic responsibility is individualistic. So while "my aim is to improve my beliefs" is a better starting point for conversations than "my aim is to win", this is still not ideal, and epistemic responsibility is to be used with a sense of cooperativeness or other virtues.


Another limitation is that "everyone is responsible for themselves" is a bad norm for a community/society, and this is true of epistemic responsibility as well.

I'd say that the concept of epistemic responsibility is mostly for personal use. I think that especially the strongest versions of epistemic responsibility (heroic epistemic responsibility?), where you are the sole person responsible for you having true beliefs and where any mistakes are your fault, are something you shouldn't demand of others. For example, I feel like a teacher has a lot of epistemic responsibility on behalf of their students (and there are other types of responsibilities going on here).

Or whatever, use it how you want - it's on you to use it properly.

Comment by Loppukilpailija (jarviniemi) on Survey on intermediate goals in AI governance · 2023-05-26T12:30:10.634Z · LW · GW

This survey is really good!

Speaking as someone who's exploring the AI governance landscape: I found the list of intermediate goals, together with the responses, a valuable compilation of ideas. In particular it made me appreciate how large the surface area is (in stark contrast to takes on how progress in technical AI alignment doesn't scale). I would definitely recommend this to people new to AI governance.

Comment by Loppukilpailija (jarviniemi) on The Office of Science and Technology Policy put out a request for information on A.I. · 2023-05-25T09:30:31.449Z · LW · GW

For coordination purposes, I think it would be useful for those who plan on submitting a response to mark that they'll do so, and perhaps tell a little about the contents of their response. It would also be useful for those who don't plan on responding to explain why not.

Comment by Loppukilpailija (jarviniemi) on [Linkpost] "Governance of superintelligence" by OpenAI · 2023-05-23T08:10:34.895Z · LW · GW

The last paragraph stood out to me (emphasis mine).

Second, we believe it would be unintuitively risky and difficult to stop the creation of superintelligence. Because the upsides are so tremendous, the cost to build it decreases each year, the number of actors building it is rapidly increasing, and it’s inherently part of the technological path we are on, stopping it would require something like a global surveillance regime, and even that isn’t guaranteed to work. So we have to get it right.

There are efforts in AI governance that definitely don't look like "global surveillance regime"! Taking the part above at face value, the authors seem to think that such efforts are not sufficient. But earlier in the post they talk about useful things that one could do in the AI governance field (lab coordination, independent IAEA-like authority), so I'm left confused about the authors' models of what's feasible and what's not.

The passage also makes me worried that the authors are, despite their encouragement of coordination and audits, skeptical of or even opposed to efforts to stop building dangerous AIs. (Perhaps this should have already been obvious from OpenAI pushing the capabilities frontier, but anyways.)

Comment by Loppukilpailija (jarviniemi) on Daniel Kokotajlo's Shortform · 2023-04-30T19:42:21.418Z · LW · GW

Regarding betting odds: are you aware of this post? It gives a betting algorithm that satisfies both of the following conditions:

  • Honesty: participants maximize their expected value by reporting their probabilities honestly.
  • Fairness: participants' (subjective) expected values are equal.

The solution is "the 'loser' pays the 'winner' the difference of their Brier scores, multiplied by some pre-determined constant C". This constant C puts an upper bound on the amount of money you can lose. (Ideally C should be fixed before bettors give their odds, because otherwise the honesty desideratum above could break, but I don't think that's a problem here.)
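The Brier-score payment rule described above can be sketched for a binary event as follows. The function names are illustrative assumptions; the Brier score of a report p is taken to be (p - outcome)^2 with outcome in {0, 1}.

```python
def brier(p, outcome):
    """Brier score of a probability report p for a binary outcome (0 or 1); lower is better."""
    return (p - outcome) ** 2

def payment_to_a(report_a, report_b, outcome, c=1.0):
    """Amount B pays A (negative means A pays B): C times the difference of Brier scores."""
    return c * (brier(report_b, outcome) - brier(report_a, outcome))

def expected_payment_to_a(belief, report_a, report_b, c=1.0):
    """Expected payment to A, evaluated under the probability `belief` that the event happens."""
    return (belief * payment_to_a(report_a, report_b, 1, c)
            + (1 - belief) * payment_to_a(report_a, report_b, 0, c))

p, q, c = 0.7, 0.4, 10.0
ev_a = expected_payment_to_a(p, p, q, c)   # A's subjective EV when both report honestly
ev_b = -expected_payment_to_a(q, p, q, c)  # B's subjective EV when both report honestly
print(ev_a, ev_b)
```

With honest reports p and q, each bettor's subjective expected value works out to C(p - q)^2, which is the fairness condition. Honesty follows because the Brier score is a proper scoring rule, and since a Brier score lies between 0 and 1, nobody can lose more than C.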

Comment by Loppukilpailija (jarviniemi) on Loppukilpailija's Shortform · 2023-04-30T19:34:49.351Z · LW · GW

On premature advice

Here's a pattern I've recognized - all examples are based on real events.

Scenario 1. Starting to exercise

Alice: "I've just started working out again. I've been doing blah for X minutes and then blah blah for Y minutes."

Bob: "You shouldn't exercise like that, you'll injure yourself. Here's what you should be doing instead..."

Result: Alice stops exercising.

Scenario 2. Starting to invest

Alice: "Everyone around me tells me that investing is a good idea, so I'm now going to invest in index funds."

Bob: "You better know what you are doing. Don't invest any money you cannot afford to lose, Past Performance Is No Guarantee of Future Results, also [speculation] so this might not be the best time to invest, also..."

Result: Alice doesn't invest any of her money anywhere.

Scenario 3. Buying lighting

Alice: "My current lighting is quite dim, I'm planning on buying more and better lamps."

Bob: "Lighting is complicated: you have to look at color temperatures and the color rendering index, make sure to have shades, also ideally you have colder lighting in the morning and warmer in the evening, and..."

Result: Alice doesn't improve her lighting.

I think this pattern, namely overwhelming a beginner with technical nuanced advice (that possibly was not even asked for), is bad, and Bobs shouldn't do that.

An obvious improvement is to not be as discouraging as Bob in the examples above, but it's still tricky to actually make things better instead of demotivating Alice.

When I'm Alice, I often just want to share something I've been thinking about recently, and maybe get some encouragement. Hearing Bob tell me how much I don't know doesn't make me go learn about the topic (that's a fabricated option), it makes me discouraged and possibly give up.

My memories of being Bob are not as easily accessible, but I can guess what it's like. Probably it's "yay, Alice is thinking about something I know about, I can help her!", sliding into "it's fun to talk about subjects I know about" all the way to "you fool, look how much more I know than you". 

What I think Bob should do, and what I'll do when encountering an Alice, is to be more supportive and perhaps encourage them to talk more about the thing they seem to want to talk about. 

Comment by Loppukilpailija (jarviniemi) on Contra Yudkowsky on Doom from Foom #2 · 2023-04-27T09:29:16.647Z · LW · GW

I feel like the post proves too much: it gives arguments for why foom is unlikely, but I don't see arguments which break the symmetry between "humans cannot foom relative to other animals" and "AI cannot foom relative to humans".* For example, the statements

brains are already reasonably pareto-efficient 


Intelligence requires/consumes compute in predictable ways, and progress is largely smooth.

seem irrelevant or false in light of the human-chimp example. (Are animal brains pareto-efficient? If not, I'm interested in what breaks the symmetry between humans and other animals. If yes, pareto-efficiency doesn't seem that useful for making predictions on capabilities/foom.)

*One way to resolve the situation is by denying that humans foomed (in a sense relevant for AI), but this is not the route taken in the post.

Separately, I disagree with many claims and the overall thrust in the discussion of AlphaZero.

Go is extremely simple [...] This means that the Go predictive capability of a NN model as a function of NN size completely flatlines at an extremely small size.

This seems unlikely to me, depending on what "completely flatlines" and "extremely small size" mean.

Games like Go or chess are far too small for a vast NN like the brain, so the vast bulk of its great computational power is wasted.

Go and chess being small/simple doesn't seem like the reason why ANNs are way better than brains there. Or, if it is, we should see the difference between ANNs and brains shrinking as the environment gets larger/more complex. This model doesn't seem to lead to good predictions, though: Dota 2 is a lot more complicated than Go and chess, and yet we have superhuman performance there. Or how complicated exactly does a task need to be before ANNs and brains are equally good?

(Perhaps relatedly: There seems to be an implicit assumption that AGI will be an LLM. "The AGI we actually have simply reproduces [cognitive biases], because we train AI on human thoughts". This is not obvious to me - what happened to RL?)

On a higher level, the whole train of reasoning reads like a just-so story to me: "We have obtained superhuman performance in Go, but this is only because of training on vastly more data and the environment being simple. As the task gets more complicated the brain becomes more competitive. And indeed, LLMs are close to but not quite human intelligences!". I don't see this as a particularly good fit to the datapoints, or how this hypothesis is likelier than "There is room above human capabilities in ~every task, and we have achieved superhuman abilities in some tasks but not others (yet)".

Comment by Loppukilpailija (jarviniemi) on Arguments about fast takeoff · 2023-04-06T14:36:33.802Z · LW · GW

My thoughts on the "Humans vs. chimps" section (which I found confusing/unconvincing):

Chimpanzees have brains only ~3x smaller than humans, but are much worse at making technology (or doing science, or accumulating culture…). If evolution were selecting primarily or in large part for technological aptitude, then the difference between chimps and humans would suggest that tripling compute and doing a tiny bit of additional fine-tuning can radically expand power, undermining the continuous change story.

But chimp evolution is not primarily selecting for making and using technology, for doing science, or for facilitating cultural accumulation.

For me the main takeaway of the human vs. chimp story is information about the structure of mind space, namely that there are discontinuities in terms of real world consequences.

Evolution changes continuously on the narrow metric it is optimizing, but can change extremely rapidly on other metrics. For human technology, features of the technology that aren’t being optimized change rapidly all the time. When humans build AI, they will be optimizing for usefulness, and so progress in usefulness is much more likely to be linear.

I don't see how "humans are optimizing AI systems for usefulness" undermines the point about mind space - if there are discontinuities in capabilities / resulting consequences, I don't see how optimizing for capabilities / consequences makes things any more continuous. 

Also, there is a difference between "usefulness" and (say) "capability of causing human extinction", just as there is a difference between "inclusive genetic fitness" and "intelligence". Cf. it being hard to get LLMs to do what you want them to do, and the difference in publicity* between ChatGPT and other GPT-3 models being more about usability and UI than about the underlying capabilities.

*Publicity is a different thing from usefulness. Lacking a more narrow definition of usefulness, I still would argue that to many people ChatGPT is more useful than other GPT models.

Comment by Loppukilpailija (jarviniemi) on Against an AI Research Moratorium · 2023-03-31T22:23:54.255Z · LW · GW

Our planet is full of groups of power-seekers competing against each other. Each one of them could cooperate (join in the moratorium) defect (publicly refuse) or stealth-defect (proclaim that they're cooperating while stealthily defecting). The call for a moratorium amounts to saying to every one of those groups "you should choose to lose power relative to those who stealth-defect". It doesn't take much decision theory to predict that the result will be a covert arms race conducted in a climate of fear by the most secretive and paranoid among the power groups.


There seems to be an underlying assumption that the number of stealth-defecting AI labs doing GPT-4-level training runs is non-zero. This is a non-trivial claim and I'm not sure I agree. My impression is that there are few AI labs world-wide that are capable of training such models in the next 6-12 months and we more or less know what they are.

I also disagree with the framing of stealth-defection as being a relatively trivial operation which is better than cooperation, mostly because training such models takes a lot of people (just look at pages 15-17 in the GPT-4 paper!) and thus the probability of someone whistleblowing is large.

And for what it's worth, I would really have hoped that such things are discussed in a post that starts with a phrase of the form "All the smart people [...] seem to have unaccountably lost their ability to do elementary game theory".

Comment by Loppukilpailija (jarviniemi) on Loppukilpailija's Shortform · 2023-03-23T10:59:28.548Z · LW · GW

Inspired by the "reward chisels cognition into the agent's network" framing from Reward is not the optimization target, I thought: is reward necessarily a fine enough tool? More elaborately: if you want the model to behave in a specific way or to have certain internal properties, can you achieve this simply by a suitable choice of the reward function?

I looked at two toy cases, namely Q-learning and training a neural network (the latter of which is not actually reinforcement learning but supervised learning). The answers were "yep, suitable reward/loss (and datapoints in the case of supervised learning) are enough".

I was hoping for this not to be the case, as that would have been more interesting (imagine if there were fundamental limitations to the reward/loss paradigm!), but anyways. I now expect that also in more complicated situations reward/loss are, in principle, enough.

Example 1: Q-learning. You have a set S of states and a set A of actions. Given a target policy π: S → A, can you necessarily choose a reward function R: S × A → ℝ such that, training for long enough* with Q-learning (with positive learning rate and discount factor), the action that maximizes reward is the one given by the target policy: argmax_a Q(s, a) = π(s) for all s in S?

*and assuming we visit all of the states in S many times

The answer is yes. Simply reward the behavior you want to see: let R(s, a) = 1 if a = π(s) and R(s, a) = 0 otherwise.

(In fact, one can more strongly choose, for any target value function Q': S × A → ℝ, a reward function R such that the values Q(s, a) in Q-learning converge in the limit to Q'(s, a). So not only can you force certain behavior out of the model, you can also choose the internals.)
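A minimal sketch of Example 1 with tabular Q-learning, under stated assumptions: the environment has 5 states and 3 actions with a made-up deterministic transition table, state-action pairs are sampled uniformly at random (so every pair is visited many times), and the reward is 1 for the target action and 0 otherwise. All names and numbers are hypothetical.

```python
import random

def q_learning(target_policy, n_actions, next_state, gamma=0.9, alpha=0.5,
               steps=20000, seed=0):
    """Tabular Q-learning where the reward is 1 exactly when the target action is taken."""
    rng = random.Random(seed)
    n_states = len(target_policy)
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(steps):
        # Sample a state-action pair uniformly so that all pairs are visited often.
        s = rng.randrange(n_states)
        a = rng.randrange(n_actions)
        reward = 1.0 if a == target_policy[s] else 0.0
        s_next = next_state[s][a]
        td_target = reward + gamma * max(q[s_next])
        q[s][a] += alpha * (td_target - q[s][a])
    return q

# Hypothetical target policy and deterministic transitions, for illustration only.
target_policy = [2, 0, 1, 1, 0]
next_state = [[(s + a + 1) % 5 for a in range(3)] for s in range(5)]

q = q_learning(target_policy, 3, next_state)
greedy_policy = [max(range(3), key=lambda a: q[s][a]) for s in range(5)]
print(greedy_policy)
```

In this setup the optimal value of every state is 1/(1 - gamma), reached only by following the target policy, so the target action beats every other action by exactly 1 in the limit and the greedy policy recovers the target.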

Example 2: Neural network.

Say you have a neural network f with n tunable weights w = (w_1, ..., w_n). Can you, by suitable input-output pairs and choices of the learning rate, modify the weights of the net so that they are (approximately) equal to w' = (w'_1, ..., w'_n)?

(I'm assuming here that we simply update the weights after each data point, instead of doing SGD or something. The choice of loss function is not very relevant, take e.g. square-error.)

The following sketch convinces me that the answer is positive:

Choose n random input-output pairs (x_i, y_i). The gradients g_1, ..., g_n of the loss with respect to the weights at these pairs are almost certainly linearly independent. Hence, some linear combination c_1 g_1 + ... + c_n g_n of them equals w - w'. Now, for small ε > 0, running back-propagation on the pair (x_i, y_i) with learning rate ε c_i for all i = 1, ..., n gives you an update approximately in the direction of w' - w (gradient descent subtracts the gradient). Rinse and repeat.
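The sketch can be tried out on a deliberately tiny stand-in: a linear model f(x) = w_1 x_1 + w_2 x_2 with squared-error loss, two fixed input-output pairs, and a 2x2 linear solve by Cramer's rule. This is an illustration under those assumptions, not a general neural network; the pairs, targets and function names are hypothetical. The solved coefficients can be negative, so the procedure implicitly uses negative learning rates.

```python
def grad(w, x, y):
    """Gradient of the squared error (w.x - y)^2 with respect to w."""
    residual = w[0] * x[0] + w[1] * x[1] - y
    return (2 * residual * x[0], 2 * residual * x[1])

def steer_weights(w, target, pairs, eps=0.1, rounds=300):
    """Move w toward target using only per-example gradient steps with chosen learning rates."""
    w = list(w)
    (x1, y1), (x2, y2) = pairs
    for _ in range(rounds):
        g1, g2 = grad(w, x1, y1), grad(w, x2, y2)
        d = (w[0] - target[0], w[1] - target[1])  # w - w'
        # Solve c1 * g1 + c2 * g2 = w - w' by Cramer's rule.
        det = g1[0] * g2[1] - g2[0] * g1[1]
        c1 = (d[0] * g2[1] - g2[0] * d[1]) / det
        c2 = (g1[0] * d[1] - d[0] * g1[1]) / det
        # One gradient-descent step per data point, with learning rate eps * c_i.
        for (x, y), c in (((x1, y1), c1), ((x2, y2), c2)):
            g = grad(w, x, y)  # recomputed at the current weights, as back-propagation would
            w[0] -= eps * c * g[0]
            w[1] -= eps * c * g[1]
    return w

pairs = [((1.0, 0.0), 5.0), ((1.0, 1.0), 5.0)]
final_w = steer_weights((0.0, 0.0), (1.0, -2.0), pairs)
print(final_w)
```

Each round moves the weights a fraction eps of the remaining distance toward the target, up to a second-order error that shrinks with the distance, which is why repeating the step converges here.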

Comment by Loppukilpailija (jarviniemi) on Open & Welcome Thread — February 2023 · 2023-02-17T10:01:51.746Z · LW · GW

Feature suggestion: Allow one to sort a user's comments by the number of votes.

Context: I saw a comment by Paul Christiano, and realized that probably a significant portion of the views expressed by a person lie in comments, not top-level posts. However, many people (such as Christiano) have written a lot of comments, so sorting them would allow one to find more valuable comments more easily.

Comment by Loppukilpailija (jarviniemi) on Petition - Unplug The Evil AI Right Now · 2023-02-15T22:50:42.371Z · LW · GW

Note that you can still retract your signature for 30 days after signing. See here:

Comment by Loppukilpailija (jarviniemi) on Is InstructGPT Following Instructions in Other Languages Surprising? · 2023-02-14T13:50:36.200Z · LW · GW

Ah, I misunderstood the content of the original tweet - I didn't register that the model indeed had access to lots of data in other languages as well. In retrospect I should have been way more shocked if this wasn't the case. Thanks.

I then agree that it's not too surprising that the instruction-following behavior is not dependent on language, though it's certainly interesting. (I agree with Habryka's response below.)

Comment by Loppukilpailija (jarviniemi) on Is InstructGPT Following Instructions in Other Languages Surprising? · 2023-02-14T11:40:27.910Z · LW · GW

I feel like this answer glosses over the fact that the encoding changes. Surely you can find some encodings of instructions such that LLMs cannot follow instructions in that encoding. So the question lies in why learning the English encoding also allows the model to learn (say) German encodings.

Comment by Loppukilpailija (jarviniemi) on Language models can generate superior text compared to their input · 2023-01-17T14:03:02.139Z · LW · GW

The fair-goers, having knowledge of oxen, had no bias in their guesses


[EDIT: I read this as "having no knowledge of oxen" instead of "having knowledge of oxen" - is this what you meant? The comment seems relevant nevertheless.]

This does not follow: It is entirely possible that the fair-goers had no specific domain knowledge of oxen, while still having biases arising from domain-general reasoning. And indeed, they probably knew something about oxen -- from Jaynes' Probability Theory:

The absurdity of the conclusion [that polling a billion people tells the height of China's emperor with accuracy 0.03 mm] tells us rather forcefully that the √N rule is not always valid, even when the separate data values are causally independent; it is essential that they be logically independent. In this case, we know that the vast majority of the inhabitants of China have never seen the Emperor; yet they have been discussing the Emperor among themselves, and some kind of mental image of him has evolved as folklore. Then, knowledge of the answer given by one does tell us something about the answer likely to be given by another, so they are not logically independent. Indeed, folklore has almost surely generated a systematic error, which survives the averaging; thus the above estimate would tell us something about the folklore, but almost nothing about the Emperor.
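A small worked calculation of the point in the quote, under a simple assumed model: each guess equals the truth plus a bias shared by everyone plus independent noise. The root-mean-square error of the average of N guesses is then sqrt(bias^2 + sd^2 / N), so averaging removes the noise but not the shared bias. The model and the numbers are illustrative assumptions, not taken from Jaynes.

```python
import math

def rms_error_of_mean(n, noise_sd, shared_bias):
    """RMS error of the average of n guesses: truth + shared_bias + independent noise."""
    return math.sqrt(shared_bias**2 + noise_sd**2 / n)

for n in (1, 100, 10**4, 10**9):
    unbiased = rms_error_of_mean(n, noise_sd=1.0, shared_bias=0.0)
    biased = rms_error_of_mean(n, noise_sd=1.0, shared_bias=0.1)
    print(n, unbiased, biased)
```

Without a shared bias the error keeps shrinking like 1/√N, while with a shared bias of 0.1 it levels off near 0.1 no matter how many people are polled.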

Comment by Loppukilpailija (jarviniemi) on Existential AI Safety is NOT separate from near-term applications · 2022-12-13T18:41:10.569Z · LW · GW

Minor suggestion: I would remove the caps from the title. Reason: I saw this linked below Christiano's post, and my snap reaction was that the post is [angry knee-jerk response to someone you disagree with] rather than [thoughtful discussion and disagreement]. Only after introspection did I read this post.

Comment by Loppukilpailija (jarviniemi) on Does a LLM have a utility function? · 2022-12-09T18:10:36.475Z · LW · GW

I found janus's post Simulators to address this question very well. Much of AGI discussion revolves around agentic AIs (see the section Agentic GPT for discussion of this), but this does not model large language models very well. janus suggests that one should instead think of LLMs such as GPT-3 as "simulators". Simulators are not very agentic themselves or well described as having a utility function, though they may create simulacra that are agentic (e.g. GPT-3 writes a story where the main character is agentic).

Comment by Loppukilpailija (jarviniemi) on The Goldbach conjecture is probably correct; so was Fermat's last theorem · 2022-12-07T13:54:47.035Z · LW · GW

A couple of examples from quadratic residue patterns modulo primes:

Example 1. Let p be a large prime. How many elements x are there such that both x and x + 1 are quadratic residues?

Since half of the elements mod p are quadratic residues and the events "x is a QR" and "x + 1 is a QR" look like they are independent, a reasonable guess is p/4. This is the correct main term, but what about the error? A natural square-root error term is not right: one can show that the error is O(1), the error depending only on whether p is 1 or 3 mod 4. (The proof is by elementary manipulations with the Legendre symbol, see here. So there's hidden structure that makes the error smaller than what a naive randomness heuristic suggests.)

Example 2. Let p be a large prime. How many elements x are such that all of x, x + 1 and x + 2 are quadratic residues?

Again, the obvious guess for the main term is correct (there are roughly p/8 such x), so consider the error term. The error is not O(1) this time! The next guess is a square-root error term. Perhaps as p ranges over the primes, the error term is (after suitable normalization) normally distributed (as is motivated by the central limit theorems)? This is not correct either!

The error is in fact bounded in absolute value by a constant times √p, following from a bound on the number of points on elliptic curves modulo p. And for the distribution of the error, if one normalizes the error by dividing by a suitable multiple of √p (so that the resulting value is in [-1, 1]), the distribution behaves like cos θ, where θ is uniformly distributed on [0, π] as p ranges over the primes (see here). This is a deep result, which is not easy to motivate in a couple of sentences, but again there's hidden structure that the naive randomness heuristic does not account for.

(And no, one does not get normal distribution for longer streaks of quadratic residues either.) 
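Both examples are easy to check by brute force for small primes. The sketch below counts runs of consecutive nonzero quadratic residues using Euler's criterion. The helper names and the particular primes are illustrative choices. The closed form for pairs, (p - 5)/4 when p is 1 mod 4 and (p - 3)/4 when p is 3 mod 4, is the classical count of consecutive residue pairs and is stated here as background rather than taken from the comment.

```python
def is_qr(x, p):
    """True if x is a nonzero quadratic residue mod the odd prime p (Euler's criterion)."""
    return pow(x, (p - 1) // 2, p) == 1

def count_qr_runs(p, length):
    """Number of x mod p such that x, x+1, ..., x+length-1 are all nonzero quadratic residues."""
    return sum(
        all(is_qr((x + i) % p, p) for i in range(length))
        for x in range(p)
    )

for p in (101, 103, 1009, 1019):
    pairs = count_qr_runs(p, 2)
    triples = count_qr_runs(p, 3)
    print(p, p % 4, pairs - p / 4, triples - p / 8)
```

The printed error for pairs stays at a fixed small value determined by p mod 4, whereas the error for triples varies from prime to prime on the scale of √p.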

Comment by Loppukilpailija (jarviniemi) on How I Formed My Own Views About AI Safety · 2022-12-02T19:25:22.111Z · LW · GW

Truth-tracking - having an impact is hard! It’s really important to have true beliefs, and the best way to find them is by trying hard to form your own views and ensuring they correlate with truth. It’s easy to get deferring wrong if you trust the wrong people.


There's another interpretation of "truth-tracking" where forming an inside view is important: It's easier to notice when you are wrong. In other words, even if you defer to the right person, it might be hard to notice when they are wrong (unless you have a very deep understanding of their views).

This seems like a more important reason than the "deferring to the wrong people" issue: new progress in AI and on the theoretical side calls for continuously updating models, so you want to reduce friction on that.

Comment by Loppukilpailija (jarviniemi) on Where I agree and disagree with Eliezer · 2022-06-21T19:53:15.212Z · LW · GW

I found this post very useful! I went through the list and wrote down my thoughts on the points, posting them here in case they are of interest to others.


Some high-level comments first.

Disclaimer: I'm not senior enough to have consistent inside-views. I wrote up a similar list a few days ago in response to Yudkowsky's post, and some of my opinions have changed.

In particular, I note that I have been biased to agree with Yudkowsky for reasons unrelated to actual validity of arguments, such as "I have read more texts by him than any other single person".

So despite nitpicking about some points, the post was very useful, causing me to update my views on some issues towards Christiano's.


I agree more or less with all of the points where Christiano agrees with Yudkowsky. Point 7 seems to put more weight on "humans let AI systems control killer robots and this is relevant" than I would.

On the disagreements:

1. I appreciate the point and agree that much of the issue here is institutional. Upon consideration I've updated towards "traditional R&D is a useful source of information", though I feel like this plays a smaller part than Christiano thinks. I believe this stems from me thinking "we really need theoretical foundations and guarantees when entering the superhuman level".

2. I realize that I have confused "a powerful enough AI system could build nanotechnology" with "in practice a major threat scenario is nanotechnology". I have seen statements of the first type (though I am unable to evaluate this issue myself), and fewer of the second type. I agree with Christiano that this is far from the likeliest scenario in practice, and that it should rather be thought of only as an explanation for why strategies of the form "keep the AI isolated from Internet/people/etc." fail.

3. I am confused by "how impressive this looks" being used as a proxy for "how dangerous this is". Certainly nanotechnology is both impressive and dangerous, but I am wary of generalizing from this argument.

4. Boils down to takeoff speeds. In a weak form "AI systems will get gradually better at improving themselves" is likely true.

5. The terminology "pivotal act" is quite suggestive indeed, pointing at a cluster of solutions that is not the whole set of solutions. It is not at all obvious to me that most of the worlds where we survive arise from paths which one associates with the phrase "pivotal act".

6. The point made seems valuable. No opinion here, have to think about this more.

7. On "we are ... approaching AI systems that can meaningfully accelerate progress by generating ideas, recognizing problems for those ideas...", my feeling is that current systems lack a good world model, this restricts alignment work, and progress on world models is progress on capabilities. I don't see the relevance to recursive self-improvement - in the worst cases self-improvement is faster than systems with humans in the loop.

8. No opinion here.

9. No opinion here.

10. No opinion here, should read the debates.

11. "based largely on his own experiences working on the problem" is a bit unfair - there of course are arguments for why alignment is hard, and one should focus on the validity of those arguments. (This post of course does that, which is good.)

12. No opinion here, but appreciate the point.

13. No opinion here (too meta for me to have an informed view), but appreciate the point.

14. Personal experience: it was only after I read the recent post on the interpretability tech tree that I understood how interpretability could lead to an actual change in existential risk.

15. I agree that, taken literally, point 11 in the list of lethalities does not hold up - I don't see any particularly strong reason why we couldn't build narrow systems aimed at building nanotechnology. I understood part of the point in 11 to be that general systems can do things which are far out of distribution, and you cannot hope that they are aligned there, which seems much more defensible.

16. Good to point out that you can study deceptive behavior in weaker systems as well (not to say that new problems couldn't appear later on).

17. A fair point.

18. No opinion here, don't have an informed view.

19. I agree. It seems that the argument given by Yudkowsky is "a sufficiently strong AGI could write down a 'pivotal act' which looks good to humans but which is actually bad", which is true, but this doesn't imply "it is not possible to build an AI that outputs a pivotal act which is good and which humans couldn't have thought of". (Namely, if you can make the AI "smart-but-not-too-smart" in some sense.)

20. No opinion here.

21. Fair point.

22. No opinion here, but again appreciate the point. (The thought about "AI learns a lot from humans because of the feedback loop" in particular was interesting and new to me.)

23. Agree with the first sentence.

24. No opinion here.

25. This is a great point: it is novel (to me), relevant and seems correct. A slight update towards optimism for me.

26. I'm also confused about what kind of plan we should have (in a way which is distinct from "we should have more progress on alignment").

27. No opinion here.

Comment by Loppukilpailija (jarviniemi) on AGI Ruin: A List of Lethalities · 2022-06-08T23:03:10.345Z · LW · GW

Here is my honest reaction as another data point. (Well done by the parent for taking the initiative!)

Context: Got introduced to this field around a year ago. Not an expert.

My honest reaction is rather worried as well (to put it mildly).

1. I agree with this. My impression is that in many tasks we currently require a lot more data than humans, but I do not see any reason to expect that it will always be so.

2. I broadly agree with this. I am sympathetic to people who would like to see more concrete stories about how exactly an AGI would take over the world (while there are some already, more wouldn't hurt). Meanwhile,

- I believe that if effort is put into inventing such takeover scenarios, one should expect to come up with quite a few of them. Hence, update already.

- I haven't looked into nanobots myself, so no inside view there, but my prior is definitely on "there are lots of (causally) powerful technologies we haven't invented yet".

- The AI box experiment really feels like strong empirical evidence for the bootstrapping argument.

3. I agree with this as stated. I do wonder, though, whether we will get any warning shots, where we operate at a semi-dangerous level and fail. This seems to reduce to slow vs. fast takeoff. (I don't have a consistent opinion on that.)

4. Agree that there is a time limit. And indeed, recognition of the issue and cooperation from the relevant actors seems non-ideal.

5. Agree.

6. I'm not sure here - I agree that we should avoid the situation where we have multiple AGIs. If "pivotal act" is defined as an act which results in this outcome, then there is agreement, but as someone pointed out, it might be that the pivotal act is something which doesn't fit the mental picture one associates with the words "pivotal act".

7. I notice I am confused here: I'm not sure what "pivotal weak act" means, or what "something weak enough with an AGI to be *passively safe*" means. I agree with "no one knows of any pivotal act you could do with just current SOTA AI". I don't have good intuitions about the space of pivotal actions - I haven't thought about it.

8. I interpret "problems we want an AI to solve" to mean problems relevant for pivotal acts. In this case, see above - I don't have intuitions about pivotal acts.

9. See above.

10. Broadly agree.

11. Again, I don't know much about pivotal acts. (It is mentioned that "Pivotal weak acts like this aren't known, and not for want of people looking for them." - have I missed some big projects on pivotal acts?)

12. Agree.

13. Agree.

14. Agree. The discontinuity / "treacherous turn" seems obvious to me when thought about from first principles. The skeptic voice in my head says that nothing like that has happened in practice (to my knowledge), but that really does not assure me.

15. Broadly agree, though I lack good examples for the concept "alignment-required invariants". My best guess: there is interpretability research on neural networks, and we have some non-trivial understanding there. That might turn out to be irrelevant in the case of a great new idea for capabilities.

16. I agree that the concept of inner alignment is important. There is an empirical verification for it. I am unsure about how big of a problem this will be in practice. I do appreciate the point about evolution.

17. I like this formulation (quite crisp); I don't think I've seen it anywhere before. To me, it seems like an interesting idea to try to come up with ways of getting inner properties into systems.

18. Agree.

19. Agree.

20. Agree, except I don't understand what "If you perfectly learn and perfectly maximize the referent of rewards assigned by human operators, that kills them." means.

21. Not sure I get the central result, but I get the idea that in capabilities you have feedback loops in a different way from utility functions.

22. Agree.

23. A good, crisp formulation. Agree.

24. A good distinction, sure. In other words "let the AGI optimize something, no strings attached (and choose that "something" very carefully)" vs. "try to control/restrict the AGI". I'm wondering whether there are any alternatives.

25. "We've got no idea" seems to me like a bit of an exaggeration, but I agree with the latter sentence.

26. Yep.

27. Yep, an instance of Goodhart's law.

28. Yep.

29. I agree that this is the generic case - if you take a complex action sequence of an AGI at random, it is almost surely uninterpretable by humans. Not sure what would happen if you optimized for plans in which humans are confident they understand the consequences. Sure, we have to fight against Goodhart's law, and I do think that against sufficiently powerful cognitive systems our chances would be slim, but I'm not sure that one couldn't extract enough information to perform a pivotal act. Failure at AI boxing does seem like a major bottleneck, though.

30. I agree up to "it knows ... that some action sequence results in the world we want". I also agree that if we knew how an AI would behave in advance, it would be less intelligent than a human. I feel like there is a gap in moving from this to the claim that there is __no__ pivotal output of an AGI. If I am stuck in a maze and build an AGI to help me find the way out, I cannot anticipate what exact path it will give me, but I can check whether the path leads out or not. So I think the general claim "there is no pivotal output ... that is humanly checkable" is not properly justified here. I do feel like this would be the generic case, though, namely that the AGI could convince us of a plan and sneak in unintended consequences.

31. Agree. Seems conceptually related to 17: 17 is about affecting the inner properties of the system, 31 is about inspecting the inner properties.

32. Interesting point I haven't seen elsewhere, namely "Words are not an AGI-complete data representation in its native style". Not sure if it makes sense to give "true/false" status to the claim, but it pushes me a non-zero amount in the direction of "alignment is hard".

33. Agree. This is a statement which I could see many educated people nodding at, but which at least I find quite hard to feel on a gut level. (The Sequences contain helpful material on this, and apparently reading the right science fiction books would also help.)

34. Agree.

35. Agree. I guess there is also the scenario where one AGI has a decisive advantage over the other, but the outcome is the same: you cannot keep the AGIs in line by pitting them against each other.

36. Agree with the bolded part, the AI-box experiment is more than enough evidence for this.

37. Agree with "in the case of AGI safety, it is really important to have conservation of expected evidence about the difficulty of alignment".

38. It does seem to me that "AGI safety" is quite a small subfield of "AI safety", or you can see these as separate fields. I agree that the incentives are not in our/humanity's favor.

39. I like this paragraph. I could nitpick about how the point of community building is that not everyone has to figure things out from the null string, but on the other hand I understand the view expressed here.

40. I have no clear view about how different the skills required for alignment are in contrast to more usual cognitively demanding work (other than that it is, well, hard). (I realize that I am biased - I found myself agreeing with "AGI risk is real" without much friction, but there are definitely many people who do not come to this conclusion.)

41. No comment.

42. I associate "There's no plan" with the field being in a preparadigmatic state. I agree that it would be very much preferable if this weren't the state of affairs, so that we could be in a position to design a plan.

43. This part hit home: "not an uncomfortable shrug and 'How can you be sure that will happen' / 'There's no way you could be sure of that now, we'll have to wait on experimental evidence.'" I am sad that the Standard Response to AGI risk is "AI won't be intelligent enough to do that". (Not to say that there aren't stronger counterarguments.)

Comment by Loppukilpailija (jarviniemi) on [RETRACTED] It's time for EA leadership to pull the short-timelines fire alarm. · 2022-04-08T17:41:39.013Z · LW · GW

As a non-expert, I'm confused about what exactly was so surprising in these works that it causes a strong update. "The intersection of many independent, semi-likely events is unlikely" could be one answer, but I'm wondering whether there is more to it. In particular, I'm confused about why the data is evidence for a fast take-off as opposed to a slow one.