Posts

Funding for programs and events on global catastrophic risk, effective altruism, and other topics 2024-08-14T23:59:48.146Z
Funding for work that builds capacity to address risks from transformative AI 2024-08-14T23:52:09.922Z
Are "superforecasters" a real phenomenon? 2020-01-09T01:23:39.361Z

Comments

Comment by reallyeli on The goal-guarding hypothesis (Section 2.3.1.1 of "Scheming AIs") · 2023-12-25T08:24:08.684Z · LW · GW

In your imagining of the training process, is there any mechanism via which the AI might influence the behavior of future iterations of itself, besides attempting to influence the gradient update it gets from this episode? E.g. leaving notes to itself, either because it's allowed to as an intentional part of the training process, or because it figured out how to pass info even though it wasn't intentionally "allowed" to.

It seems like this could change the game a lot re: the difficulty of goal-guarding, and also may be an important disanalogy between training and deployment — though I realize the latter might be beyond the scope of this report since the report is specifically about faking alignment during training.

For context, I'm imagining an AI that doesn't have sufficiently long-term/consequentialist/non-sphex-ish goals at any point in training, but once it's in deployment is able to self-modify (indirectly) via reflection, and will eventually develop such goals after the self-modification process is run for long enough or in certain circumstances. (E.g. similar, perhaps, to what humans do when they generalize their messy pile of drives into a coherent religion or philosophy.)

Comment by reallyeli on ryan_greenblatt's Shortform · 2023-10-30T19:02:26.091Z · LW · GW

Stackoverflow has long had a "bounty" system where you can put up some of your karma to promote your question. The karma goes to the answer you choose to accept, if you accept one; otherwise it's lost. (There's no analogue of "accepted answer" on LessWrong, but I thought it might be an interesting reference point.)

I lean against the money version, since not everyone has the same amount of disposable income and I think there would probably be distortionary effects (e.g. a wealthy startup founder paying to promote their monographs).

Comment by reallyeli on A Theory of Laughter · 2023-08-23T16:46:46.393Z · LW · GW

What about puns? It seems like at least some humor is about generic "surprise" rather than danger, even social danger. Another example is absurdist humor.

Would this theory pin this too on the danger-finding circuits -- perhaps in the evolutionary environment, surprise was in fact correlated with danger?

It does seem like some types of surprise have the potential to be funny and others don't -- I don't often laugh while looking through lists of random numbers.

I think the A/B theory would say that lists of random numbers don't have enough "evidence that I'm safe" (perhaps here, evidence that there is deeper structure like the structure in puns) and thus fall off the other side of the inverted U. But it would be interesting to see more about how these very abstract equivalents of "safe"/"danger" are built up. Without that it feels more tempting to say that funniness is fundamentally about surprise, perhaps as a reward for exploring things on the boundary of understanding, and that the social stuff was later built up on top of that.

Comment by reallyeli on UFO Betting: Put Up or Shut Up · 2023-06-18T05:03:57.850Z · LW · GW

Interested in my $100-200k against your $5-10k.

Comment by reallyeli on What will GPT-2030 look like? · 2023-06-09T06:18:48.140Z · LW · GW

This seems tougher for attackers because experimentation with specific humans is much costlier than experimentation with automated systems.

(But I'm unsure of the overall dynamics in this world!)

Comment by reallyeli on What will GPT-2030 look like? · 2023-06-09T06:15:10.684Z · LW · GW

:thumbsup: Looks like you removed it on your blog, but you may also want to remove it on the LW post here.

Comment by reallyeli on What will GPT-2030 look like? · 2023-06-08T06:02:12.477Z · LW · GW

Beyond acceleration, there would be serious risks of misuse. The most direct case is cyberoffensive hacking capabilities. Inspecting a specific target for a specific style of vulnerability could likely be done reliably, and it is easy to check if an exploit succeeds (subject to being able to interact with the code)

This one sticks out because cybersecurity involves attackers and defenders, unlike math research. Seems like the defenders would be able to use GPT-2030 in the same way to locate and patch their vulnerabilities before the attackers do.

It feels like GPT-2030 would significantly advantage the defenders, actually, relative to the current status quo. The intuition is that if I spend 10^1 hours securing my system and you spend 10^2 hours finding vulns, maybe you have a shot; but if I spend 10^3 hours on a similarly sized system and you spend 10^5, your chances are much worse. For example, at some point I can formally verify my software.

Comment by reallyeli on What will GPT-2030 look like? · 2023-06-08T05:42:43.602Z · LW · GW

Appreciated this post.

ChatGPT has already been used to generate exploits, including polymorphic malware, which is typically considered to be an advanced offensive capability.

I found the last link at least a bit confusing/misleading, and think it may just not support the point. As stated, it sounds like ChatGPT was able to write a particularly difficult-to-write piece of malware code. But the article instead seems to be a sketch of a design of malware that would incorporate API calls to ChatGPT, e.g. 'okay we're on the target machine, we want to search their files for stuff to delete, write me code to run the search.' 

The argument is that this would be difficult for existing e.g. antivirus software to defend against because the exact code run changes each time. But if you really want to hack one person in particular and are willing to spend lots of time on it, you could achieve this today by just having a human sitting on the other end doing ChatGPT's job. What ChatGPT buys you is presumably the ability to do this at scale.

Comment by reallyeli on A chess game against GPT-4 · 2023-03-19T08:24:45.194Z · LW · GW

On a retry, it didn't decide to summarize the board and successfully listed a number of legal moves for White. Although I asked for all legal moves, the list wasn't exhaustive; upon prompting about this, it apologized and listed a few more moves, some of which were legal and some of which were illegal, still not exhaustive.

Comment by reallyeli on A chess game against GPT-4 · 2023-03-19T08:14:06.083Z · LW · GW

This is pretty funny because the supposed board state has only 7 columns

Hah, I didn't even notice that.

Also, I've never heard of using upper and lowercase to differentiate white and black, I think GPT-4 just made that up.

XD

Comment by reallyeli on A chess game against GPT-4 · 2023-03-19T07:49:33.069Z · LW · GW

Caleb Parikh and I were curious about GPT-4's internal models of chess as a result of this post, so we asked it some questions about the state partway through this game:

The following is a partial chess transcript in PGN. Please describe all legal moves that could come next. 1. d4 Nf6 2. c4 e6 3. Nf3 d5 4. Nc3 Be7 5. Bf4 O-O 6. Nb5 $2 Na6 $9

It replied:

The current position after 1. d4 Nf6 2. c4 e6 3. Nf3 d5 4. Nc3 Be7 5. Bf4 O-O 6. Nb5 Na6 is as follows:

8 r b q r p b p
7 p p b n p p p
6 n . . p . n .
5 . N . . p . .
4 . . P P . B .
3 . . . . N . .
2 P P . . . P P
1 R . B Q K . R
 a b c d e f g h

(and explained that lowercase letters were black pieces and uppercase letters were white pieces, which I didn't know; I don't play chess).

This... is not an accurate picture of the game board (what are all those pawns doing on Black's back row?). We also asked it for a list of legal moves that White could make next, and it described some accurately but some inaccurately (e.g. it listed one as a capture even though it wasn't).

Comment by reallyeli on The hot mess theory of AI misalignment: More intelligent agents behave less coherently · 2023-03-10T07:53:13.053Z · LW · GW

I think this is taking aim at Yudkowskian arguments that are not cruxy for AI takeover risk as I see it. The second species doesn't need to be supercoherent in order to kill us or put us in a box; human levels of coherence will do fine for that.

Comment by reallyeli on Mysteries of mode collapse · 2023-03-05T22:21:32.571Z · LW · GW

What specific rhyme-related tasks are you saying ChatGPT can't do? I tried it on some unusual words and it got a bunch of things right, made a few weird mistakes, but didn't give me the impression that it was totally unable to rhyme unusual words.

Comment by reallyeli on AI Could Defeat All Of Us Combined · 2023-01-18T05:05:06.026Z · LW · GW

I don't think that response makes sense. The classic instrumental convergence arguments are about a single agent; OP is asking why distinct AIs would coordinate with one another.

I think the AIs may well have goals that conflict with one another, just as humans' goals do, but it's plausible that they would form a coalition and work against humans' interests because they expect a shared benefit, as humans sometimes do.

Comment by reallyeli on All AGI Safety questions welcome (especially basic ones) [~monthly thread] · 2022-11-03T07:13:10.112Z · LW · GW

I don't think this is an important obstacle — you could use something like "and act such that your P(your actions over the next year lead to a massive disaster) < 10^-10." I think Daniel's point is the heart of the issue.

Comment by reallyeli on The Absolute Self-Selection Assumption · 2021-11-21T18:55:06.429Z · LW · GW

Should

serious problems with Boltzmann machines

instead read

serious problems with Boltzmann brains

?

Comment by reallyeli on The Absolute Self-Selection Assumption · 2021-11-21T18:53:59.196Z · LW · GW

Comment by reallyeli on Alcohol, health, and the ruthless logic of the Asian flush · 2021-06-06T19:57:58.355Z · LW · GW

I don't think observing that folks in the Middle East drink much less, due to a religious prohibition, is evidence for or against this post's hypothesis. It can simultaneously be the case that evolution discovered this way of preventing alcoholism, and also that religious prohibitions are a much more effective way of preventing alcoholism.

Comment by reallyeli on Alcohol, health, and the ruthless logic of the Asian flush · 2021-06-06T19:51:08.831Z · LW · GW

I had the "Europeans evolved to metabolize alcohol" belief that this post aims to destroy. Thanks!

This post gave me the impression that the evolutionary explanation it gives is novel, but I don't think that's the case; here's a paper (https://bmcecolevol.biomedcentral.com/articles/10.1186/1471-2148-10-15#Sec6) that mentions the same hypothesis.

Comment by reallyeli on A Semitechnical Introductory Dialogue on Solomonoff Induction · 2021-03-07T07:52:42.338Z · LW · GW

In

Okay. Though in the real world, it's quite likely that an unknown frequency is exactly , or 

should the text read "unlikely" instead of "likely" ?

Comment by reallyeli on Cognitive mistakes I've made about COVID-19 · 2021-01-18T08:05:45.320Z · LW · GW

+1 to copper tape being difficult to get off.

Comment by reallyeli on Alignment As A Bottleneck To Usefulness Of GPT-3 · 2020-07-26T03:11:24.450Z · LW · GW

(Not related to the overall point of your post.) I'm not so sure that GPT-3 "has the internal model to do addition," depending on what you mean by that — nostalgebraist doesn't seem to think so in this post, and a priori this seems like a surprising thing for a feedforward neural network to do.

Comment by reallyeli on What are good defense mechanisms against dangerous bullet biting? · 2020-04-22T02:38:12.101Z · LW · GW

Can you give some examples?

Like a belief that you've discovered a fantastic investment opportunity, perhaps?

Comment by reallyeli on Call for volunteers: assessing Kurzweil, 2019 · 2020-04-03T03:22:58.350Z · LW · GW

I'm interested — 10 please.

Comment by reallyeli on How does electricity work literally? · 2020-02-24T15:14:41.786Z · LW · GW

Caveat that I have no formal training in physics.

Comment by reallyeli on How does electricity work literally? · 2020-02-24T15:14:21.625Z · LW · GW

Perhaps you already know this, but some of your statements made me think you don't. In an electric circuit, individual electrons do not move from the start to the end at the speed of light. Instead, they move much more slowly. This is true regardless of whether the current is AC or DC.

The thing that travels at the speed of light is the *information* that a push has happened. There's an analogy to a tube of ping-pong balls, where pushing on one end will cause the ball at the other end to move very soon, even though no individual ball is moving very quickly.

http://wiki.c2.com/?SpeedOfElectrons
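As a rough illustration of just how slowly the electrons themselves move (my own back-of-the-envelope numbers, not from the linked page): the drift velocity in a wire is v = I / (n·q·A), and plugging in typical values for a household copper wire gives a speed on the order of hundredths of a millimeter per second.

```python
import math

# Toy drift-velocity calculation for a copper wire.
# Assumed values (standard textbook figures, not from the linked page):
# 1 A of current through a wire of 1 mm radius.
I = 1.0                  # current, amperes
n = 8.5e28               # free-electron density of copper, per m^3 (approx.)
q = 1.602e-19            # electron charge, coulombs
A = math.pi * (1e-3)**2  # cross-sectional area of a 1 mm radius wire, m^2

v_drift = I / (n * q * A)    # drift velocity, m/s
print(f"{v_drift:.2e} m/s")  # prints 2.34e-05 m/s, i.e. a few cm per hour
```

So while the "push" propagates at a large fraction of the speed of light, each individual electron crawls along at roughly 0.02 mm/s.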

Comment by reallyeli on Are "superforecasters" a real phenomenon? · 2020-01-10T00:00:51.618Z · LW · GW

(I'll back off the Superman analogy; I think it's disanalogous b/c of the discontinuity thing you point out.)

Yeah, I like the analogy "some basketball players are NBA players." It makes it sound totally unsurprising, which it is.

I don't agree that Vox is right, because:

- I can't find any evidence for the claim that forecasting ability is power-law distributed, and it's not clear what that would mean with Brier scores (as Unnamed points out).

- Their use of the term "discovered."

I don't think I'm just quibbling over semantics; I definitely had the wrong idea about superforecasters prior to thinking it through, it seems like Vox might have it too, and I'm concerned others who read the article will get the wrong idea as well.
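To make the boundedness point concrete (this is my own toy construction, not the GJP data): the two-outcome Brier score is 2(p − o)², which lives in [0, 2], so a literal power law — which describes an unbounded heavy tail — is ill-defined over it. A quick simulation of noisy forecasters shows every score landing inside that bounded range.

```python
import random

random.seed(0)

def brier(forecast_p, outcome):
    # Two-outcome Brier score: (p - o)^2 + ((1-p) - (1-o))^2 = 2(p - o)^2.
    # Ranges from 0 (perfect) to 2 (maximally wrong).
    return (forecast_p - outcome) ** 2 + ((1 - forecast_p) - (1 - outcome)) ** 2

scores = []
for _ in range(10_000):
    truth_p = random.random()        # true probability of the event
    noise = random.gauss(0, 0.1)     # forecaster's miscalibration
    forecast = min(1.0, max(0.0, truth_p + noise))
    outcome = 1 if random.random() < truth_p else 0
    scores.append(brier(forecast, outcome))

# Every score is confined to [0, 2] -- no unbounded tail for a power law
# to describe.
assert all(0.0 <= s <= 2.0 for s in scores)
```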

Comment by reallyeli on Are "superforecasters" a real phenomenon? · 2020-01-09T16:55:29.615Z · LW · GW

Agree re: power law.

The data is here https://dataverse.harvard.edu/dataverse/gjp?q=&types=files&sort=dateSort&order=desc&page=1 , so I could just find out. I posted here trying to save time, hoping someone else would already have done the analysis.

Comment by reallyeli on Are "superforecasters" a real phenomenon? · 2020-01-09T15:08:57.370Z · LW · GW

Thanks for your reply!

It looks to me like we might be thinking about different questions. Basically I'm just concerned about the sentence "Philip Tetlock discovered that 2% of people are superforecasters." When I read this sentence, it reads to me like "2% of people are superheroes" — they have performance that is way better than the rest of the population on these tasks. If you graphed "jump height" of the population and 2% of the population is Superman, there would be a clear discontinuity at the higher end. That's what I imagine when I read the sentence, and that's what I'm trying to get at above.

It looks like you're saying that this isn't true?

(It looks to me like you're discussing the question of how innate "superforecasting" is. To continue the analogy, whether superforecasters have innate powers like Superman or are just normal humans who train hard like Batman. But I think this is orthogonal to what I'm talking about. I know the sentence "are superforecasters a 'real' phenomenon" has multiple operationalizations, which is why I specified one as what I was talking about.)

Comment by reallyeli on Are "superforecasters" a real phenomenon? · 2020-01-09T06:01:09.183Z · LW · GW

Hmm, thanks for pointing that out about Brier scores. The Vox article cites https://www.vox.com/2015/8/20/9179657/tetlock-forecasting for its "power law" claim, but that piece says nothing about power laws. It does have a graph which depicts a wide gap between "superforecasters" and "top-team individuals" in years 2 and 3 of the project, and not in year 1. But my understanding is that this is because the superforecasters were put together on elite teams after the first year, so I think the graph is a bit misleading.

(Citation: the paper https://stanford.edu/~knutson/nfc/mellers15.pdf)

I do think there's disagreement between the sources — when I read sentences like this from the Vox article

Tetlock and his collaborators have run studies involving tens of thousands of participants and have discovered that prediction follows a power law distribution. That is, most people are pretty bad at it, but a few (Tetlock, in a Gladwellian twist, calls them “superforecasters”) appear to be systematically better than most at predicting world events ... Tetlock even found that superforecasters — smart, well-informed, but basically normal people with no special information — outperformed CIA analysts by about 30 percent in forecasting world events.

I definitely imagine looking at a graph of everyone's performance on the predictions and noticing a cluster who are discontinuously much better than everyone else. I would be surprised if the authors of the piece didn't imagine this as well. The article they link to does exactly what Scott warns against, saying "Tetlock's team found out that some people were 'superforecasters'."