Posts

Hiatus: EA and LW post summaries 2023-05-17T17:17:08.525Z
Summaries of top forum posts (1st to 7th May 2023) 2023-05-09T09:30:35.133Z
Summaries of top forum posts (24th - 30th April 2023) 2023-05-02T02:30:59.447Z
Summaries of top forum posts (17th - 23rd April 2023) 2023-04-24T04:13:31.797Z
Summaries of top forum posts (27th March to 16th April) 2023-04-17T00:28:18.037Z
EA & LW Forum Weekly Summary (20th - 26th March 2023) 2023-03-27T20:46:55.269Z
EA & LW Forum Weekly Summary (13th - 19th March 2023) 2023-03-20T04:18:18.505Z
AI Safety - 7 months of discussion in 17 minutes 2023-03-15T23:41:37.390Z
EA & LW Forum Weekly Summary (6th - 12th March 2023) 2023-03-14T03:01:06.805Z
EA & LW Forum Weekly Summary (27th Feb - 5th Mar 2023) 2023-03-06T03:18:32.184Z
EA & LW Forum Weekly Summary (20th - 26th Feb 2023) 2023-02-27T03:46:39.842Z
EA & LW Forum Weekly Summary (6th - 19th Feb 2023) 2023-02-21T00:26:33.146Z
EA & LW Forum Weekly Summary (30th Jan - 5th Feb 2023) 2023-02-07T02:13:13.160Z
EA & LW Forum Weekly Summary (23rd - 29th Jan '23) 2023-01-31T00:36:14.553Z
EA & LW Forum Weekly Summary (16th - 22nd Jan '23) 2023-01-23T03:46:10.759Z
EA & LW Forum Summaries (9th Jan to 15th Jan 23') 2023-01-18T07:29:06.656Z
EA & LW Forum Summaries - Holiday Edition (19th Dec - 8th Jan) 2023-01-09T21:06:34.731Z
EA & LW Forums Weekly Summary (12th Dec - 18th Dec 22') 2022-12-20T09:49:51.463Z
EA & LW Forums Weekly Summary (5th Dec - 11th Dec 22') 2022-12-13T02:53:29.254Z
EA & LW Forums Weekly Summary (28th Nov - 4th Dec 22') 2022-12-06T09:38:16.552Z
EA & LW Forums Weekly Summary (14th Nov - 27th Nov 22') 2022-11-29T23:00:00.461Z
EA & LW Forums Weekly Summary (7th Nov - 13th Nov 22') 2022-11-16T03:04:39.398Z
EA & LW Forums Weekly Summary (31st Oct - 6th Nov 22') 2022-11-08T03:58:25.600Z
EA & LW Forums Weekly Summary (24 - 30th Oct 22') 2022-11-01T02:58:09.914Z
EA & LW Forums Weekly Summary (17 - 23 Oct 22') 2022-10-25T02:57:43.696Z
EA & LW Forums Weekly Summary (10 - 16 Oct 22') 2022-10-17T22:51:04.175Z
EA & LW Forums Weekly Summary (26 Sep - 9 Oct 22') 2022-10-10T23:58:22.991Z
EA & LW Forums Weekly Summary (19 - 25 Sep 22') 2022-09-28T20:18:08.650Z
EA & LW Forums Weekly Summary (12 - 18 Sep '22) 2022-09-19T05:08:43.021Z
EA & LW Forums Weekly Summary (5 - 11 Sep 22') 2022-09-12T23:24:54.499Z
EA & LW Forums Weekly Summary (28 Aug - 3 Sep 22’) 2022-09-06T11:06:25.230Z
EA & LW Forums Weekly Summary (21 Aug - 27 Aug 22') 2022-08-30T01:42:39.309Z

Comments

Comment by Zoe Williams (GreyArea) on Inner Misalignment in "Simulator" LLMs · 2023-02-01T22:06:01.722Z · LW · GW

Post summary (feel free to suggest edits!):
The author argues that the “simulators” framing for LLMs shouldn’t reassure us much about alignment. Scott Alexander has previously suggested that LLMs can be thought of as simulating various characters, eg. the “helpful assistant” character. The author broadly agrees, but notes this solves neither outer (‘be careful what you wish for’) nor inner (‘you wished for it right, but the program you got had ulterior motives’) alignment.

They give an example of each failure case: 
For outer alignment, say researchers want a chatbot that gives helpful, honest answers - but end up with a sycophant who tells the user what they want to hear. For inner alignment, imagine a prompt engineer asking the chatbot to reply with how to solve the Einstein-Durkheim-Mendel conjecture as if they were ‘Joe’, who’s awesome at quantum sociobotany. But the AI thinks the ‘Joe’ character secretly cares about paperclips, so gives an answer that will help create a paperclip factory instead.

(This will appear in this week's forum summary. If you'd like to see more summaries of top EA and LW forum posts, check out the Weekly Summaries series.)

Comment by Zoe Williams (GreyArea) on Setting the Zero Point · 2022-12-12T02:11:24.306Z · LW · GW

Post summary (feel free to suggest edits!):
‘Setting the Zero Point’ is a “Dark Art”, ie. something which causes someone else’s map to diverge from the territory in a way that’s advantageous to you. It involves speaking in a way that takes for granted that the line between ‘good’ and ‘bad’ sits at a particular point, without explicitly arguing for it. This makes changes that cross that line feel more significant than changes on either side of it.

As an example, many people draw a zero point between helping and not helping a child drowning in front of them: one is good, the other bad. The Drowning Child argument holds that this point is wrongly set, and should instead sit between helping and not helping any dying child.

The author describes 14 examples, and suggests that it’s useful to be aware of this dynamic and explicitly name zero points when you notice them.

(If you'd like to see more summaries of top EA and LW forum posts, check out the Weekly Summaries series.)

Comment by Zoe Williams (GreyArea) on Why Would AI "Aim" To Defeat Humanity? · 2022-12-06T21:13:41.222Z · LW · GW

Currently it's all done manually, but the ChatGPT summaries are pretty decent - I'm looking into which types of posts it does well on.

Comment by Zoe Williams (GreyArea) on Why Would AI "Aim" To Defeat Humanity? · 2022-12-06T09:09:31.274Z · LW · GW

Post summary (feel free to suggest edits!):
The author argues that if today’s AI development methods lead directly to powerful enough AI systems, disaster is likely by default (in the absence of specific countermeasures). 

This is because there is good economic reason to have AIs ‘aim’ at certain outcomes - eg. we might want an AI that can accomplish goals such as ‘get me a TV for a great price’. Current methods train AIs to do this via trial and error, but because we ourselves are often misinformed, we can sometimes negatively reinforce truthful behavior and positively reinforce deception that makes it look like things are going well. This can mean AIs learn an unintended aim, which, if ambitious enough, is very dangerous. There are also intermediate goals like ‘don’t get turned off’ and ‘control the world’ that are useful for almost any ambitious aim.

Warning signs for this scenario are hard to observe, because of the deception involved. There will likely still be some warning signs, but in a situation with incentives to roll out powerful AI as fast as possible, responses are likely to be inadequate.

(If you'd like to see more summaries of top EA and LW forum posts, check out the Weekly Summaries series.)

Comment by Zoe Williams (GreyArea) on Distinguishing test from training · 2022-12-06T08:55:57.183Z · LW · GW

Post summary (feel free to suggest edits!):
Various people have proposed variants of “align AGI by making it sufficiently uncertain about whether it’s in the real world versus still in training”. This seems unpromising: bad outcomes could still occur even if the AGI were convinced, and convincing it would be difficult in the first place.

A non-exhaustive list of ways it could tell it’s in reality:

  1. Reality is large (eg. some things are possible that couldn’t be easily spoofed, such as access to larger compute)
  2. It’s the first place the AI’s history could show interaction with other complex systems (eg. humans reacting at scale to the AI’s actions)
  3. It might notice the world it’s in is of the type that’s less likely to be simulated (eg. consistent physics, no bugs)

If you can understand the contents of the AI’s mind well enough to falsify every possible check it could do to determine the difference between simulation and reality, then you could use that knowledge to build a friendly AI that doesn’t need to be fooled in the first place.

(If you'd like to see more summaries of top EA and LW forum posts, check out the Weekly Summaries series.)

Comment by Zoe Williams (GreyArea) on Be less scared of overconfidence · 2022-12-06T03:52:08.649Z · LW · GW

Post summary (feel free to suggest edits!):
The author gives examples where their internal mental model suggested one conclusion, but a low-information heuristic like expert or market consensus differed, so they deferred. These included:

  • Valuing Theorem equity over Wave equity, despite Wave’s founders being very resourceful and adding users at a huge pace.
  • In the early days of Covid, dismissing it despite exponential growth and asymptomatic spread seeming intrinsically scary.

Another common case of this principle is assuming something won’t work in a particular instance because the stats for the general case are bad (eg. ‘90% of startups fail - why would this one succeed?’), or assuming something will happen similarly to past situations.

Because the largest impact comes from outlier situations, outperforming these heuristics is important. The author suggests that for important decisions, people should build a gears-level model of the decision, put substantial time into forming an inside view, and use heuristics to stress-test those views. They also suggest being ambitious, particularly when the upside is high and the downside is low.

(If you'd like to see more summaries of top EA and LW forum posts, check out the Weekly Summaries series.)

Comment by Zoe Williams (GreyArea) on The Plan - 2022 Update · 2022-12-06T03:51:14.101Z · LW · GW

Cheers, edited :)

Comment by Zoe Williams (GreyArea) on On the Diplomacy AI · 2022-12-06T03:37:23.591Z · LW · GW

Post summary (feel free to suggest edits!):
The Diplomacy AI got a handle on the basics of the game, but didn’t ‘solve’ it. It mainly does well by avoiding common mistakes, eg. failing to communicate with victims (thus signaling intention) or forgetting the game ends after the year 1908. It also benefits from anonymity, one-shot games, short round limits, etc.

Some things were easier than expected, eg. defining the problem space; communications being generic, simple, and quick enough to easily imitate and even surpass humans; no reputational or decision-theoretic considerations; and being able to respond to the existing metagame without it responding to you. Others were harder, eg. the tactical and strategic engines being lousy relative to what the author would have expected.

Overall, the author did not update much on the Diplomacy AI news in any direction, as nothing was too shocking and the surprises largely canceled out.

(If you'd like to see more summaries of top EA and LW forum posts, check out the Weekly Summaries series.)

Comment by Zoe Williams (GreyArea) on Did ChatGPT just gaslight me? · 2022-12-06T03:20:33.023Z · LW · GW

Post summary (feel free to suggest edits!):
In chatting with ChatGPT, the author found it contradicted itself and its previous answers. For instance, it said that orange juice would be a good non-alcoholic substitute for tequila because both were sweet, but when asked if tequila was sweet it said it was not. When further quizzed, it apologized for being unclear and said “When I said that tequila has a "relatively high sugar content," I was not suggesting that tequila contains sugar.”

This behavior is worrying because the system has the capacity to produce convincing, difficult to verify, completely false information. Even if this exact pattern is patched, others will likely emerge. The author guesses it produced the false information because it was trained to give outputs the user would like - in this case a non-alcoholic sub for tequila in a drink, with a nice-sounding reason.

(If you'd like to see more summaries of top EA and LW forum posts, check out the Weekly Summaries series.)

Comment by Zoe Williams (GreyArea) on The Plan - 2022 Update · 2022-12-06T03:08:22.555Z · LW · GW

Post summary (feel free to suggest edits!):
Last year, the author wrote up a plan they gave a “better than 50/50 chance” of working before AGI kills us all. It predicted that in 4-5 years, the alignment field would progress from preparadigmatic (unsure of the right questions or tools) to having a general roadmap and toolset.

They believe this is on track, and give a 40% likelihood that over the next 1-2 years the field of alignment will converge toward primarily working on decoding the internal language of neural nets - with interpretability work on the experimental side, in addition to theoretical work. This could lead to identifying which potential alignment targets (like human values, corrigibility, Do What I Mean, etc.) are likely to be naturally expressible in the internal language of neural nets, and how to express them. They think we should then focus on those.

In their personal work, they’ve found theory work has gone faster than expected, and crossing the theory-practice gap mildly slower. In 2022, most of their time went into theory work like the Basic Foundations sequence, workshops and conferences, training others, and writing up intro-level arguments on alignment strategies.

(If you'd like to see more summaries of top EA and LW forum posts, check out the Weekly Summaries series.)

Comment by Zoe Williams (GreyArea) on EA & LW Forums Weekly Summary (10 - 16 Oct 22') · 2022-10-31T02:47:27.307Z · LW · GW

Thanks for the feedback! I've passed it on.

It's mainly because we wanted to keep the episodes to ~20m, to make them easy for people to keep up with week to week - and the LW posts tended toward the more technical side, which doesn't translate as easily to podcast form (it can be hard to take in without the writing in front of you). We may do something for the LW posts in future though; unsure at this point.

Comment by Zoe Williams (GreyArea) on EA & LW Forums Weekly Summary (10 - 16 Oct 22') · 2022-10-18T19:53:32.596Z · LW · GW

Thanks, realized I forgot to add the description of the top / curated section - fixed. Everything in there occurs in its own section too.

Comment by Zoe Williams (GreyArea) on EA & LW Forums Weekly Summary (26 Sep - 9 Oct 22') · 2022-10-17T22:20:59.282Z · LW · GW

Thanks, great to hear!

Comment by Zoe Williams (GreyArea) on EA & LW Forums Weekly Summary (19 - 25 Sep 22') · 2022-09-29T18:44:13.012Z · LW · GW

Thanks for the info - added to post

Comment by Zoe Williams (GreyArea) on Survey of NLP Researchers: NLP is contributing to AGI progress; major catastrophe plausible · 2022-09-06T00:16:55.662Z · LW · GW

Super interesting, thanks!

If you were running it again, you might want to think about standardizing the wording of the questions - it varies from 'will / is' to 'is likely' to 'plausible', and this can make it hard to compare between questions. 'Plausible' in particular is quite a fuzzy word: for some it might mean 1% or more, for others it might just mean it's not completely impossible / if a movie had that storyline, they'd be okay with it.

Comment by Zoe Williams (GreyArea) on EA & LW Forums Weekly Summary (21 Aug - 27 Aug 22') · 2022-08-30T11:11:51.055Z · LW · GW

Great to hear, thanks :-)

Comment by Zoe Williams (GreyArea) on EA & LW Forums Weekly Summary (21 Aug - 27 Aug 22') · 2022-08-30T11:11:20.712Z · LW · GW

Good point, thank you - I've had a re-read of the conclusion and replaced the sentence with "Due to this, he concludes that climate change is still an important LT area - though not as important as some other global catastrophic risks (eg. biorisk), which outsize on both neglectedness and scale."

Originally I think I'd mistaken his position a bit based on this sentence: "Overall, because other global catastrophic risks are so much more neglected than climate change, I think they are more pressing to work on, on the margin." (And in addition, I hadn't used the clearest phrasing.) But the wider conclusion fits the new sentence better.