LessWrong 2.0 Reader

An even deeper atheism
Joe Carlsmith (joekc) · 2024-01-11T17:28:31.843Z · comments (47)

Parasites (not a metaphor)
lemonhope (lcmgcd) · 2024-08-08T20:07:13.593Z · comments (17)

"The Solomonoff Prior is Malign" is a special case of a simpler argument
David Matolcsi (matolcsid) · 2024-11-17T21:32:34.711Z · comments (44)

A Three-Layer Model of LLM Psychology
Jan_Kulveit · 2024-12-26T16:49:41.738Z · comments (8)

Current AIs Provide Nearly No Data Relevant to AGI Alignment
Thane Ruthenis · 2023-12-15T20:16:09.723Z · comments (157)

[link] Bayesian Injustice
Kevin Dorst · 2023-12-14T15:44:08.664Z · comments (10)

[link] OpenAI's CBRN tests seem unclear
LucaRighetti (Error404Dinosaur) · 2024-11-21T17:28:30.290Z · comments (6)

Things I've Grieved
Raemon · 2024-02-18T19:32:47.169Z · comments (6)

[link] Steering Llama-2 with contrastive activation additions
Nina Panickssery (NinaR) · 2024-01-02T00:47:04.621Z · comments (29)

Natural Latents: The Math
johnswentworth · 2023-12-27T19:03:01.923Z · comments (40)

BIG-Bench Canary Contamination in GPT-4
Jozdien · 2024-10-22T15:40:48.166Z · comments (13)

[question] What do coherence arguments actually prove about agentic behavior?
sunwillrise (andrei-alexandru-parfeni) · 2024-06-01T09:37:28.451Z · answers+comments (35)

Passages I Highlighted in The Letters of J.R.R.Tolkien
Ivan Vendrov (ivan-vendrov) · 2024-11-25T01:47:59.071Z · comments (18)

Awakening
lsusr · 2024-05-30T07:03:00.821Z · comments (79)

Do you believe in hundred dollar bills lying on the ground? Consider humming
Elizabeth (pktechgirl) · 2024-05-16T00:00:05.257Z · comments (22)

Hire (or Become) a Thinking Assistant
Raemon · 2024-12-23T03:58:42.061Z · comments (46)

[link] Investigating the Chart of the Century: Why is food so expensive?
Maxwell Tabarrok (maxwell-tabarrok) · 2024-08-16T13:21:23.596Z · comments (26)

Why I take short timelines seriously
NicholasKees (nick_kees) · 2024-01-28T22:27:21.098Z · comments (29)

Evidence of Learned Look-Ahead in a Chess-Playing Neural Network
Erik Jenner (ejenner) · 2024-06-04T15:50:47.475Z · comments (14)

AI catastrophes and rogue deployments
Buck · 2024-06-03T17:04:51.206Z · comments (16)

RTFB: On the New Proposed CAIP AI Bill
Zvi · 2024-04-10T18:30:08.410Z · comments (14)

[link] The Dangers of Mirrored Life
Niko_McCarty (niko-2) · 2024-12-12T20:58:32.750Z · comments (7)

A bird's eye view of ARC's research
Jacob_Hilton · 2024-10-23T15:50:06.123Z · comments (12)

[question] Which skincare products are evidence-based?
Vanessa Kosoy (vanessa-kosoy) · 2024-05-02T15:22:12.597Z · answers+comments (48)

[link] Miles Brundage resigned from OpenAI, and his AGI readiness team was disbanded
garrison · 2024-10-23T23:40:57.180Z · comments (1)

The Standard Analogy
Zack_M_Davis · 2024-06-03T17:15:42.327Z · comments (28)

Efficient Dictionary Learning with Switch Sparse Autoencoders
Anish Mudide (anish-mudide) · 2024-07-22T18:45:53.502Z · comments (19)

AI Alignment Metastrategy
Vanessa Kosoy (vanessa-kosoy) · 2023-12-31T12:06:11.433Z · comments (13)

2024 in AI predictions
jessicata (jessica.liu.taylor) · 2025-01-01T20:29:49.132Z · comments (3)

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team
Lee Sharkey (Lee_Sharkey) · 2024-07-18T14:15:50.248Z · comments (18)

The Dream Machine
sarahconstantin · 2024-12-05T00:00:05.796Z · comments (6)

Scissors Statements for President?
AnnaSalamon · 2024-11-06T10:38:21.230Z · comments (32)

The o1 System Card Is Not About o1
Zvi · 2024-12-13T20:30:08.048Z · comments (5)

[link] My Number 1 Epistemology Book Recommendation: Inventing Temperature
adamShimi · 2024-09-08T14:30:40.456Z · comments (18)

Social status part 1/2: negotiations over object-level preferences
Steven Byrnes (steve2152) · 2024-03-05T16:29:07.143Z · comments (15)

Should CA, TX, OK, and LA merge into a giant swing state, just for elections?
Thomas Kwa (thomas-kwa) · 2024-11-06T23:01:48.992Z · comments (35)

Anthropic's Certificate of Incorporation
Zach Stein-Perlman · 2024-06-12T13:00:30.806Z · comments (7)

[link] Anthropic release Claude 3, claims >GPT-4 Performance
LawrenceC (LawChan) · 2024-03-04T18:23:54.065Z · comments (41)

Why I funded PIBBSS
Ryan Kidd (ryankidd44) · 2024-09-15T19:56:33.018Z · comments (21)

Talent Needs of Technical AI Safety Teams
yams (william-brewer) · 2024-05-24T00:36:40.486Z · comments (65)

[Fiction] [Comic] Effective Altruism and Rationality meet at a Secular Solstice afterparty
tandem · 2025-01-07T19:11:21.238Z · comments (4)

Mapping the semantic void: Strange goings-on in GPT embedding spaces
mwatkins · 2023-12-14T13:10:22.691Z · comments (31)

Rationality Research Report: Towards 10x OODA Looping?
Raemon · 2024-02-24T21:06:38.703Z · comments (21)

[link] Gender Exploration
sapphire (deluks917) · 2024-01-14T18:57:32.893Z · comments (25)

[link] introduction to cancer vaccines
bhauth · 2024-05-05T01:06:16.972Z · comments (19)

Simple versus Short: Higher-order degeneracy and error-correction
Daniel Murfet (dmurfet) · 2024-03-11T07:52:46.307Z · comments (6)

The Plan - 2024 Update
johnswentworth · 2024-12-31T13:29:53.888Z · comments (27)

The Pareto Best and the Curse of Doom
Screwtape · 2024-02-21T23:10:01.359Z · comments (21)

AIs Will Increasingly Attempt Shenanigans
Zvi · 2024-12-16T15:20:05.652Z · comments (2)

You should consider applying to PhDs (soon!)
bilalchughtai (beelal) · 2024-11-29T20:33:12.462Z · comments (19)

Recent comments

francescag on Extending control evaluations to non-scheming threats

Thanks for putting this out there. I’m on the Catalyze program in Lisa in London and have been thinking a lot about AI control (for framing, my background is risk management, so I'm coming at it from that angle).

'The high-level change I propose is to define a control evaluation as a method of bounding risk from worst-case deployment behavior generally – rather than scheming models in particular.'

I’m on board with this. If our goal is to bound the risk from worst-case deployment behaviour, I think the broader the better, and enumerating different ‘untrusted model categories’ is going to better inform our understanding of harmful behaviours. If we have post-deployment monitoring and identify new patterns of harmful behaviour, we may even identify new untrusted model categories in this way.

For post-deployment monitoring (let’s say the control eval highlighted some behaviour we wanted ongoing control over), I think the reason (i.e. the untrusted model category causing the behaviour) is less important than the observed behaviour itself. There could even be multiple applicable untrusted model categories, e.g. scheming and jailbroken, so you could have a single monitor to catch all of these. What are your thoughts on this?

‘control is a broad approach to mitigating and evaluating safety given “pessimistic assumptions” about model deployment behavior.’

Rather than ‘pessimistic assumptions’, I would describe these as ‘unknowns’, the probability of which would range somewhere between 0 and 1 based on our confidence in the representativeness of the evaluations.

I think to make control evaluations practically useful, we need to add more nuance to how we define trusted and untrusted. Coming from risk management (in Fintech and FS), I’m approaching it from the view that a model can only ever be P(Trusted)=1 (i.e. fully trusted) when bounded by certain deployment constraints. So a model is trusted to do something rather than generally trusted, e.g. it may be trusted to not write backdoored code, but may not be trusted to keep sensitive data safe if it's in a setting where users can jailbreak it.

This could also help justify claims around a ‘rarely misbehaving monitor being safe because its failures will likely be uncorrelated with attacks executed by the policy.’ Let’s say you assign a probability to this of 0.001, which is fine for a use case of coding a website but not ok for coding a banking platform. You could layer more monitors to reduce this probability for this use case, effectively getting the risk within appetite for this deployment setting e.g. to 0.000001.
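The layering arithmetic above can be sketched as follows. This is a minimal illustration, assuming each monitor misses independently with the same probability (a stronger assumption than "likely uncorrelated"); the function names `joint_miss_probability` and `monitors_needed` are hypothetical, and exact fractions are used to avoid floating-point rounding at the threshold.

```python
from fractions import Fraction
from math import prod


def joint_miss_probability(miss_probabilities):
    """Probability that every monitor misses, ASSUMING independent failures."""
    return prod(miss_probabilities, start=Fraction(1))


def monitors_needed(per_monitor_miss, risk_appetite):
    """Smallest number of identical, independent monitors whose joint
    miss probability is at or below the risk appetite."""
    if not (0 < per_monitor_miss < 1) or risk_appetite <= 0:
        raise ValueError("need 0 < per_monitor_miss < 1 and risk_appetite > 0")
    count = 1
    while per_monitor_miss ** count > risk_appetite:
        count += 1
    return count


# Figures taken from the comment: 0.001 per monitor, 0.000001 target.
single_monitor = Fraction(1, 1000)
banking_appetite = Fraction(1, 1_000_000)

print(joint_miss_probability([single_monitor, single_monitor]))  # 1/1000000
print(monitors_needed(single_monitor, banking_appetite))         # 2
```

If failures are positively correlated, the joint miss probability is higher than this product, so the result is best read as a lower bound on the number of layers needed.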

taleuntum on We probably won't just play status games with each other after AGI

As AIs don't have the same origin as humans, it is basically inconceivable to me that they will ever share the internal processes underlying their "emotions", no matter how good they get at surface "emoting". In my opinion this makes it impossible to have a true connection or meaningful relationship with them as humans, even if many people in the future will fail to see this. Otherwise I largely agree with your post: a brain implemented in silicon could be a meaningful friend to a human (e.g. ems), people will probably fully integrate AIs into the social scene (this is a mistake in my view), and people will modify their brains in (what now seem like) radical ways.

bilalchughtai on How do you deal w/ Super Stimuli?

As a general rule, I try and minimise my phone screen time and maximise my laptop screen time. I can do every "productive" task faster on a laptop than on my phone. Here are some things I do that I find helpful that I haven't yet seen discussed.

Use a very minimalist app launcher on my phone, that makes searching for apps a conscious decision.
Use a greyscale filter on my phone (which is hard to turn off), as this makes doing most things on my phone harder.

viliam on Shortform

There are many possible maps that describe the same territory. Trying to switch people to use a different map could be a good thing, or it could be a bad thing. (A person who likes the new map might describe it as "giving them fresh insights", a person who dislikes it might describe it as "manipulating them".)

Is the scientific map always better? Well, sometimes it is not available. And sometimes it is too complex. In situations where science provides a clear and simple answer, I guess following it is very likely to be the right answer. But that is not always the case, and then... what are the alternatives? Either paralysis ("I am going to ignore this topic until science finally comes with a simple answer") or some kind of greedy reductionism / focusing on what is legible ("I am going to ignore the illegible parts and focus on the part that is certain: everything, including my wife and kids, is ultimately built from atoms and anything else about them is mere superstition").

viliam on Daniel Tan's Shortform

Off topic, but your words helped me realize something. It seems like for some people it is physical attraction first, for others it is emotional connection first.

The former may perceive the latter as dishonest: if their model of the world is that for everyone it is physical attraction first (it is only natural to generalize from one example), then what you describe as "take my time getting to know someone organically", they interpret as "actually I was attracted to the person since the first sight, but I was afraid of a rejection, so I strategically pretended to be a friend first, so that I could later blackmail them into having sex by threatening to withdraw the friendship they spent a lot of time building".

Basically, from the "for everyone it is attraction first" perspective, the honest behavior is either going for the sex immediately ("hey, you're hot, let's fuck" or a more diplomatic version thereof), or deciding that you are not interested sexually, and then the alternatives are either walking away, or developing a friendship that will remain safely sexless forever.

And from the other side, complaining about the "friend zone" is basically complaining that too many people you are attracted to happen to be "physical attraction first" (and they don't find you attractive), but it takes you too long to find out.

abandon on White Lies

You have misunderstood a standard figure of speech. Here is the definition he was using: https://www.ldoceonline.com/dictionary/to-be-fair (see also https://dictionary.cambridge.org/us/dictionary/english/to-be-fair, which doesn't explicitly mention that it's typically used to offset criticisms but otherwise defines it more thoroughly).

adam-b on Introducing Fatebook: the fastest way to make and track predictions

Thanks Raemon – I'm a fan of ~all these ideas! I'm not spending much time on Fatebook specifically these days (busy with various other projects), but your feedback is and has been super useful.

richard_kennaway on Passages I Highlighted in The Letters of J.R.R.Tolkien

We are in our native language, so we should work from there.

And begin by stepping outside it.

richard_kennaway on Passages I Highlighted in The Letters of J.R.R.Tolkien

Gandalf as Ring-Lord would have been far worse than Sauron. He would have remained ‘righteous’, but self-righteous. He would have continued to rule and order things for ‘good’, and the benefit of his subjects according to his wisdom (which was and would have remained great). [The draft ends here. In the margin Tolkien wrote: ‘Thus while Sauron multiplied [illegible word] evil, he left “good” clearly distinguishable from it. Gandalf would have made good detestable and seem evil.’]

So Tolkien anticipated the rise of woke!

adam-b on Adam B's Shortform

Here's what the question creation interface looks like – you can create any question, and there's a bunch of suggestions to help you get started:

Here's what it looks like when you've created some questions. I made these predictions with my partner, so you can see her predictions on the question that's expanded:

For more info, I'd suggest just using the website itself. It is designed to be super easy to use and self-describing!

The website is part of Fatebook, a tool for rapidly tracking predictions. Predict Your Year is a specialised page for creating yearly predictions (which is a popular activity for many people, e.g. inspired by Scott Alexander's annual prediction posts). You can read more about Fatebook in this LessWrong post.