LessWrong 2.0 Reader

← previous page (newer posts) · next page (older posts) →

AI Constitutions are a tool to reduce societal scale risk
Sammy Martin (SDM) · 2024-07-25T11:18:17.826Z · comments (2)
AI #74: GPT-4o Mini Me and Llama 3
Zvi · 2024-07-25T13:50:06.528Z · comments (6)
A Case for Superhuman Governance, using AI
ozziegooen · 2024-06-07T00:10:10.902Z · comments (0)
Sparse MLP Distillation
slavachalnev · 2024-01-15T19:39:02.926Z · comments (3)
Putting multimodal LLMs to the Tetris test
Lovre · 2024-02-01T16:02:12.367Z · comments (5)
"Full Automation" is a Slippery Metric
ozziegooen · 2024-06-11T19:56:49.855Z · comments (1)
Running the Numbers on a Heat Pump
jefftk (jkaufman) · 2024-02-09T03:00:04.920Z · comments (12)
The Third Gemini
Zvi · 2024-02-20T19:50:05.195Z · comments (2)
Some additional SAE thoughts
Hoagy · 2024-01-13T19:31:40.089Z · comments (4)
The Intentional Stance, LLMs Edition
Eleni Angelou (ea-1) · 2024-04-30T17:12:29.005Z · comments (3)
Interpreting Quantum Mechanics in Infra-Bayesian Physicalism
Yegreg · 2024-02-12T18:56:03.967Z · comments (6)
I played the AI box game as the Gatekeeper — and lost
datawitch · 2024-02-12T18:39:35.777Z · comments (53)
The Math of Suspicious Coincidences
Roko · 2024-02-07T13:32:35.513Z · comments (3)
Improving Model-Written Evals for AI Safety Benchmarking
Sunishchal Dev (sunishchal-dev) · 2024-10-15T18:25:08.179Z · comments (0)
[link] What I expected from this site: A LessWrong review
Nathan Young · 2024-12-20T11:27:39.683Z · comments (5)
AI Safety Seed Funding Network - Join as a Donor or Investor
Alexandra Bos (AlexandraB) · 2024-12-16T19:30:43.812Z · comments (0)
Call for evaluators: Participate in the European AI Office workshop on general-purpose AI models and systemic risks
Tom DAVID (tom-david) · 2024-11-27T02:54:16.263Z · comments (0)
People aren't properly calibrated on FrontierMath
cakubilo · 2024-12-23T19:35:44.467Z · comments (4)
A Principled Cartoon Guide to NVC
plex (ete) · 2025-01-07T21:01:07.904Z · comments (5)
[link] My Methodological Turn
adamShimi · 2024-09-29T15:01:45.986Z · comments (0)
Aligning AI Safety Projects with a Republican Administration
Deric Cheng (deric-cheng) · 2024-11-21T22:12:27.502Z · comments (1)
Acknowledging Background Information with P(Q|I)
JenniferRM · 2024-12-24T18:50:25.323Z · comments (8)
Compositionality and Ambiguity: Latent Co-occurrence and Interpretable Subspaces
Matthew A. Clarke (Antigone) · 2024-12-20T15:16:51.857Z · comments (0)
You can validly be seen and validated by a chatbot
Kaj_Sotala · 2024-12-20T12:00:03.015Z · comments (3)
[question] Why are there no interesting (1D, 2-state) quantum cellular automata?
Optimization Process · 2024-11-26T00:11:37.833Z · answers+comments (13)
[link] A new process for mapping discussions
Nathan Young · 2024-09-30T08:57:20.029Z · comments (8)
[link] AI forecasting bots incoming
Dan H (dan-hendrycks) · 2024-09-09T19:14:31.050Z · comments (44)
[link] Evaluating Synthetic Activations composed of SAE Latents in GPT-2
Giorgi Giglemiani (Rakh) · 2024-09-25T20:37:48.227Z · comments (0)
Disagreement on AGI Suggests It’s Near
tangerine · 2025-01-07T20:42:43.456Z · comments (15)
Inference-Time-Compute: More Faithful? A Research Note
James Chua (james-chua) · 2025-01-15T04:43:00.631Z · comments (4)
The new ruling philosophy regarding AI
Mitchell_Porter · 2024-11-11T13:28:24.476Z · comments (0)
[link] What fuels your ambition?
Cissy · 2024-01-31T18:30:53.274Z · comments (1)
Throughput vs. Latency
alkjash · 2024-01-12T21:37:07.632Z · comments (2)
D&D.Sci Hypersphere Analysis Part 1: Datafields & Preliminary Analysis
aphyer · 2024-01-13T20:16:39.480Z · comments (1)
Investigating Bias Representations in LLMs via Activation Steering
DawnLu · 2024-01-15T19:39:14.077Z · comments (4)
[question] Weighing reputational and moral consequences of leaving Russia or staying
spza · 2024-02-18T19:36:40.676Z · answers+comments (24)
[link] Abs-E (or, speak only in the positive)
dkl9 · 2024-02-19T21:14:32.095Z · comments (24)
Wholesome Culture
owencb · 2024-03-01T12:08:17.877Z · comments (3)
[link] Agreeing With Stalin in Ways That Exhibit Generally Rationalist Principles
Zack_M_Davis · 2024-03-02T22:05:49.553Z · comments (25)
[link] My MATS Summer 2023 experience
James Chua (james-chua) · 2024-03-20T11:26:14.944Z · comments (0)
Please Understand
samhealy · 2024-04-01T12:33:20.459Z · comments (11)
[link] The Poker Theory of Poker Night
omark · 2024-04-07T09:47:01.658Z · comments (13)
Experience Report - ML4Good AI Safety Bootcamp
Kieron Kretschmar · 2024-04-11T18:03:41.040Z · comments (0)
Reviewing the Structure of Current AI Regulations
Deric Cheng (deric-cheng) · 2024-05-07T12:34:17.820Z · comments (0)
Scorable Functions: A Format for Algorithmic Forecasting
ozziegooen · 2024-05-21T04:14:11.749Z · comments (0)
Quick Thoughts on Our First Sampling Run
jefftk (jkaufman) · 2024-05-23T00:20:02.050Z · comments (3)
Aggregative Principles of Social Justice
Cleo Nardo (strawberry calm) · 2024-06-05T13:44:47.499Z · comments (10)
Offering Completion
jefftk (jkaufman) · 2024-06-07T01:40:02.137Z · comments (6)
DPO/PPO-RLHF on LLMs incentivizes sycophancy, exaggeration and deceptive hallucination, but not misaligned powerseeking
tailcalled · 2024-06-10T21:20:11.938Z · comments (13)
[question] What Other Lines of Work are Safe from AI Automation?
RogerDearnaley (roger-d-1) · 2024-07-11T10:01:12.616Z · answers+comments (35)