LessWrong 2.0 Reader


[link] AI forecasting bots incoming
Dan H (dan-hendrycks) · 2024-09-09T19:14:31.050Z · comments (44)

Experience Report - ML4Good AI Safety Bootcamp
Kieron Kretschmar · 2024-04-11T18:03:41.040Z · comments (0)

Weekly newsletter for AI safety events and training programs
Bryce Robertson (bryceerobertson) · 2024-05-03T00:33:29.418Z · comments (0)

Deception Chess: Game #2
Zane · 2023-11-29T02:43:22.375Z · comments (17)

“Clean” vs. “messy” goal-directedness (Section 2.2.3 of “Scheming AIs”)
Joe Carlsmith (joekc) · 2023-11-29T16:32:30.068Z · comments (1)

[link] The Poker Theory of Poker Night
omark · 2024-04-07T09:47:01.658Z · comments (13)

Reviewing the Structure of Current AI Regulations
Deric Cheng (deric-cheng) · 2024-05-07T12:34:17.820Z · comments (0)

[question] Weighing reputational and moral consequences of leaving Russia or staying
spza · 2024-02-18T19:36:40.676Z · answers+comments (24)

AI #61: Meta Trouble
Zvi · 2024-05-02T18:40:03.242Z · comments (0)

[link] My MATS Summer 2023 experience
James Chua (james-chua) · 2024-03-20T11:26:14.944Z · comments (0)

Wholesome Culture
owencb · 2024-03-01T12:08:17.877Z · comments (3)

End-to-end hacking with language models
tchauvin (timot.cool) · 2024-04-05T15:06:53.689Z · comments (0)

Results from the Turing Seminar hackathon
Charbel-Raphaël (charbel-raphael-segerie) · 2023-12-07T14:50:38.377Z · comments (1)

Searching for phenomenal consciousness in LLMs: Perceptual reality monitoring and introspective confidence
EuanMcLean (euanmclean) · 2024-10-29T12:16:18.448Z · comments (7)

Tackling Moloch: How YouCongress Offers a Novel Coordination Mechanism
Hector Perez Arenas (hector-perez-arenas) · 2024-05-15T23:13:48.501Z · comments (9)

[link] A new process for mapping discussions
Nathan Young · 2024-09-30T08:57:20.029Z · comments (7)

Reading More Each Day: A Simple $35 Tool
aysajan · 2024-07-24T13:54:04.290Z · comments (2)

[link] AI Impacts 2023 Expert Survey on Progress in AI
habryka (habryka4) · 2024-01-05T19:42:17.226Z · comments (1)

[link] ML Safety Research Advice - GabeM
Gabe M (gabe-mukobi) · 2024-07-23T01:45:42.288Z · comments (2)

Aggregative principles approximate utilitarian principles
Cleo Nardo (strawberry calm) · 2024-06-12T16:27:22.179Z · comments (3)

Updates to Open Phil’s career development and transition funding program
abergal · 2023-12-04T18:10:29.394Z · comments (0)

Ackshually, many worlds is wrong
tailcalled · 2024-04-11T20:23:59.416Z · comments (42)

Escaping Skeuomorphism
Stuart Johnson (stuart-johnson) · 2023-12-20T03:51:00.489Z · comments (0)

Heuristics for preventing major life mistakes
SK2 (lunchbox) · 2023-12-20T08:01:09.340Z · comments (2)

Employee Incentives Make AGI Lab Pauses More Costly
nikola (nikolaisalreadytaken) · 2023-12-22T05:04:15.598Z · comments (12)

[link] Cellular reprogramming, pneumatic launch systems, and terraforming Mars: Some things I learned about at Foresight Vision Weekend
jasoncrawford · 2024-01-04T19:33:57.887Z · comments (0)

AI #90: The Wall
Zvi · 2024-11-14T14:10:04.562Z · comments (6)

[link] Conversation Visualizer
ethanmorse · 2023-12-31T01:18:01.424Z · comments (4)

Can quantised autoencoders find and interpret circuits in language models?
charlieoneill (kingchucky211) · 2024-03-24T20:05:50.125Z · comments (4)

Evaporation of improvements
Viliam · 2024-06-20T18:34:40.969Z · comments (27)

AI #64: Feel the Mundane Utility
Zvi · 2024-05-16T15:20:02.956Z · comments (11)

Cryonics p(success) estimates are only weakly associated with interest in pursuing cryonics in the LW 2023 Survey
Andy_McKenzie · 2024-02-29T14:47:28.613Z · comments (6)

Auditing LMs with counterfactual search: a tool for control and ELK
Jacob Pfau (jacob-pfau) · 2024-02-20T00:02:09.575Z · comments (6)

Solstice 2023 Roundup
dspeyer · 2023-10-11T23:09:08.252Z · comments (6)

[link] If-Then Commitments for AI Risk Reduction [by Holden Karnofsky]
habryka (habryka4) · 2024-09-13T19:38:53.194Z · comments (0)

Collection (Part 6 of "The Sense Of Physical Necessity")
LoganStrohl (BrienneYudkowsky) · 2024-03-14T21:37:00.160Z · comments (0)

Trading Candy
jefftk (jkaufman) · 2024-11-01T01:10:08.024Z · comments (4)

An Affordable CO2 Monitor
Pretentious Penguin (dylan-mahoney) · 2024-03-21T03:06:53.255Z · comments (1)

An explanation of evil in an organized world
KatjaGrace · 2024-05-02T05:20:06.240Z · comments (9)

3. Premise three & Conclusion: AI systems can affect value change trajectories & the Value Change Problem
Nora_Ammann · 2023-10-26T14:38:14.916Z · comments (4)

[link] Quick Thoughts on Scaling Monosemanticity
Joel Burget (joel-burget) · 2024-05-23T16:22:48.035Z · comments (1)

[link] AI Safety at the Frontier: Paper Highlights, August '24
gasteigerjo · 2024-09-03T19:17:24.850Z · comments (0)

Cicadas, Anthropic, and the bilateral alignment problem
kromem · 2024-05-22T11:09:56.469Z · comments (6)

[link] New blog: Expedition to the Far Lands
Connor Leahy (NPCollapse) · 2024-08-17T11:07:48.537Z · comments (3)

Monthly Roundup #19: June 2024
Zvi · 2024-06-25T12:00:03.333Z · comments (9)

Childhood and Education Roundup #6: College Edition
Zvi · 2024-06-26T11:40:03.990Z · comments (8)

AI #65: I Spy With My AI
Zvi · 2024-05-23T12:40:02.793Z · comments (7)

Deconfusing “ontology” in AI alignment
Dylan Bowman (dylan-bowman) · 2023-11-08T20:03:43.205Z · comments (3)

Online Dialogues Party — Sunday 5th November
Ben Pace (Benito) · 2023-10-27T02:41:00.506Z · comments (1)

[link] Memo on some neglected topics
Lukas Finnveden (Lanrian) · 2023-11-11T02:01:55.834Z · comments (2)


Recent comments

richard_kennaway on Monthly Roundup #24: November 2024

I cannot stress enough how much
what Neurotypical people call
"overthinking,"

Is just what Neurodivergent folks call
"thinking."

Actual thinking looks like overthinking to the hard of thinking.

jonah-wilberg on Ethical Implications of the Quantum Multiverse

OK I think I see where you're coming from - but I do think the unimaginable bigness of the universe has more 'irrelevance' implications for a consequentialist view which tries to consider valuable states of the universe than for a virtue approach which considers valuable states of yourself. Also if you think the implication of physics is that everything is irrelevant, that seems like an important implication in its own right, and different from 'normality' (the normal way most people think about ethics, which assumes that some things actually are relevant).

donatas-luciunas on Claude seems to be smarter than LessWrong community

I don't agree.

We understand intelligence as the capability to estimate many outcomes and perform the actions that will lead to the best outcome. Now the question is how to calculate the goodness of an outcome.

According to you, the current utility function should be used.
According to me, the utility function that will be in effect at the time the outcome is achieved should be used.

And I think I can prove that my calculation is more intelligent.

Let's say there is a paperclip maximizer. It just started, it does not really understand anything, it does not understand what a paperclip is.

According to you, such a paperclip maximizer will be absolutely reckless: it might destroy a few paperclip factories just because it does not yet understand that they are useful for its goal. The current utility function does not assign value to paperclip factories.
According to me, such a paperclip maximizer will be cautious and will try to learn first without making too many changes, because the future utility function might assign value to things that currently don't seem valuable.
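
The contrast drawn in this comment can be put in a toy sketch. Everything here is hypothetical and chosen only for illustration: the action names, the outcome numbers, and both utility functions. In particular, the sketch assumes the later utility function is already known, which a real agent would not have access to. It only shows that the two scoring rules can pick different actions; it does not settle which rule is the right one.

```python
from typing import Callable, Dict

Outcome = Dict[str, int]

# Hypothetical outcomes of two candidate actions.
ACTIONS: Dict[str, Outcome] = {
    "destroy_factory_for_scrap": {"paperclips": 1, "factories": 0},
    "wait_and_learn": {"paperclips": 0, "factories": 1},
}


def current_utility(outcome: Outcome) -> int:
    """Assumed early utility: counts paperclips, ignores factories."""
    return outcome["paperclips"]


def later_utility(outcome: Outcome) -> int:
    """Assumed later utility: also values factories (weight is made up)."""
    return outcome["paperclips"] + 1000 * outcome["factories"]


def best_action(utility: Callable[[Outcome], int]) -> str:
    """Pick the action whose outcome scores highest under `utility`."""
    return max(ACTIONS, key=lambda name: utility(ACTIONS[name]))


choice_now = best_action(current_utility)
choice_later = best_action(later_utility)
print(choice_now, choice_later)
```

Under these made-up numbers, scoring with the current function selects the destructive action, while scoring with the later function selects the cautious one.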

romeostevensit on Monthly Roundup #24: November 2024

what technologies like bbq are we missing?

It's also my litmus test for community: if a group can't succeed at casual BBQs at all, or has them but they have to be a big production, I am more wary.

ejt on The Shutdown Problem: Incomplete Preferences as a Solution

To motivate the relevant kind of deceptive alignment, you need preferences between different-length trajectories as well as situational awareness. And (as I argue in section 19.3 [LW · GW]) the training proposal will prevent agents from learning those preferences. See in particular:

We begin training the agent to satisfy POST at the very beginning of the reinforcement learning stage, at which point it’s very unlikely to be deceptively aligned (and arguably doesn’t even deserve the label ‘agent’). And when we’re training for POST, every single episode-series is training the agent not to prefer any longer trajectory to any shorter trajectory. The discount factor is constantly teaching the agent this simple lesson.
Plausibly then, the agent won’t come to prefer any longer trajectory to any shorter trajectory. And then we can reason as follows. Since the agent doesn’t prefer any longer trajectory to any shorter trajectory:
it has no incentive to shift probability mass towards longer trajectories,
and hence has no incentive to prevent shutdown in deployment,
and hence has no incentive to preserve its ability to prevent shutdown in deployment,
and hence has no incentive to avoid being made to satisfy Timestep Dominance,
and hence has no incentive to pretend to satisfy Timestep Dominance in training.

I expect agents' not caring about shutdown to generalize for the same reason that I expect any other feature to generalize. If you think that - e.g. - agents' capabilities will generalize from training to deployment, why do you think their not caring about shutdown won't?

I don't assume that reward is the optimization target. Which part of my proposal do you think requires that assumption?

Your point about shutting down subagents is important and I'm not fully satisfied with my proposal on that point. I say a bit about it here [LW · GW].

gerardus-mercator on Claude seems to be smarter than LessWrong community

Well, the agent will presumably choose to align the decision with its current goal, since that's the best outcome by the standards of its current goal. (And also I would expect that the agent would self-destruct after 0.99 years to prevent its future self from minimizing paperclips, and/or create a successor agent to maximize paperclips.)
I'm interested to see where you're going with this.

daijin on daijin's Shortform

The sequences can be distilled down even further into a few sentences per article.

Starting with "The lens that sees its flaws": this distils down to: "The ability to apply science to our own thinking grants us the ability to counteract our own biases, which can be powerful." Statement by statement:

A lot of complex physics and neural processing is required for you to notice something simple, like that your shoelace is untied.
However, on top of noticing that your shoelace is untied, you can also comprehend the process of noticing it, e.g. by listing the steps through which light reflects off your shoelace, your visual cortex engages, and so on.
The ability to consider the steps of our own thinking appears to be uniquely human.
If we recognise that our process of comprehension and understanding is potentially flawed, we can choose to consciously counteract it.
Science is repeatedly and deliberately making measurements of our own observations over time, attributing theories to those measurements, and constructing experiments to produce further measurements to potentially disprove those theories.
The ability to apply science to our own thinking grants us the ability to counteract our own biases, which can be powerful.
- One example of reflective correction is correcting for optimism by noticing that optimism is not correlated with good outcomes.

The tool I am using to distill the sequences is an outliner: a nested bulleted list that allows rearranging of bullet points. This tool is typically used for writing things, but can similarly be used for un-writing things: taking a written article in and deduplicating its points, one bullet at a time, into a simpler format. An outliner can also collapse and reveal bullet points.
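
A minimal sketch of the outliner idea described above, assuming a tree of bullets that supports rearranging, deduplicating, and collapsing. The `Bullet` class and its method names are hypothetical illustrations, not the actual tool used in this comment, and the duplicate check here is a naive exact-text match.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Bullet:
    """One outline item; children are nested bullets."""
    text: str
    children: list[Bullet] = field(default_factory=list)
    collapsed: bool = False

    def add(self, text: str) -> Bullet:
        """Append a new child bullet and return it."""
        child = Bullet(text)
        self.children.append(child)
        return child

    def move_child(self, old_index: int, new_index: int) -> None:
        """Rearrange a bullet among its siblings."""
        self.children.insert(new_index, self.children.pop(old_index))

    def dedupe_children(self) -> None:
        """Drop later siblings whose text repeats an earlier sibling."""
        seen = set()
        unique = []
        for child in self.children:
            key = child.text.strip().lower()
            if key not in seen:
                seen.add(key)
                unique.append(child)
        self.children = unique

    def render(self, depth: int = 0) -> list[str]:
        """Return visible lines; a collapsed bullet hides its children."""
        marker = "+" if self.collapsed and self.children else "-"
        lines = ["  " * depth + marker + " " + self.text]
        if not self.collapsed:
            for child in self.children:
                lines.extend(child.render(depth + 1))
        return lines


root = Bullet("The lens that sees its flaws")
root.add("We can inspect our own thinking")
root.add("Science can be applied to our own thinking")
root.add("We can inspect our own thinking")  # repeated point
root.dedupe_children()   # "un-writing": remove the repeated point
root.move_child(1, 0)    # rearrange the remaining bullets
print("\n".join(root.render()))
```

Setting `root.collapsed = True` and rendering again would show only the top-level line, which corresponds to the collapse and reveal behaviour mentioned in the comment.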

lalartu on How likely is brain preservation to work?

So people who say “brain preservation [or cryonics] doesn’t work” are either confused or making a very strong claim about neuroscience (that molecular structure doesn’t encode identity/memories) and/or future technology (that we can know with certainty what future technology won’t be able to do)

That is not entirely true. Some people who say "cryonics doesn't work" mean "identity is irretrievably lost when the brain activity stops, and in the best case you will have a different person with the same memory and personality traits". Since that argument doesn't give any testable predictions, it cannot be disproved.

zane on Monthly Roundup #24: November 2024

He said it was him on Joe Rogan's podcast.

dakara on The Plan - 2023 Version

That's a really good point. I would like to see John address it, because it seems quite crucial for the overall alignment plan.