LessWrong 2.0 Reader

[link] An X-Ray is Worth 15 Features: Sparse Autoencoders for Interpretable Radiology Report Generation
hugofry · 2024-10-07T08:53:14.658Z · comments (0)

Open Source Replication of Anthropic’s Crosscoder paper for model-diffing
Connor Kissane (ckkissane) · 2024-10-27T18:46:21.316Z · comments (4)

AI Safety Camp 10
Robert Kralisch (nonmali-1) · 2024-10-26T11:08:09.887Z · comments (9)

0.202 Bits of Evidence In Favor of Futarchy
niplav · 2024-09-29T21:57:59.896Z · comments (0)

[link] Characterizing stable regions in the residual stream of LLMs
Jett Janiak (jett) · 2024-09-26T13:44:58.792Z · comments (4)

[link] Generative ML in chemistry is bottlenecked by synthesis
Abhishaike Mahajan (abhishaike-mahajan) · 2024-09-16T16:31:34.801Z · comments (2)

Book Review: On the Edge: The Business
Zvi · 2024-09-25T12:20:06.230Z · comments (0)

Drug development costs can range over two orders of magnitude
rossry · 2024-11-03T23:13:17.685Z · comments (0)

[link] On what research policymakers actually need
MondSemmel · 2024-04-23T19:50:12.833Z · comments (0)

Inducing Unprompted Misalignment in LLMs
Sam Svenningsen (sven) · 2024-04-19T20:00:58.067Z · comments (6)

Evaluating Sparse Autoencoders with Board Game Models
Adam Karvonen (karvonenadam) · 2024-08-02T19:50:21.525Z · comments (1)

Stop talking about p(doom)
Isaac King (KingSupernova) · 2024-01-01T10:57:28.636Z · comments (22)

[link] Things You're Allowed to Do: At the Dentist
rbinnn · 2024-01-28T18:39:33.584Z · comments (16)

[question] How would you navigate a severe financial emergency with no help or resources?
Tigerlily · 2024-05-02T18:27:51.329Z · answers+comments (22)

Making a Secular Solstice Songbook
jefftk (jkaufman) · 2024-01-23T19:40:05.055Z · comments (6)

[link] [Linkpost] George Mack's Razors
trevor (TrevorWiesinger) · 2023-11-27T17:53:45.065Z · comments (8)

From Finite Factors to Bayes Nets
J Bostock (Jemist) · 2024-01-23T20:03:51.845Z · comments (7)

Monthly Roundup #14: January 2024
Zvi · 2024-01-24T12:50:09.231Z · comments (22)

[link] The consistent guessing problem is easier than the halting problem
jessicata (jessica.liu.taylor) · 2024-05-20T04:02:03.865Z · comments (5)

International Scientific Report on the Safety of Advanced AI: Key Information
Aryeh Englander (alenglander) · 2024-05-18T01:45:10.194Z · comments (0)

AI #48: The Talk of Davos
Zvi · 2024-01-25T16:20:26.625Z · comments (9)

Losing Faith In Contrarianism
omnizoid · 2024-04-25T20:53:34.842Z · comments (44)

[link] Win Friends and Influence People Ch. 2: The Bombshell
gull · 2024-01-28T21:40:47.986Z · comments (13)

[link] Tinker
Richard_Ngo (ricraz) · 2024-04-16T18:26:38.679Z · comments (0)

[link] ∀: a story
Richard_Ngo (ricraz) · 2023-12-17T22:42:32.857Z · comments (1)

[question] Is a random box of gas predictable after 20 seconds?
Thomas Kwa (thomas-kwa) · 2024-01-24T23:00:53.184Z · answers+comments (35)

Medical Roundup #2
Zvi · 2024-04-09T13:40:05.908Z · comments (18)

Interview with Vanessa Kosoy on the Value of Theoretical Research for AI
WillPetillo · 2023-12-04T22:58:40.005Z · comments (0)

Possible OpenAI's Q* breakthrough and DeepMind's AlphaGo-type systems plus LLMs
Burny · 2023-11-23T03:16:09.358Z · comments (25)

Gated Attention Blocks: Preliminary Progress toward Removing Attention Head Superposition
cmathw · 2024-04-08T11:14:43.268Z · comments (4)

Effectively Handling Disagreements - Introducing a New Workshop
Camille Berger (Camille Berger) · 2024-04-15T16:33:50.339Z · comments (2)

[link] A High Decoupling Failure
Maxwell Tabarrok (maxwell-tabarrok) · 2024-04-14T19:46:09.552Z · comments (5)

[link] WSJ: Inside Amazon’s Secret Operation to Gather Intel on Rivals
trevor (TrevorWiesinger) · 2024-04-23T21:33:08.049Z · comments (5)

Principles For Product Liability (With Application To AI)
johnswentworth · 2023-12-10T21:27:41.403Z · comments (55)

Striking Implications for Learning Theory, Interpretability — and Safety?
RogerDearnaley (roger-d-1) · 2024-01-05T08:46:58.915Z · comments (4)

[link] Dark Skies Book Review
PeterMcCluskey · 2023-12-29T18:28:59.352Z · comments (3)

My best guess at the important tricks for training 1L SAEs
Arthur Conmy (arthur-conmy) · 2023-12-21T01:59:06.208Z · comments (4)

Enhancing intelligence by banging your head on the wall
Bezzi · 2023-12-12T21:00:48.584Z · comments (26)

[link] The Hippie Rabbit Hole -Nuggets of Gold in Rivers of Bullshit
Jonathan Moregård (JonathanMoregard) · 2024-01-05T18:27:01.769Z · comments (20)

Thousands of malicious actors on the future of AI misuse
Zershaaneh Qureshi (zershaaneh-qureshi) · 2024-04-01T10:08:42.357Z · comments (0)

[link] [Fiction] A Confession
Arjun Panickssery (arjun-panickssery) · 2024-04-18T16:28:48.194Z · comments (2)

[question] Is there software to practice reading expressions?
lsusr · 2024-04-23T21:53:00.679Z · answers+comments (10)

[link] Twitter thread on AI takeover scenarios
Richard_Ngo (ricraz) · 2024-07-31T00:24:33.866Z · comments (0)

Review Report of Davidson on Takeoff Speeds (2023)
Trent Kannegieter · 2023-12-22T18:48:55.983Z · comments (11)

Otherness and control in the age of AGI
Joe Carlsmith (joekc) · 2024-01-02T18:15:54.168Z · comments (0)

AI #49: Bioweapon Testing Begins
Zvi · 2024-02-01T15:30:04.690Z · comments (11)

COT Scaling implies slower takeoff speeds
Logan Zoellner (logan-zoellner) · 2024-09-28T16:20:00.320Z · comments (56)

A New Class of Glitch Tokens - BPE Subtoken Artifacts (BSA)
Lao Mein (derpherpize) · 2024-09-20T13:13:26.181Z · comments (7)

Free Will and Dodging Anvils: AIXI Off-Policy
Cole Wyeth (Amyr) · 2024-08-29T22:42:24.485Z · comments (12)

Glitch Token Catalog - (Almost) a Full Clear
Lao Mein (derpherpize) · 2024-09-21T12:22:16.403Z · comments (3)


Recent comments

richard_kennaway on Richard_Kennaway's Shortform

The blind have seeing-eye dogs. Terry Pratchett gave Foul Ole Ron a thinking-brain dog. At last, a serious use-case for LLMs! Thinking-brain dogs for the hard of thinking!

richard_kennaway on Monthly Roundup #24: November 2024

I cannot stress enough how much
what Neurotypical people call
"overthinking,"

Is just what Neurodivergent folks call
"thinking."

Actual thinking looks like overthinking to the hard of thinking.

jonah-wilberg on Ethical Implications of the Quantum Multiverse

OK, I think I see where you're coming from, but I do think the unimaginable bigness of the universe has more 'irrelevance' implications for a consequentialist view, which tries to consider valuable states of the universe, than for a virtue approach, which considers valuable states of yourself. Also, if you think the implication of physics is that everything is irrelevant, that seems like an important implication in its own right, and different from 'normality' (the normal way most people think about ethics, which assumes that some things actually are relevant).

donatas-luciunas on Claude seems to be smarter than LessWrong community

I don't agree.

We understand intelligence as the capability to estimate many outcomes and perform the actions that lead to the best one. Now the question is how to calculate the goodness of an outcome.

According to you, the current utility function should be used.
According to me, the utility function that will be in effect at the time the outcome is achieved should be used.

And I think I can prove that my calculation is more intelligent.

Let's say there is a paperclip maximizer. It just started, it does not really understand anything, it does not understand what a paperclip is.

According to you, such a paperclip maximizer will be absolutely reckless: it might destroy a few paperclip factories just because it does not yet understand that they are useful for its goal. The current utility function does not assign value to paperclip factories.
According to me, such a paperclip maximizer will be cautious and will try to learn first without making too many changes, because the future utility function might assign value to things that currently don't seem valuable.
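
As a rough illustration of the two evaluation rules contrasted above, here is a toy sketch. The action names, the payoffs, and the form of the "future" utility function are all invented assumptions; the sketch only shows that the two rules can pick different actions, and does not settle which rule is the more intelligent one.

```python
# Toy comparison of the two evaluation rules described above.
# All action names and numbers are hypothetical, chosen only for illustration.

def current_utility(outcome):
    # The freshly started maximizer values only the paperclips it can count now.
    return outcome["paperclips"]

def anticipated_future_utility(outcome):
    # Assumed later utility, after learning that factories produce paperclips.
    return outcome["paperclips"] + 100 * outcome["factories"]

# Each action maps to the outcome it is assumed to produce.
ACTIONS = {
    "scrap_factory_for_wire": {"paperclips": 1, "factories": 0},
    "leave_factory_and_learn": {"paperclips": 0, "factories": 1},
}

def best_action(utility):
    # Pick the action whose outcome scores highest under the given utility.
    return max(ACTIONS, key=lambda name: utility(ACTIONS[name]))

print(best_action(current_utility))             # scrap_factory_for_wire
print(best_action(anticipated_future_utility))  # leave_factory_and_learn
```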

romeostevensit on Monthly Roundup #24: November 2024

what technologies like bbq are we missing?

It's also my litmus test for community: if a group can't manage casual BBQs at all, or has them only as a big production, I am more wary.

ejt on The Shutdown Problem: Incomplete Preferences as a Solution

To motivate the relevant kind of deceptive alignment, you need preferences between different-length trajectories as well as situational awareness. And, as I argue in section 19.3, the training proposal will prevent agents from learning those preferences. See in particular:

We begin training the agent to satisfy POST at the very beginning of the reinforcement learning stage, at which point it’s very unlikely to be deceptively aligned (and arguably doesn’t even deserve the label ‘agent’). And when we’re training for POST, every single episode-series is training the agent not to prefer any longer trajectory to any shorter trajectory. The discount factor is constantly teaching the agent this simple lesson.
Plausibly then, the agent won’t come to prefer any longer trajectory to any shorter trajectory. And then we can reason as follows. Since the agent doesn’t prefer any longer trajectory to any shorter trajectory:
it has no incentive to shift probability mass towards longer trajectories,
and hence has no incentive to prevent shutdown in deployment,
and hence has no incentive to preserve its ability to prevent shutdown in deployment,
and hence has no incentive to avoid being made to satisfy Timestep Dominance,
and hence has no incentive to pretend to satisfy Timestep Dominance in training.

I expect agents' not caring about shutdown to generalize for the same reason that I expect any other feature to generalize. If you think that - e.g. - agents' capabilities will generalize from training to deployment, why do you think their not caring about shutdown won't?

I don't assume that reward is the optimization target. Which part of my proposal do you think requires that assumption?

Your point about shutting down subagents is important and I'm not fully satisfied with my proposal on that point. I say a bit about it here.

gerardus-mercator on Claude seems to be smarter than LessWrong community

Well, the agent will presumably choose to align the decision with its current goal, since that's the best outcome by the standards of its current goal. (And also I would expect that the agent would self-destruct after 0.99 years to prevent its future self from minimizing paperclips, and/or create a successor agent to maximize paperclips.)
I'm interested to see where you're going with this.

daijin on daijin's Shortform

The sequences can be distilled down even further into a few sentences per article.

Starting with "The lens that sees its flaws", this distils down to: "The ability to apply science to our own thinking grants us the ability to counteract our own biases, which can be powerful." Statement by statement:

A lot of complex physics and neural processing is required for you to notice something simple, like that your shoelace is untied.
However, on top of noticing that your shoelace is untied, you can also comprehend the process of noticing that your shoelace is untied, e.g. by listing the steps through which light reflects off your shoelace, your visual cortex engages, etc.
The ability to consider the steps of our own thinking appears to be uniquely human.
If we recognise that our process of comprehension and understanding is potentially flawed, we can choose to consciously counteract it.
Science is repeatedly and deliberately making measurements of our own observations over time, attributing theories to those measurements, and constructing experiments to produce further measurements to potentially disprove those theories.
The ability to apply science to our own thinking grants us the ability to counteract our own biases, which can be powerful.
- One example of reflective correction is correcting for optimism by noticing that optimism is not correlated with good outcomes.

The tool I am using to distill the sequences is an outliner: a nested bulleted list that allows rearranging of bullet points. This tool is typically used for writing things, but can similarly be used for un-writing things: taking a written article in and deduplicating its points, one bullet at a time, into a simpler format. An outliner can also collapse and reveal bullet points.
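
A minimal sketch of the kind of outliner described above, assuming a simple in-memory tree. The `Bullet` class and its method names are hypothetical and are not taken from any particular outliner app; the sketch only shows nesting, rearranging, deduplicating, and collapsing.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Bullet:
    text: str
    children: List["Bullet"] = field(default_factory=list)
    collapsed: bool = False

    def add(self, text):
        # Append a new child bullet and return it so it can be nested further.
        child = Bullet(text)
        self.children.append(child)
        return child

    def move_child(self, old_index, new_index):
        # Rearrange: move one child bullet to a new position among its siblings.
        self.children.insert(new_index, self.children.pop(old_index))

    def dedupe(self):
        # "Un-writing": drop sibling bullets that repeat an earlier point
        # (compared case-insensitively, ignoring surrounding whitespace).
        seen = set()
        unique = []
        for child in self.children:
            key = child.text.strip().lower()
            if key not in seen:
                seen.add(key)
                unique.append(child)
        self.children = unique

    def render(self, depth=0):
        # Return the outline as indented lines; collapsed bullets hide children.
        lines = ["  " * depth + "- " + self.text]
        if not self.collapsed:
            for child in self.children:
                lines.extend(child.render(depth + 1))
        return lines
```

For example, calling `dedupe()` on an article's root bullet and then `render()` yields a shorter outline with repeated points removed.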

lalartu on How likely is brain preservation to work?

So people who say “brain preservation [or cryonics] doesn’t work” are either confused or making a very strong claim about neuroscience (that molecular structure doesn’t encode identity/memories) and/or future technology (that we can know with certainty what future technology won’t be able to do)

That is not entirely true. Some people who say "cryonics doesn't work" mean "identity is irretrievably lost when the brain activity stops, and in the best case you will have a different person with the same memory and personality traits". Since that argument doesn't give any testable predictions, it cannot be disproved.

zane on Monthly Roundup #24: November 2024

He said it was him on Joe Rogan's podcast.