LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

Using an LLM perplexity filter to detect weight exfiltration
Adam Karvonen (karvonenadam) · 2024-07-21T18:18:05.612Z · comments (11)

[link] Announcing Open Philanthropy's AI governance and policy RFP
Julian Hazell (julian-hazell) · 2024-07-17T02:02:39.933Z · comments (0)

[link] Transformer Debugger
Henk Tillman (henk-tillman) · 2024-03-12T19:08:56.280Z · comments (0)

Useful starting code for interpretability
eggsyntax · 2024-02-13T23:13:47.940Z · comments (2)

[link] Was Partisanship Good for the Environmental Movement?
Jeffrey Heninger (jeffrey-heninger) · 2024-05-15T17:30:54.796Z · comments (0)

A brief review of China's AI industry and regulations
Elliot Mckernon (elliot) · 2024-03-14T12:19:00.775Z · comments (0)

[link] Clickbait Soapboxing
DaystarEld · 2024-03-13T14:09:29.890Z · comments (15)

[link] Robert Caro And Mechanistic Models In Biography
adamShimi · 2024-07-14T10:56:42.763Z · comments (5)

[question] What percent of the sun would a Dyson Sphere cover?
Raemon · 2024-07-03T17:27:50.826Z · answers+comments (26)

[link] Altruism and Vitalism Aren't Fellow Travelers
Arjun Panickssery (arjun-panickssery) · 2024-08-09T02:01:11.361Z · comments (2)

[link] Cellular respiration as a steam engine
dkl9 · 2024-02-25T20:17:38.788Z · comments (1)

I didn't think I'd take the time to build this calibration training game, but with websim it took roughly 30 seconds, so here it is!
mako yass (MakoYass) · 2024-08-02T22:35:21.136Z · comments (2)

How Congressional Offices Process Constituent Communication
Tristan Williams (tristan-williams) · 2024-07-02T12:38:41.472Z · comments (0)

Scientific Method
Andrij “Androniq” Ghorbunov (andrij-androniq-ghorbunov) · 2024-02-18T21:06:45.228Z · comments (4)

Weeping Agents
pleiotroth · 2024-06-06T12:18:54.978Z · comments (2)

An evaluation of Helen Toner’s interview on the TED AI Show
PeterH · 2024-06-06T17:39:40.800Z · comments (2)

[link] Truth is Universal: Robust Detection of Lies in LLMs
Lennart Buerger · 2024-07-19T14:07:25.162Z · comments (3)

[link] Scenario planning for AI x-risk
Corin Katzke (corin-katzke) · 2024-02-10T00:14:11.934Z · comments (12)

Language and Capabilities: Testing LLM Mathematical Abilities Across Languages
Ethan Edwards · 2024-04-04T13:18:54.909Z · comments (2)

[link] Let's Design A School, Part 2.3 School as Education - The Curriculum (Phase 2, Specific)
Sable · 2024-05-15T20:58:50.981Z · comments (0)

A Basic Economics-Style Model of AI Existential Risk
Rubi J. Hudson (Rubi) · 2024-06-24T20:26:09.744Z · comments (3)

[question] How should vegans think about Methionine needs?
ChristianKl · 2024-11-10T09:28:47.655Z · answers+comments (1)

AI Safety University Organizing: Early Takeaways from Thirteen Groups
agucova · 2024-10-02T15:14:00.137Z · comments (0)

Seeking Mechanism Designer for Research into Internalizing Catastrophic Externalities
c.trout (ctrout) · 2024-09-11T15:09:48.019Z · comments (2)

[link] "25 Lessons from 25 Years of Marriage" by honorary rationalist Ferrett Steinmetz
CronoDAS · 2024-10-02T22:42:30.509Z · comments (2)

[link] The Living Planet Index: A Case Study in Statistical Pitfalls
Jan_Kulveit · 2024-06-24T10:05:55.101Z · comments (0)

[link] Tokyo AI Safety 2025: Call For Papers
Blaine (blaine-rogers) · 2024-10-21T08:43:38.467Z · comments (0)

Foresight Institute: 2023 Progress & 2024 Plans for funding beneficial technology development
Allison Duettmann (allison-duettmann) · 2023-11-22T22:09:16.956Z · comments (1)

Evolution did a surprising good job at aligning humans...to social status
Eli Tyre (elityre) · 2024-03-10T19:34:52.544Z · comments (37)

Distinctions when Discussing Utility Functions
ozziegooen · 2024-03-09T20:14:03.592Z · comments (7)

Technology path dependence and evaluating expertise
bhauth · 2024-01-05T19:21:23.302Z · comments (2)

[link] The absence of self-rejection is self-acceptance
Chipmonk · 2023-12-21T21:54:52.116Z · comments (1)

aintelope project update
Gunnar_Zarncke · 2024-02-08T18:32:00.000Z · comments (2)

Building Trust in Strategic Settings
StrivingForLegibility · 2023-12-28T22:12:24.024Z · comments (0)

[link] Review of Alignment Plan Critiques- December AI-Plans Critique-a-Thon Results
Iknownothing · 2024-01-15T19:37:07.984Z · comments (0)

[link] Alignment work in anomalous worlds
Tamsin Leake (carado-1) · 2023-12-16T19:34:26.202Z · comments (4)

Even if we lose, we win
Morphism (pi-rogers) · 2024-01-15T02:15:43.447Z · comments (17)

Defense Against The Dark Arts: An Introduction
Lyrongolem (david-xiao) · 2023-12-25T06:36:06.278Z · comments (36)

D&D.Sci Hypersphere Analysis Part 2: Nonlinear Effects & Interactions
aphyer · 2024-01-14T19:59:37.911Z · comments (0)

Anomalous Concept Detection for Detecting Hidden Cognition
Paul Colognese (paul-colognese) · 2024-03-04T16:52:52.568Z · comments (3)

[question] Would you have a baby in 2024?
martinkunev · 2023-12-25T01:52:04.358Z · answers+comments (76)

[link] Secret US natsec project with intel revealed
Nathan Helm-Burger (nathan-helm-burger) · 2024-05-25T04:22:11.624Z · comments (0)

Paper Summary: The Koha Code - A Biological Theory of Memory
jakej (jake-jenks) · 2023-12-30T22:37:13.865Z · comments (2)

[link] Compensating for Life Biases
Jonathan Moregård (JonathanMoregard) · 2024-01-09T14:39:14.229Z · comments (6)

UDT1.01: Local Affineness and Influence Measures (2/10)
Diffractor · 2024-03-31T07:35:52.831Z · comments (0)

A conceptual precursor to today's language machines [Shannon]
Bill Benzon (bill-benzon) · 2023-11-15T13:50:51.226Z · comments (6)

[question] Could there be "natural impact regularization" or "impact regularization by default"?
tailcalled · 2023-12-01T22:01:46.062Z · answers+comments (6)

[link] Extinction Risks from AI: Invisible to Science?
VojtaKovarik · 2024-02-21T18:07:33.986Z · comments (7)

Population ethics and the value of variety
cousin_it · 2024-06-23T10:42:21.402Z · comments (11)

My Alignment "Plan": Avoid Strong Optimisation and Align Economy
VojtaKovarik · 2024-01-31T17:03:34.778Z · comments (9)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

viliam on What are Emotions?

Emotions are about reality, but emotions are also a part of reality, so we also have emotions about emotions. I can feel happy about some good thing happening in the outside world. And, separately, I can feel happy about being happy.

In the thought experiments about wireheading, people often say that they don't just want to experience (possibly fake) happy thoughts about X; they also want X to actually happen.

But let's imagine the converse: what if someone proposed a surgery that would make you unable to ever feel happy about X, even if you knew that X actually happened in the world. People would probably refuse that, too. Intuitively, we want to feel good emotions that we "deserve", plus there is also the factor of motivation. Okay, so let's imagine a surgery that removes your ability to feel happy about X, but solves the problem of motivation by e.g. giving you an urge to do X. People would probably refuse that, too.

So I think we actually want both the emotions and the things the emotions are about.

mr-hire on Matt Goldenberg's Short Form Feed

A lot of people are looking at the implications of o1's training process as a future scaling paradigm, but it seems to me that this implementation of applying inference time compute to just in time fine tune the model for hard questions is equally promising and may have equally impressive results if it scales with compute, and has equal potential in terms of low hanging fruit to be picked to improve it.

Don't sleep on test time training as a potential future scaling paradigm.

turntrout on Announcing turntrout.com, my new digital home

IIRC my site checks (in descending priority):

localStorage to see if they've already told my site a light/dark preference;
whether the user's browser indicates a global light/dark preference (this is the "auto");
if there's no preference, the site defaults to light.

The idea is "I'll try doing the right thing (auto), and if the user doesn't like it they can change it and I'll listen to that choice." Possibly it will still be counterintuitive to many folks, as Said quoted in a sibling comment.

viliam on What are some positive developments in AI safety in 2024?

Welp, this was a short list.

viliam on Neutrality

Speaking only for myself, I can agree with the abstract approach (therefore: upvote), but I am not familiar with any of the existing projects mentioned in the article (therefore: no vote; because I have no idea how useful the projects actually are, and thus how useful is the list of them).

anders-lindstroem on "It's a 10% chance which I did 10 times, so it should be 100%"

Why would is the expectation to find a polyamorous partner be higher in the case you gave? Same chance per try and same number of tries should equal same expectation.

jeremy-gillen on "It's a 10% chance which I did 10 times, so it should be 100%"

Nice.

Similar rule of thumb I find handy is divide by 70 to get doubling time implied by a growth rate. I find it way easier to think about doubling times than growth rates.

E.g. 3% interest rate means 70/3 ≈ 23 year doubling time.

davekasten on Proposing the Conditional AI Safety Treaty (linkpost TIME)

Is there a longer-form version with draft treaty langugage (even an outline)? I'd be curious to read it.

gunnar_zarncke on Gunnar_Zarncke's Shortform

agents that have preferences about the state of the world in the distant future

What are these preferences? For biological agents, these preferences are grounded in some mechanism - what you call Steering System - that evaluates "desirable states" of the world in some more or less directly measurable way (grounded in perception via the senses) and derives a signal of how desirable the state is, which the brain is optimizing for. For ML models, the mechanism is somewhat different but there is also an input to the training algorithm that determines how "good" the output is. This signal is called reward and drives the system toward outputs that lead to states of high reward. But the path there depends on the specific optimization method and the algorithm has to navigate such a complex loss landscape that it can get stuck in areas of the search space that correspond to imperfect models for very long if not for ever. These imperfect models can be off in significant ways and that's why it may be useful to say that Reward is not the optimization target [LW · GW].

The connection to Intuitive Self-Models is that even though the internal models of an LLM may be very different from human self-models, I think it is still quite plausible that LLMs and other models form models of the self. Such models are instrumentally convergent [? · GW]. Humans talk about the self. The LLM does things that matches these patterns. Maybe the underlying process in humans that give rise to this is different, but humans learning about this can't know the actual process either. And in the same way the approximate model the LLM forms is not maximizing the reward signal but can be quite far from it as long it is useful (in the sense of having higher reward than other such models/parameter combinations).

I think of my toenail as “part of myself”, but I’m happy to clip it.

Sure, the (body of the) self can include parts that can be cut/destroyed without that "causing harm" but instead having an overall positive effect. The AI in a compute center would in analogy also consider decommissioning failed hardware. And when defining humanity, we do have to be careful what we mean when these "parts" could be humans.

p-b-1 on "It's a 10% chance which I did 10 times, so it should be 100%"

Related: https://en.wikipedia.org/wiki/Secretary_problem