LessWrong 2.0 Reader


How might we solve the alignment problem? (Part 1: Intro, summary, ontology)
Joe Carlsmith (joekc) · 2024-10-28T21:57:12.063Z · comments (5)

Provably Safe AI: Worldview and Projects
bgold · 2024-08-09T23:21:02.763Z · comments (43)

[link] Prices are Bounties
Maxwell Tabarrok (maxwell-tabarrok) · 2024-10-12T14:51:40.689Z · comments (13)

How to Give in to Threats (without incentivizing them)
Mikhail Samin (mikhail-samin) · 2024-09-12T15:55:50.384Z · comments (26)

Llama Llama-3-405B?
Zvi · 2024-07-24T19:40:07.565Z · comments (9)

Model evals for dangerous capabilities
Zach Stein-Perlman · 2024-09-23T11:00:00.866Z · comments (9)

[link] Anthropic's updated Responsible Scaling Policy
Zac Hatfield-Dodds (zac-hatfield-dodds) · 2024-10-15T16:46:48.727Z · comments (3)

[Intuitive self-models] 6. Awakening / Enlightenment / PNSE
Steven Byrnes (steve2152) · 2024-10-22T13:23:08.836Z · comments (5)

Sherlockian Abduction Master List
Cole Wyeth (Amyr) · 2024-07-11T20:27:00.000Z · comments (63)

Unlearning via RMU is mostly shallow
Andy Arditi (andy-arditi) · 2024-07-23T16:07:52.223Z · comments (3)

Consent across power differentials
Ramana Kumar (ramana-kumar) · 2024-07-09T11:42:03.177Z · comments (12)

AI #82: The Governor Ponders
Zvi · 2024-09-19T13:30:04.863Z · comments (8)

Applications of Chaos: Saying No (with Hastings Greer)
Elizabeth (pktechgirl) · 2024-09-21T16:30:07.415Z · comments (16)

[link] Can AI Outpredict Humans? Results From Metaculus's Q3 AI Forecasting Benchmark
ChristianWilliams · 2024-10-10T18:58:46.041Z · comments (2)

An issue with training schemers with supervised fine-tuning
Fabien Roger (Fabien) · 2024-06-27T15:37:56.020Z · comments (12)

Interoperable High Level Structures: Early Thoughts on Adjectives
johnswentworth · 2024-08-22T21:12:38.223Z · comments (1)

Book Review: Righteous Victims - A History of the Zionist-Arab Conflict
Yair Halberstadt (yair-halberstadt) · 2024-06-24T11:02:03.490Z · comments (8)

So you want to work on technical AI safety
gw · 2024-06-24T14:29:57.481Z · comments (3)

The Fragility of Life Hypothesis and the Evolution of Cooperation
KristianRonn · 2024-09-04T21:04:49.878Z · comments (6)

[link] On scalable oversight with weak LLMs judging strong LLMs
zac_kenton (zkenton) · 2024-07-08T08:59:58.523Z · comments (18)

[LDSL#0] Some epistemological conundrums
tailcalled · 2024-08-07T19:52:55.688Z · comments (10)

Low Probability Estimation in Language Models
Gabriel Wu (gabriel-wu) · 2024-10-18T15:50:05.947Z · comments (0)

[link] DM Parenting
Shoshannah Tekofsky (DarkSym) · 2024-07-16T08:50:08.144Z · comments (4)

[link] The Evals Gap
Marius Hobbhahn (marius-hobbhahn) · 2024-11-11T16:42:46.287Z · comments (7)

Evaluating the truth of statements in a world of ambiguous language.
Hastings (hastings-greer) · 2024-10-07T18:08:09.920Z · comments (19)

SRE's review of Democracy
Martin Sustrik (sustrik) · 2024-08-03T07:20:01.483Z · comments (2)

[question] If I wanted to spend WAY more on AI, what would I spend it on?
Logan Zoellner (logan-zoellner) · 2024-09-15T21:24:46.742Z · answers+comments (16)

An alternative approach to superbabies
Towards_Keeperhood (Simon Skade) · 2024-11-05T22:56:15.740Z · comments (19)

[link] Book review: Xenosystems
jessicata (jessica.liu.taylor) · 2024-09-16T20:17:56.670Z · comments (18)

AI and the Technological Richter Scale
Zvi · 2024-09-04T14:00:08.625Z · comments (8)

Interested in Cognitive Bootcamp?
Raemon · 2024-09-19T22:12:13.348Z · comments (0)

[link] Active Recall and Spaced Repetition are Different Things
Saul Munn (saul-munn) · 2024-11-08T20:14:56.092Z · comments (2)

[link] JumpReLU SAEs + Early Access to Gemma 2 SAEs
Senthooran Rajamanoharan (SenR) · 2024-07-19T16:10:54.664Z · comments (10)

[link] Contra Acemoglu on AI
Maxwell Tabarrok (maxwell-tabarrok) · 2024-06-28T13:13:15.796Z · comments (0)

Why the Best Writers Endure Isolation
Declan Molony (declan-molony) · 2024-07-16T05:58:25.032Z · comments (6)

Misnaming and Other Issues with OpenAI's “Human Level” Superintelligence Hierarchy
Davidmanheim · 2024-07-15T05:50:17.770Z · comments (2)

Extended Interview with Zhukeepa on Religion
Ben Pace (Benito) · 2024-08-18T03:19:05.625Z · comments (59)

D&D.Sci Coliseum: Arena of Data Evaluation and Ruleset
aphyer · 2024-10-29T01:21:03.075Z · comments (12)

Toward Safety Case Inspired Basic Research
Lucas Teixeira · 2024-10-31T23:06:32.854Z · comments (2)

Caring about excellence
owencb · 2024-07-22T14:24:37.892Z · comments (4)

How do we know that "good research" is good? (aka "direct evaluation" vs "eigen-evaluation")
Ruby · 2024-07-19T00:31:38.332Z · comments (21)

Targeted Manipulation and Deception Emerge when Optimizing LLMs for User Feedback
Marcus Williams · 2024-11-07T15:39:06.854Z · comments (6)

Demis Hassabis and Geoffrey Hinton Awarded Nobel Prizes
Anna Gajdova (anna-gajdova) · 2024-10-09T12:56:24.856Z · comments (14)

[Intuitive self-models] 8. Rooting Out Free Will Intuitions
Steven Byrnes (steve2152) · 2024-11-04T18:16:26.736Z · comments (9)

Bounty for Evidence on Some of Palisade Research's Beliefs
benwr · 2024-09-23T20:01:20.917Z · comments (4)

[link] Michael Dickens' Caffeine Tolerance Research
niplav · 2024-09-04T15:41:53.343Z · comments (3)

Humanity isn't remotely longtermist, so arguments for AGI x-risk should focus on the near term
Seth Herd · 2024-08-12T18:10:56.543Z · comments (10)

[link] Robin Hanson AI X-Risk Debate — Highlights and Analysis
Liron · 2024-07-12T21:31:02.222Z · comments (7)

Decision Theory in Space
lsusr · 2024-08-18T07:02:11.847Z · comments (18)

I finally got ChatGPT to sound like me
lsusr · 2024-09-17T09:39:59.415Z · comments (18)


Recent comments

bogdan-ionut-cirstea on Alignment By Default

A few additional relevant recent papers: Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models, Towards Universality: Studying Mechanistic Similarity Across Language Model Architectures.

Similarly, the argument in this post and e.g. in Robust agents learn causal world models seem to me to suggest that we should probably also expect something like universal (approximate) circuits, which it might be feasible to automate the discovery of using perhaps a similar procedure to the one demo-ed in Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models.

aidan-o-gara on o1 is a bad idea

Process supervision seems like a plausible o1 training approach but I think it would conflict with this:

We believe that a hidden chain of thought presents a unique opportunity for monitoring models. Assuming it is faithful and legible, the hidden chain of thought allows us to "read the mind" of the model and understand its thought process. For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user. However, for this to work the model must have freedom to express its thoughts in unaltered form, so we cannot train any policy compliance or user preferences onto the chain of thought.

I think it might just be outcome-based RL, training the CoT to maximize the probability of correct answers or maximize human preference reward model scores or minimize next-token entropy.

bensenberner on Open Thread Fall 2024

Sure!

notfnofn on Flipping Out: The Cosmic Coinflip Thought Experiment Is Bad Philosophy

Are you familiar with Kelly betting? The point of maximizing log expectation instead of pure expectation isn't that happiness grows on a logarithmic scale or whatever; it's for the sake of maximizing long-term expected value. This kills off making bets where "0" is on the table (as log(0) is minus infinity). Whether or not that's appropriate is still an interesting topic for discussion because, as you mentioned, x-risks exist anyway.
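To make the Kelly point concrete, here is a minimal sketch for a simple binary bet. The function names and the example numbers (a 60% chance of winning at even odds) are assumptions chosen for illustration, not anything from the post under discussion. It compares plain expected wealth with expected log wealth for different stake sizes.

```python
import math

def expected_wealth(f, p, b):
    """Expected wealth multiplier after one bet, staking fraction f.

    p is the probability of winning, b the net odds (win b per unit staked).
    """
    return p * (1.0 + b * f) + (1.0 - p) * (1.0 - f)

def expected_log_growth(f, p, b):
    """Expected log of the wealth multiplier after one bet."""
    win = 1.0 + b * f
    lose = 1.0 - f
    if lose <= 0.0:
        # Staking everything leaves zero wealth on a loss, and log(0) is
        # minus infinity, so any chance of losing rules this bet out.
        return -math.inf if p < 1.0 else math.log(win)
    return p * math.log(win) + (1.0 - p) * math.log(lose)

def kelly_fraction(p, b):
    """Stake that maximizes expected log growth for a binary bet."""
    return max(0.0, p - (1.0 - p) / b)

if __name__ == "__main__":
    p, b = 0.6, 1.0  # illustrative numbers only
    for f in (0.1, kelly_fraction(p, b), 0.3, 1.0):
        print(f, expected_wealth(f, p, b), expected_log_growth(f, p, b))
```

With these numbers, plain expectation is highest when staking everything, while the log criterion picks a stake of about 20% and assigns minus infinity to the all-in bet.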

nathan-helm-burger on Theories With Mentalistic Atoms Are As Validly Called Theories As Theories With Only Non-Mentalistic Atoms

I think AI safety isn't as much a matter of government policy as you seem to think. Currently, sure. Frontier models are so expensive to train that only the big labs can do it. Models have limited agentic capabilities, even at the frontier.

But we are rushing towards a point where science makes intelligence and learning better understood. Open source models are getting rapidly more powerful and cheap.

In a few years, the trend suggests that any individual could create a dangerously powerful AI using a personal computer.

Any law which fails to protect society if even a single individual chooses to violate it once... is not a very protective law. Historical evidence suggests that occasionally some people break laws. Especially when there's a lot of money and power on offer in exchange for the risk.

What happens at that point depends a lot on the details of the lawbreaker's creation. With what probability will it end up agentic, coherent, conscious, self-improvement capable, escape and self-replication capable, Omohundro goal driven (survival focused, resource and power hungry), etc.?

The probability seems unlikely to me to be zero for the sorts of qualities which would make such an AI agent dangerous. Then we must ask questions about the efficacy of governments in detecting and stopping such AI agents before they become catastrophically powerful.

kylefurlong on The Humanitarian Economy

I find this comment flippant and unworthy of a community like LessWrong. First of all, you’re denying the politics of millions of earnest people, many as educated and gifted as you, and second of all, you’re equating a 21st century democratically steered market economy with the totalitarian central planning of 20th century Stalinism. You’re right that no one wants that.

tristan-tran on LifeKeeper Diaries: Exploring Misaligned AI Through Interactive Fiction

I love this! Thank you for the feedback.

We could definitely build some more plot into the narration engine. Right now it's a pretty simple concept, but I love this direction.

kylefurlong on The Humanitarian Economy

You’re absolutely right, the post is light on details. To answer a few of your points: I don’t have a deep understanding of housing market dynamics beyond the bad deals and pressures I’ve heard about from many different people, especially in the Bay Area. If we were to develop this into a full proposal for public consumption it would include an analysis of how housing subsidy on both the demand and supply side would affect real outcomes. However, that’s somewhat beside the point, as that analysis has nothing to do with the soundness of the system as a whole, and actively denies the good it’s hoping to promulgate.

The debit card system is much less like food stamps than it is like dynamic UBI with constraints. You may have missed the part where a new corps of inspector-accountants validates businesses before they qualify to participate in the program. Once they do, they get a widget that adds a unique nonce to their transaction strings that the system validates. This solves a lot of the problems you mention. Also, the dynamic and constrained nature of the cards solves many of the issues people have with UBI: that people would spend on trivialities and non-essentials, and that it wouldn’t be enough to make a difference economically. Could someone buy $500 of alcohol with their five $100 retail transactions a week and drink themselves to death? Sure. That’s their prerogative, and their community Target worker (or wherever, just not the liquor store) could ask them if they’re ok the second time if they believe that shouldn’t happen. Further, if a majority of people find that unconscionable, the system could disallow drugs and alcohol at approved stores.

tag on Theories With Mentalistic Atoms Are As Validly Called Theories As Theories With Only Non-Mentalistic Atoms

It seems common for people trying to talk about AI extinction to get hung up on whether statements derived from abstract theories containing mentalistic atoms can have objective truth or falsity values. They can. And if we can first agree on such basic elements of our ontology/epistemology as that one agent can be objectively smarter than another, that we can know whether something that lives in a physical substrate that is unlike ours is conscious, and that there can be some degree of objective truth as to what is valuable [not that all beings that are merely intelligent will necessarily pursue these things], it in fact becomes much more natural to make clear statements and judgments in the abstract or general case, about what very smart non-aligned agents will in fact do to the physical world.

Why does any of that matter for AI safety? AI safety is a matter of public policy. In public policy making, you have a set of preferences, which you get from votes or surveys, and you formulate policy based on your best objective understanding of cause and effect. The preferences don't have to be objective, because they are taken as given. It's quite different to philosophy, because you are trying to achieve or avoid something, not figure out what something ultimately is. You don't have to answer Wolfram's questions in their own terms, because you can challenge the framing.

And if we can first agree on such basic elements of our ontology/epistemology as that one agent can be objectively smarter than another,

It's not all that relevant to AI safety, because an AI only needs some potentially dangerous capabilities. Admittedly, a lot of the literature gives the opposite impression.

that we can know whether something that lives in a physical substrate that is unlike ours is conscious,

You haven't defined consciousness and you haven't explained how. It doesn't follow automatically from considerations about intelligence. And it doesn't follow from having some mentalistic terms in our theories.

and that there can be some degree of objective truth as to what is valuable

There doesn't need to be. You don't have to solve ethics to set policy.

atillayasar on AtillaYasar's Shortform

Posts as nodes -- what's beautiful about LessWrong

I'm new to this site as a writer (and a writer in general), and I read LW's user guide to think more clearly about what kind of articles are expected and about why people are here. Direct quote:

LessWrong is a good place for someone who:
values curiosity, learning, self-improvement, figuring out what's actually true (rather than just what you want to be true or just winning arguments)
will change their mind or admit they're wrong in response to compelling evidence or argument
wants to work collaboratively with others to figure out what's true
likes acknowledging and quantifying uncertainty and applying lessons from probability, statistics, and decision theory to your reasoning
is nerdy and interested in all questions of how the world works and who is not afraid to reach weird conclusions if the arguments seem valid
likes to be pedantic and precise, and likes to bet on their beliefs
doesn't mind reading a lot

There is a style of communication and thought that summarizes the spirit of most of these. It's when your presentation is structured like a graph of dependencies.

"I believe x because of y" is much better than "I believe x"

Changing your mind is built into this type of writing: the writer says that they believe x because they believe that y => x, and/or that they believe x to the degree that y is true. It allows collaboration, because someone else can point out other implications of y, or that y isn't true, or that y => x doesn't hold, etc., and then the writer has to change their mind. Or, put another way:

When you make the graph explicit, including the edges, the audience can judge the way things are connected, in addition to the conclusion.

It's like open-sourcing the code (where your conclusion is like the "app").

(and you can arrive at weird conclusions for free, since you're simply following the graph)
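As a rough sketch of the "graph of dependencies" idea, here is one way it could look in code. The `supports` mapping and the claim names are hypothetical, made up for illustration. Each claim lists the premises it rests on, and rejecting one premise flags every claim that depends on it, directly or indirectly, as needing another look (not as automatically false).

```python
from collections import deque

def affected_claims(supports, rejected):
    """Return every claim that depends, directly or indirectly, on `rejected`.

    `supports` maps each claim to the list of premises it rests on.
    """
    # Invert the graph: premise -> claims that cite it.
    dependents = {}
    for claim, premises in supports.items():
        for premise in premises:
            dependents.setdefault(premise, []).append(claim)

    # Breadth-first walk outward from the rejected premise.
    seen = set()
    queue = deque([rejected])
    while queue:
        node = queue.popleft()
        for claim in dependents.get(node, []):
            if claim not in seen:
                seen.add(claim)
                queue.append(claim)
    return seen

# Hypothetical example: "x because y", "w because x and z", "v because z".
supports = {"x": ["y"], "w": ["x", "z"], "v": ["z"]}

if __name__ == "__main__":
    print(sorted(affected_claims(supports, "y")))
```

If a reader shows that y is false, x and w get flagged while v is untouched. That is only possible because the edges were written down.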

But what about exploratory thinking?

You simply take your exploratory post, identify the parts that are solid, refactor the post into self-contained things with explicit paths of reasoning, and you take those posts as nodes to reason about and speculate about!

(maybe a better way to put it is that it encourages factoring out solid elements of an idea or conclusion, and that you have the Quick Takes feature for doing exploration?? idk this second part is way less solid lol)