LessWrong 2.0 Reader

What does davidad want from «boundaries»?
Chipmonk · 2024-02-06T17:45:42.348Z · comments (1)

Case Study: Interpreting, Manipulating, and Controlling CLIP With Sparse Autoencoders
Gytis Daujotas (gytis-daujotas) · 2024-08-01T21:08:38.800Z · comments (6)

[link] Dangerous capability tests should be harder
LucaRighetti (Error404Dinosaur) · 2024-11-21T17:20:50.610Z · comments (3)

[link] Against Student Debt Cancellation From All Sides of the Political Compass
Maxwell Tabarrok (maxwell-tabarrok) · 2024-05-13T14:55:57.525Z · comments (16)

Open Thread Fall 2024
habryka (habryka4) · 2024-10-05T22:28:50.398Z · comments (192)

Monthly Roundup #24: November 2024
Zvi · 2024-11-18T13:20:06.086Z · comments (14)

Economics Roundup #3
Zvi · 2024-09-10T13:50:06.955Z · comments (9)

Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural
Rubi J. Hudson (Rubi) · 2024-07-16T22:44:17.128Z · comments (27)

[link] The Choice Transition
owencb · 2024-11-18T12:30:56.198Z · comments (4)

[link] Level up your spreadsheeting
angelinahli · 2024-05-25T14:57:19.730Z · comments (11)

[link] Literacy Rates Haven't Fallen By 20% Since the Department of Education Was Created
Maxwell Tabarrok (maxwell-tabarrok) · 2024-11-22T20:53:59.007Z · comments (0)

Announcing Atlas Computing
miyazono · 2024-04-11T15:56:31.241Z · comments (4)

In defense of technological unemployment as the main AI concern
tailcalled · 2024-08-27T17:58:01.992Z · comments (36)

D&D.Sci Long War: Defender of Data-mocracy Evaluation & Ruleset
aphyer · 2024-05-14T03:35:10.586Z · comments (3)

[link] Conjecture: A Roadmap for Cognitive Software and A Humanist Future of AI
Connor Leahy (NPCollapse) · 2024-12-02T13:28:57.977Z · comments (10)

Preppers Are Too Negative on Objects
jefftk (jkaufman) · 2024-12-18T02:30:01.854Z · comments (2)

Start an Upper-Room UV Installation Company?
jefftk (jkaufman) · 2024-10-19T02:00:10.691Z · comments (9)

ProLU: A Nonlinearity for Sparse Autoencoders
Glen Taggart · 2024-04-23T14:09:21.592Z · comments (4)

How difficult is AI Alignment?
Sammy Martin (SDM) · 2024-09-13T15:47:10.799Z · comments (6)

Laying the Foundations for Vision and Multimodal Mechanistic Interpretability & Open Problems
Sonia Joseph (redhat) · 2024-03-13T17:09:17.027Z · comments (13)

The need for multi-agent experiments
Martín Soto (martinsq) · 2024-08-01T17:14:16.590Z · comments (3)

Locating My Eyes (Part 3 of "The Sense of Physical Necessity")
LoganStrohl (BrienneYudkowsky) · 2024-02-29T03:09:25.810Z · comments (4)

Which LessWrong/Alignment topics would you like to be tutored in? [Poll]
Ruby · 2024-09-19T01:35:02.999Z · comments (12)

[question] Does reducing the amount of RL for a given capability level make AI safer?
Chris_Leong · 2024-05-05T17:04:01.799Z · answers+comments (22)

Understanding Positional Features in Layer 0 SAEs
bilalchughtai (beelal) · 2024-07-29T09:36:40.701Z · comments (0)

[link] Began a pay-on-results coaching experiment, made $40,300 since July
Chipmonk · 2024-12-29T21:12:02.574Z · comments (14)

New Executive Team & Board — PIBBSS
Nora_Ammann · 2024-07-01T19:30:45.261Z · comments (1)

Sci-Fi books micro-reviews
Yair Halberstadt (yair-halberstadt) · 2024-06-24T09:49:28.523Z · comments (27)

[link] Oppression and production are competing explanations for wealth inequality.
Benquo · 2025-01-05T14:13:15.398Z · comments (15)

[link] Review: Good Strategy, Bad Strategy
L Rudolf L (LRudL) · 2024-12-21T17:17:04.342Z · comments (0)

Claude's Constitutional Consequentialism?
1a3orn · 2024-12-19T19:53:33.254Z · comments (6)

Analysis of Global AI Governance Strategies
Sammy Martin (SDM) · 2024-12-04T10:45:25.311Z · comments (10)

Why does generalization work?
Martín Soto (martinsq) · 2024-02-20T17:51:10.424Z · comments (16)

Job Listing: Managing Editor / Writer
Gretta Duleba (gretta-duleba) · 2024-02-21T23:41:26.818Z · comments (2)

[question] Where is the Town Square?
Gretta Duleba (gretta-duleba) · 2024-02-13T03:53:18.205Z · answers+comments (8)

[Aspiration-based designs] 1. Informal introduction
B Jacobs (Bob Jacobs) · 2024-04-28T13:00:43.268Z · comments (4)

Childhood and Education Roundup #4
Zvi · 2024-01-30T13:50:06.033Z · comments (10)

Games for AI Control
charlie_griffin (cjgriffin) · 2024-07-11T18:40:50.607Z · comments (0)

Concrete empirical research projects in mechanistic anomaly detection
Erik Jenner (ejenner) · 2024-04-03T23:07:21.502Z · comments (3)

The Case for Predictive Models
Rubi J. Hudson (Rubi) · 2024-04-03T18:22:20.243Z · comments (7)

Ambiguity in Prediction Market Resolution is Still Harmful
aphyer · 2024-07-31T20:32:40.217Z · comments (17)

Take SCIFs, it’s dangerous to go alone
latterframe · 2024-05-01T08:02:38.067Z · comments (1)

Wholesomeness and Effective Altruism
owencb · 2024-02-28T20:28:22.175Z · comments (3)

Evidential Cooperation in Large Worlds: Potential Objections & FAQ
Chi Nguyen · 2024-02-28T18:58:25.688Z · comments (5)

Deep and obvious points in the gap between your thoughts and your pictures of thought
KatjaGrace · 2024-02-23T07:30:07.461Z · comments (6)

[link] you should probably eat oatmeal sometimes
bhauth · 2024-08-25T14:50:37.570Z · comments (32)

A Robust Natural Latent Over A Mixed Distribution Is Natural Over The Distributions Which Were Mixed
johnswentworth · 2024-08-22T19:19:28.940Z · comments (4)

Estimating efficiency improvements in LLM pre-training
Daan · 2024-01-19T19:32:45.124Z · comments (3)

Protocol evaluations: good analogies vs control
Fabien Roger (Fabien) · 2024-02-19T18:00:09.794Z · comments (10)

[link] Surgery Works Well Without The FDA
Maxwell Tabarrok (maxwell-tabarrok) · 2024-01-26T13:31:29.968Z · comments (28)

Recent comments

eric-dalva on Wagering on Will And Worth (Pascals Wager for Free Will and Value)

There exists something (objective value) universally compelling that any sufficiently advanced mind (/bayesian agent) would recognize as valuable - something possibly beyond our evolutionary happenstance and/or something timeles

Hey Robert,

I was thinking about this post again when reading [No Universally Compelling Arguments](https://www.lesswrong.com/posts/PtoQdG7E8MxYJrigu/no-universally-compelling-arguments) and was curious how you interpret this statement with respect to the post. Do you disagree with NUCA? Do you have some other interpretation? I'd be happy to hear what you think.

Eric

gwern on Policymakers don't have access to paywalled articles

Yeah, I was afraid that might apply here. It seems like you should still be able to do something like "government employee tier" subscriptions, not targeted at an individual but perhaps something like 'GS-8 and up', set low enough that it would appeal to such customers, perhaps? It is not a gift but a discount, it is not to an individual but to a class, it is part of a market, and it is not conditional on any government action or inaction, and such discounts are very common for 'students', 'veterans', 'first responders' etc, and I've never seen any fine print warning government employees about it being >$20 despite many such discounts potentially crossing that threshold (e.g. Sam's Club offers $50 off a new membership, and that seems clearly >$20, and to be doing it through a whole company devoted to this sort of discount, ID.me).

But I suppose that might be too complex for SA to be interested in bothering with?

gwern on When is reward ever the optimization target?

Yes. (And they can learn to predict and estimate the reward too to achieve even higher reward than simply optimizing the reward. For example, if you included an input, which said which arm had the reward, the RNN would learn to use that, and so would be able to change its decision without experiencing a single negative reward. A REINFORCE or evolution-strategies meta-trained RNN would have no problem with learning such a policy, which attempts to learn or infer the reward each episode in order to choose the right action.)
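The point about a cue input can be illustrated with a toy two-armed bandit. This is only a sketch under stated assumptions: the two policies below are hand-written stand-ins, not a meta-trained RNN, and every name (`run_episode`, `cue_policy`, `win_stay_lose_shift`, `count_negative_rewards`) is hypothetical. It shows that a policy which reads the cue never has to experience a negative reward, while a policy without it must sample the wrong arm to find out.

```python
def run_episode(policy, rewarded_arm, steps=10):
    """Play one episode of a two-armed bandit and return the rewards received.

    The rewarded arm pays +1 and the other arm pays -1.
    The policy sees a cue naming the rewarded arm (which it may ignore),
    plus its own previous action and reward.
    """
    rewards = []
    last_action, last_reward = None, None
    for _ in range(steps):
        action = policy(rewarded_arm, last_action, last_reward)
        reward = 1 if action == rewarded_arm else -1
        rewards.append(reward)
        last_action, last_reward = action, reward
    return rewards


def cue_policy(cue, last_action, last_reward):
    """Hypothetical policy that has learned to trust the cue input."""
    return cue


def win_stay_lose_shift(cue, last_action, last_reward):
    """Hypothetical policy that ignores the cue and learns from experienced reward."""
    if last_action is None:
        return 0  # arbitrary first guess
    return last_action if last_reward > 0 else 1 - last_action


def count_negative_rewards(policy, episodes=100):
    """Total number of -1 rewards over episodes that alternate the rewarded arm."""
    total = 0
    for episode in range(episodes):
        rewarded_arm = episode % 2  # deterministic alternation, no randomness
        total += sum(r < 0 for r in run_episode(policy, rewarded_arm))
    return total


if __name__ == "__main__":
    print("cue policy negatives:", count_negative_rewards(cue_policy))
    print("win-stay-lose-shift negatives:", count_negative_rewards(win_stay_lose_shift))
```

The win-stay-lose-shift policy pays one negative reward in every episode where its first guess is wrong; the cue-reading policy switches arms between episodes without ever being punished.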

Nor is it at all guaranteed that 'the dog will wag the tail' - depending on circumstances, the tail may successfully wag the dog indefinitely. Maybe the outer level will be able to override the inner, maybe not. Because after all, the outer level may no longer exist, or may be too slow to be relevant, or may be changed (especially by the inner level). To continue the human example, we were created by evolution on genes, but within a lifetime, evolution has no effect on the policy and so even if evolution 'wants' to modify a human brain to do something other than what that brain does, it cannot operate within-lifetime (except at even lower levels of analysis, like in cancers or cell lineages etc); or, if the human brain is a digital emulation of a brain snapshot, it is no longer affected by evolution at all; and even if it does start to mold human brains, it is such a slow high-variance optimizer that it might take hundreds of thousands or millions of years... and there probably won't even be biological humans by that point, never mind the rapid progress over the next 1-3 generations in 'seizing the means of reproduction' if you will. (As pointed out in the context of Von Neumann probes or gray goo, if you add in error-correction, it is entirely possible to make replication so reliable that the universe will burn out before any meaningful level of evolution can happen, per the Price equation. The light speed delay to colonization also implies that 'cancers' will struggle to spread much if they take more than a handful of generations.)
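For reference, the Price equation mentioned above can be written in its standard form as follows. The symbols are the usual textbook ones rather than anything defined in this comment: $w_i$ is the fitness of lineage $i$, $z_i$ its trait value, $\Delta z_i$ the change in that trait between parent and offspring, and bars denote population means.

```latex
\bar{w}\,\Delta\bar{z} \;=\; \operatorname{Cov}(w_i, z_i) \;+\; \operatorname{E}\!\left(w_i\,\Delta z_i\right)
```

Read against the argument above: error correction pushes the transmission term $\operatorname{E}(w_i\,\Delta z_i)$ toward zero, and if the replicators are near-identical copies there is almost no trait variation for fitness to covary with, so the selection term $\operatorname{Cov}(w_i, z_i)$ is also close to zero and the mean trait barely changes.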

jiro on If all trade is voluntary, then what is "exploitation?"

Sometimes the company is stupid because of principal/agent problems which benefit the particular boss involved, but are bad for the company as a whole.

erich_grunewald on Human takeover might be worse than AI takeover

You talk later about evolving to be selfish; not only is the story for humans far more complicated (why do humans often offer an even split in the ultimatum game?), but also humans talk a nicer game than they act (see construal level theory, or social-desirability bias). Once you start looking at AI agents who have similar affordances and incentives that humans have, I think you'll see a lot of the same behaviors.

Some people have looked at this, sorta:

I think I'd guess roughly that "Claude is probably more altruistic and cooperative than the median Western human, most other models are probably about the same, or a bit worse, in these simulated scenarios". But of course a major difference here is that the LLMs don't actually have anything on the line -- they don't stand to earn or lose any money, for example, and if they did, they would have nothing to do with the money. So you might expect them to be more altruistic and cooperative than they would be under the conditions in which humans are tested.

zack_m_davis on Enemies vs Malefactors

At the time, I remarked to some friends that it felt weird that this was being presented as a new insight to this audience in 2023 rather than already being local conventional wisdom.[1] (Compare "Bad Intent Is a Disposition, Not a Feeling" (2017) or "Algorithmic Intent" (2020).) Better late than never!

[1] The "status" line at the top does characterize it as partially "common wisdom", but it's currently #14 in the 2023 Review 1000+ karma voting, suggesting novelty to the audience.

weightt-an on weightt an's Shortform

I would really love it if some "let's make ASI" people put some effort into making bad outcomes less bad. Like, it would really suck if we are going to be trapped in endless corporate punk hell, with superintelligent nannies with correct (tm) opinions. Or infinite wedding parties or whatever. Just make sure that if you fuck up we all just get eaten by nanobots please. Permanent entrapment in misery would be a lot worse.

jacob_cannell on How will we update about scheming?

> Training processes with varying (apparent) situational awareness
>
> 1:2.5 The AI seemingly isn't aware it is an AI except for a small fraction of training which isn't where much of the capabilities are coming from. For instance, the system is pretrained on next token prediction, our evidence strongly indicates that the system doesn't know it is an AI when doing next token prediction (which likely requires being confident that it isn't internally doing a substantial amount of general-purpose thinking about what to think about), and there is only a small RL process which isn't where much of the capabilities are coming from.

Abilities/intelligence come almost entirely from pretraining, so all the situational awareness and scheming capability that current (and future similar) frontier models possess is thus also mostly present in the base model. The fact that you need to prompt them to summon out a situationally aware scheming agent doesn't seem like much of a barrier, and indeed strong frontier base models are so obviously misaligned/jail-breakable/dangerous that releasing them to the public is PR-harmful enough to motivate RLHF post-training purely for selfish profit motives.

> This implies that restricting when AIs become (saliently) aware that they are an AI could be a promising intervention, to the extent this is possible without greatly reducing competitiveness.

Who cares if it greatly reduces competitiveness in experimental training runs?

We need to figure out how to align superhuman models - models trained with > 1e25 efficient flops on the current internet/knowledge, which requires experimental iteration. We probably won't get multiple iteration attempts for aligning SI 'in prod', so we need to iterate in simulation (what you now call 'model organisms').

We need to find alignment training methods that work even when the agent has superhuman intelligence/inference. But 'superhuman' here is relative - measured against our capabilities. The straightforward easy way to accomplish this is training agents in simulations with much earlier knowledge cutoff dates, which isn't theoretically hard - just requires constructing augmented historical training datasets. So you could train on a 10T+ token dataset of human writings/thoughts with cutoff 2010, or 1950, or 1700, etc. These base models wouldn't be capable of simulating/summoning realistic situationally aware agents, their RL-derived agents wouldn't be situationally sim-aware either, etc.
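As a rough illustration of the dataset-construction step only, the cutoff idea amounts to filtering a dated corpus. This is a minimal sketch under assumptions: the `Document` type, the `filter_by_cutoff` helper, and the toy corpus are all hypothetical, and a real pipeline would also have to handle undated or misdated text and later material that leaks in through reprints or commentary.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Document:
    """Hypothetical record: a piece of text plus the date it was written."""
    text: str
    written_on: date


def filter_by_cutoff(docs, cutoff_year):
    """Keep only documents written strictly before January 1 of cutoff_year."""
    cutoff = date(cutoff_year, 1, 1)
    return [doc for doc in docs if doc.written_on < cutoff]


# Toy corpus with made-up placeholder entries, for illustration only.
TOY_CORPUS = [
    Document("a", date(1699, 5, 1)),
    Document("b", date(1949, 12, 31)),
    Document("c", date(2009, 6, 1)),
    Document("d", date(2015, 1, 1)),
]

if __name__ == "__main__":
    for year in (2010, 1950, 1700):
        kept = [doc.text for doc in filter_by_cutoff(TOY_CORPUS, year)]
        print(year, kept)
```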

sharmake-farah on When is reward ever the optimization target?

So in essence, even if reward truly isn't the optimization target at the outer level, that doesn't imply that all policies trained do not maximize the reward, right?

joey-kl on Drake Thomas's Shortform

Interesting, I can see why that would be a feature. I don't mind the taste at all actually. Before, I had some of their smaller citrus-flavored kind, and they dissolved super quick and made me a little nauseous. I can see these ones being better in that respect.