LessWrong 2.0 Reader

← previous page (newer posts) · next page (older posts) →

[link] Self-Resolving Prediction Markets
PeterMcCluskey · 2024-03-03T02:39:42.212Z · comments (0)
[link] OpenAI, DeepMind, Anthropic, etc. should shut down.
Tamsin Leake (carado-1) · 2023-12-17T20:01:22.332Z · comments (48)
[link] Talking With People Who Speak to Congressional Staffers about AI risk
Eneasz · 2023-12-14T17:55:50.606Z · comments (0)
[link] A computational complexity argument for many worlds
jessicata (jessica.liu.taylor) · 2024-08-13T19:35:10.116Z · comments (15)
A quick experiment on LMs’ inductive biases in performing search
Alex Mallen (alex-mallen) · 2024-04-14T03:41:08.671Z · comments (2)
5 ways to improve CoT faithfulness
CBiddulph (caleb-biddulph) · 2024-10-05T20:17:12.637Z · comments (34)
[question] What Software Should Exist?
Tomás B. (Bjartur Tómas) · 2024-01-19T21:43:50.112Z · answers+comments (27)
Comparing Quantized Performance in Llama Models
NickyP (Nicky) · 2024-07-15T16:01:24.960Z · comments (2)
Open Thread Fall 2024
habryka (habryka4) · 2024-10-05T22:28:50.398Z · comments (114)
Learning Math in Time for Alignment
Nicholas / Heather Kross (NicholasKross) · 2024-01-09T01:02:37.446Z · comments (3)
[link] the subreddit size threshold
bhauth · 2024-01-23T00:38:13.747Z · comments (3)
[link] Concrete benefits of making predictions
Jonny Spicer (jonnyspicer) · 2024-10-17T14:23:17.613Z · comments (5)
Housing Roundup #10
Zvi · 2024-10-29T13:50:09.416Z · comments (2)
How I build and run behavioral interviews
benkuhn · 2024-02-26T05:50:05.328Z · comments (6)
A path to human autonomy
Nathan Helm-Burger (nathan-helm-burger) · 2024-10-29T03:02:42.475Z · comments (12)
[link] NAO Updates, Fall 2024
jefftk (jkaufman) · 2024-10-18T00:00:04.142Z · comments (2)
[link] Why you, personally, should want a larger human population
jasoncrawford · 2024-02-23T19:48:10.526Z · comments (32)
Is suffering like shit?
KatjaGrace · 2024-05-31T01:20:03.855Z · comments (5)
Monthly Roundup #13: December 2023
Zvi · 2023-12-19T15:10:08.293Z · comments (5)
On Not Requiring Vaccination
jefftk (jkaufman) · 2024-02-01T19:20:12.657Z · comments (21)
[LDSL#1] Performance optimization as a metaphor for life
tailcalled · 2024-08-08T16:16:27.349Z · comments (4)
Apply to MATS 7.0!
Ryan Kidd (ryankidd44) · 2024-09-21T00:23:49.778Z · comments (0)
Mentorship in AGI Safety (MAGIS) call for mentors
Valentin2026 (Just Learning) · 2024-05-23T18:28:03.173Z · comments (3)
How Would an Utopia-Maximizer Look Like?
Thane Ruthenis · 2023-12-20T20:01:18.079Z · comments (23)
[link] A Narrative History of Environmentalism's Partisanship
Jeffrey Heninger (jeffrey-heninger) · 2024-05-14T16:51:01.029Z · comments (3)
Protestants Trading Acausally
Martin Sustrik (sustrik) · 2024-04-01T14:46:26.374Z · comments (4)
I was raised by devout Mormons, AMA [&|] Soliciting Advice
ErioirE (erioire) · 2024-03-13T16:52:19.130Z · comments (41)
Extracting SAE task features for in-context learning
Dmitrii Kharlapenko (dmitrii-kharlapenko) · 2024-08-12T20:34:13.747Z · comments (1)
Incentive design and capability elicitation
Joe Carlsmith (joekc) · 2024-11-12T20:56:05.088Z · comments (0)
Retrospective: PIBBSS Fellowship 2023
DusanDNesic · 2024-02-16T17:48:32.151Z · comments (1)
Attention Output SAEs Improve Circuit Analysis
Connor Kissane (ckkissane) · 2024-06-21T12:56:07.969Z · comments (0)
Meme Talking Points
ymeskhout · 2024-11-06T15:27:54.024Z · comments (0)
[link] Fifty Flips
abstractapplic · 2023-10-01T15:30:43.268Z · comments (14)
A more systematic case for inner misalignment
Richard_Ngo (ricraz) · 2024-07-20T05:03:03.500Z · comments (4)
[question] When is reward ever the optimization target?
Noosphere89 (sharmake-farah) · 2024-10-15T15:09:20.912Z · answers+comments (12)
[question] When did Eliezer Yudkowsky change his mind about neural networks?
[deactivated] (Yarrow Bouchard) · 2023-11-14T21:24:00.000Z · answers+comments (15)
Some Quick Follow-Up Experiments to “Taken out of context: On measuring situational awareness in LLMs”
Miles Turpin (miles) · 2023-10-03T02:22:00.199Z · comments (0)
Late-talking kid part 3: gestalt language learning
Steven Byrnes (steve2152) · 2023-10-17T02:00:05.182Z · comments (5)
UDT1.01: Plannable and Unplanned Observations (3/10)
Diffractor · 2024-04-12T05:24:34.435Z · comments (0)
Context-dependent consequentialism
Jeremy Gillen (jeremy-gillen) · 2024-11-04T09:29:24.310Z · comments (6)
Balancing Label Quantity and Quality for Scalable Elicitation
Alex Mallen (alex-mallen) · 2024-10-24T16:49:00.939Z · comments (1)
[link] Aaron Silverbook on anti-cavity bacteria
DanielFilan · 2023-11-20T03:06:19.524Z · comments (3)
[link] Lying is Cowardice, not Strategy
Connor Leahy (NPCollapse) · 2023-10-24T13:24:25.450Z · comments (73)
Superforecasting the premises in “Is power-seeking AI an existential risk?”
Joe Carlsmith (joekc) · 2023-10-18T20:23:51.723Z · comments (3)
[link] [Linkpost] Statement from Scarlett Johansson on OpenAI's use of the "Sky" voice, that was shockingly similar to her own voice.
Linch · 2024-05-20T23:50:28.138Z · comments (8)
Good Bings copy, great Bings steal
dr_s · 2024-04-21T09:52:46.658Z · comments (6)
[link] Anthropic, Google, Microsoft & OpenAI announce Executive Director of the Frontier Model Forum & over $10 million for a new AI Safety Fund
Zach Stein-Perlman · 2023-10-25T15:20:52.765Z · comments (8)
Game Theory without Argmax [Part 2]
Cleo Nardo (strawberry calm) · 2023-11-11T16:02:41.836Z · comments (14)
Features and Adversaries in MemoryDT
Joseph Bloom (Jbloom) · 2023-10-20T07:32:21.091Z · comments (6)
AI labs can boost external safety research
Zach Stein-Perlman · 2024-07-31T19:30:16.207Z · comments (1)