LessWrong 2.0 Reader

AI #59: Model Updates
Zvi · 2024-04-11T14:20:06.339Z · comments (2)

Against "argument from overhang risk"
RobertM (T3t) · 2024-05-16T04:44:00.318Z · comments (11)

[link] Baking vs Patissing vs Cooking, the HPS explanation
adamShimi · 2024-07-17T20:29:09.645Z · comments (16)

AI Constitutions are a tool to reduce societal scale risk
Sammy Martin (SDM) · 2024-07-25T11:18:17.826Z · comments (2)

AIS terminology proposal: standardize terms for probability ranges
eggsyntax · 2024-08-30T15:43:39.857Z · comments (12)

A Case for Superhuman Governance, using AI
ozziegooen · 2024-06-07T00:10:10.902Z · comments (0)

The Intentional Stance, LLMs Edition
Eleni Angelou (ea-1) · 2024-04-30T17:12:29.005Z · comments (3)

The Math of Suspicious Coincidences
Roko · 2024-02-07T13:32:35.513Z · comments (3)

Fun With CellxGene
sarahconstantin · 2024-09-06T22:00:03.461Z · comments (2)

"Full Automation" is a Slippery Metric
ozziegooen · 2024-06-11T19:56:49.855Z · comments (1)

AI #85: AI Wins the Nobel Prize
Zvi · 2024-10-10T13:40:07.286Z · comments (6)

[link] Safety tax functions
owencb · 2024-10-20T14:08:38.099Z · comments (0)

[link] [Paper] Hidden in Plain Text: Emergence and Mitigation of Steganographic Collusion in LLMs
Yohan Mathew (ymath) · 2024-09-25T14:52:48.263Z · comments (2)

[link] Anthropic: Reflections on our Responsible Scaling Policy
Zac Hatfield-Dodds (zac-hatfield-dodds) · 2024-05-20T04:14:44.435Z · comments (21)

Understanding Subjective Probabilities
Isaac King (KingSupernova) · 2023-12-10T06:03:27.958Z · comments (16)

[link] AISN #28: Center for AI Safety 2023 Year in Review
aogara (Aidan O'Gara) · 2023-12-23T21:31:40.767Z · comments (1)

Running the Numbers on a Heat Pump
jefftk (jkaufman) · 2024-02-09T03:00:04.920Z · comments (12)

Adversarial Robustness Could Help Prevent Catastrophic Misuse
aogara (Aidan O'Gara) · 2023-12-11T19:12:26.956Z · comments (18)

[link] How "Pause AI" advocacy could be net harmful
Tamsin Leake (carado-1) · 2023-12-26T16:19:20.724Z · comments (10)

I played the AI box game as the Gatekeeper — and lost
datawitch · 2024-02-12T18:39:35.777Z · comments (52)

Interpreting Quantum Mechanics in Infra-Bayesian Physicalism
Yegreg · 2024-02-12T18:56:03.967Z · comments (6)

Some additional SAE thoughts
Hoagy · 2024-01-13T19:31:40.089Z · comments (4)

Information-Theoretic Boxing of Superintelligences
JustinShovelain · 2023-11-30T14:31:11.798Z · comments (0)

[link] 2024 State of the AI Regulatory Landscape
Deric Cheng (deric-cheng) · 2024-05-28T11:59:06.582Z · comments (0)

[link] The origins of the steam engine: An essay with interactive animated diagrams
jasoncrawford · 2023-11-29T18:30:36.315Z · comments (1)

Interpreting the Learning of Deceit
RogerDearnaley (roger-d-1) · 2023-12-18T08:12:39.682Z · comments (14)

[question] What are things you're allowed to do as a startup?
Elizabeth (pktechgirl) · 2024-06-20T00:01:59.257Z · answers+comments (9)

[link] When scientists consider whether their research will end the world
Harlan · 2023-12-19T03:47:06.645Z · comments (4)

Differential Optimization Reframes and Generalizes Utility-Maximization
J Bostock (Jemist) · 2023-12-27T01:54:22.731Z · comments (2)

[link] There is no IQ for AI
Gabriel Alfour (gabriel-alfour-1) · 2023-11-27T18:21:26.196Z · comments (10)

The Third Gemini
Zvi · 2024-02-20T19:50:05.195Z · comments (2)

Putting multimodal LLMs to the Tetris test
Lovre · 2024-02-01T16:02:12.367Z · comments (5)

Two Tales of AI Takeover: My Doubts
Violet Hour · 2024-03-05T15:51:05.558Z · comments (8)

Some comments on intelligence
Viliam · 2024-08-01T15:17:07.215Z · comments (5)

AI #62: Too Soon to Tell
Zvi · 2024-05-02T15:40:04.364Z · comments (8)

Sparse MLP Distillation
slavachalnev · 2024-01-15T19:39:02.926Z · comments (3)

[link] Debate helps supervise human experts [Paper]
habryka (habryka4) · 2023-11-17T05:25:17.030Z · comments (6)

[question] What Other Lines of Work are Safe from AI Automation?
RogerDearnaley (roger-d-1) · 2024-07-11T10:01:12.616Z · answers+comments (35)

Results from the Turing Seminar hackathon
Charbel-Raphaël (charbel-raphael-segerie) · 2023-12-07T14:50:38.377Z · comments (1)

A Common-Sense Case For Mutually-Misaligned AGIs Allying Against Humans
Thane Ruthenis · 2023-12-17T20:28:57.854Z · comments (7)

AI #61: Meta Trouble
Zvi · 2024-05-02T18:40:03.242Z · comments (0)

“Clean” vs. “messy” goal-directedness (Section 2.2.3 of “Scheming AIs”)
Joe Carlsmith (joekc) · 2023-11-29T16:32:30.068Z · comments (1)

Weekly newsletter for AI safety events and training programs
Bryce Robertson (bryceerobertson) · 2024-05-03T00:33:29.418Z · comments (0)

[link] My MATS Summer 2023 experience
James Chua (james-chua) · 2024-03-20T11:26:14.944Z · comments (0)

Throughput vs. Latency
alkjash · 2024-01-12T21:37:07.632Z · comments (2)

Deception Chess: Game #2
Zane · 2023-11-29T02:43:22.375Z · comments (17)

D&D.Sci Hypersphere Analysis Part 1: Datafields & Preliminary Analysis
aphyer · 2024-01-13T20:16:39.480Z · comments (1)

[link] What fuels your ambition?
Cissy · 2024-01-31T18:30:53.274Z · comments (1)

Paper Summary: Princes and Merchants: European City Growth Before the Industrial Revolution
Jeffrey Heninger (jeffrey-heninger) · 2024-07-15T21:30:04.043Z · comments (1)

Aggregative Principles of Social Justice
Cleo Nardo (strawberry calm) · 2024-06-05T13:44:47.499Z · comments (10)

Recent comments

seth-herd on Project Adequate: Seeking Cofounders/Funders

I missed this until I finally got around to responding to your last post, which I'd put on my todo list.

I applaud your initiative and drive! I do think it's a tough pitch to try to leapfrog the fast progress in deep networks. Nor do I think the alignment picture for those types of systems is nearly as bleak as Yudkowsky & the old school believe. But neither is it likely to be easy enough to leave to chance and those who don't fully grasp the seriousness of the problem. I've written about some of the most common Cruxes of disagreement on alignment difficulty.

So I'd suggest you would have better odds working within the ML framework that is happening with or without your help. I also think that even if you do produce a near-miraculous breakthrough in symbolic GOFAI, Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc.

OTOH, if you have a really insightful approach, and a good reason to think the result would be easier to align than language model agents, maybe pursuing that path makes sense, since no one else is doing exactly that. As I said in my comment on your last request for directions, I think there are higher-expected-value nearly-as-underserved routes to survival; namely, working on alignment for the LLM agents that are our most likely route to first AGIs at this point (focusing on different routes from aligning the base LLMs, which is common but inadequate).

I'm also happy to talk. Your devotion to the project is impressive, and a resource not to be wasted!

mu_-negative on What TMS is like

Took me a while to get back to this question. I didn't know the answer so I looked up some papers. The short answer is that knowing this requires long follow-up periods, which studies are generally not good at, so we don't have great answers. Definitely a significant number of people don't stay better.

The longer answer is, probably about half of people need some form of maintenance treatment to stay non-depressed for more than a year, but our view of this is very confounded. Some studies have used normal antidepressant medications for maintenance, and some studies have tried additional rounds of TMS, both of which work really well. Up to a third of patients experience "symptom worsening" meaning that after an initial improvement from TMS, their symptoms actually get worse than when they started, but apparently more TMS can fix this in most people? I wasn't completely sure what they were saying here. So yeah, it isn't great. A lot of people need maintenance of some kind. This could very well correlate with whether your depression is the "life circumstances" kind or the "intrinsic brain chemistry" kind, not that we have a great handle on differentiating those two either.

Furthermore, (1) there are a few modes of TMS therapy out there, including most notably the accelerated course, and there may be different relapse rates across these treatment modes. There is some handwaving that the accelerated course may be more effective in this regard but I don't think we know yet. And (2) another important issue with interpreting these data is that many of the studies are done on people who are treatment resistant, such as yourself. It's unclear how much the results translate to the general population of depressed people.

Overall this is probably not a very satisfying answer; I don't really have the specialist inside view on this one.

FYI the most targeted paper I found on this topic is the citation below. Note that it's from 2016. There is probably something more recent; I just didn't have more time to dig.

Sackeim, H. A. (2016). Acute continuation and maintenance treatment of major depressive episodes with transcranial magnetic stimulation. Brain Stimulation: Basic, Translational, and Clinical Research in Neuromodulation, 9(3), 313-319.

kave on Why imperfect adversarial robustness doesn't doom AI control

I don't think this distinction is robust enough to rely on as much of a defensive property. I think it's probably not that hard to think "I probably would have tried something in direction X, or direction Y", and then gather lots of bits about how well the clusters X and Y work.

cousin_it on Social events with plausible deniability

Oh right.

seth-herd on One person's worth of mental energy for AI doom aversion jobs. What should I do?

Did you make any progress on choosing a course? My brief pitch is this: LLM agents are our most likely route to AGI, and particularly likely in short timelines. Aligning them is not the same as aligning the base LLMs. Yet almost no one is working on bridging that gap.

That's what I'm working on. More can be found in my user profile.

I do think this is high prospective impact. I'm not sure what you mean by low prospective risk. I think the work has good odds of being at least somewhat useful, since it's so neglected and it's pretty commonly agreed that language model agents (or foundation model agents or LLM cognitive architectures) are a pretty likely path to first AGI.

I'm happy to talk more. I meant to respond here sooner.

seth-herd on What (if anything) made your p(doom) go down in 2024?

The recent rumors about slowed progress in large training runs have reduced my p(doom). More time to prepare for AGI raises our odds. This probably won't be a large delay. This is combined with the observation that inference-time compute does also scale results, but it probably doesn't scale them that fast - the graph released with o1 preview didn't include units on the cost/compute axis.

More than that, my p(doom) went steadily down as I kept contemplating instruction-following as the central alignment goal. I increasingly think it's the obvious thing to try once you're actually contemplating launching an AGI that could become smarter than you; and it's a huge benefit to any technical alignment scheme, since it offers the advantages of corrigibility, allowing you to correct some alignment errors.

More on that logic in Instruction-following AGI is easier and more likely than value aligned AGI.

johnburidan on OpenAI Email Archives (from Musk v. Altman)

From a historical perspective this is an excellent treasure cache. Truly, when you are at the cutting edge of something, ideas, relationships, personality, and economics all come together to drive history.

radford-neal-1 on Social events with plausible deniability

Then you know that someone who voiced opinion A that you put in the hat, and also opinion B, likely actually believes opinion B.

(There's some slack from the possibility that someone else put opinion B in the hat.)

marius-hobbhahn on Training AI agents to solve hard problems could lead to Scheming

Some questions and responses:
1. What if you want the AI to solve a really hard problem? You don't know how to solve it, so you cannot give it detailed instructions. It's also so hard that the AI cannot solve it without learning new things -> you're back to the story above. The story also just started with someone instructing the model to "cure cancer".
2. Instruction following models are helpful-only. What do you do about the other two H's? Do you trust the users to only put in good instructions? I guess you do want to have some side constraints baked into its personality and these can function like goals. Many of the demonstrations that we have for scheming are cases where the model is too much of a saint, i.e. it schemes for the right cause. For example, it might be willing to deceive its developers if we provide it with strong reasons that they have non-HHH goals. I'm not really sure what to make of this. I guess it's good that it cares about being harmless and honest, but it's also a little bit scary that it cares so much.

My best guess for how the approach should look is that some outcome-based RL will be inevitable if we want to unlock the benefits; we just have to hammer the virtues of being non-scheming and non-power-seeking into it at all points of the training procedure. And we then have to add additional lines of defense like control, interpretability, scalable oversight, etc. and think hard about how we minimize correlated failures. But I feel like right now, we don't really have the right tools, model organisms, and evals to establish whether any of these lines of defense actually reduce the problem.

leogao on "It's a 10% chance which I did 10 times, so it should be 100%"

related: https://xkcd.com/217/
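
For readers who want the arithmetic behind the post title this comment replies to, here is a minimal sketch. It assumes the ten attempts are independent (the title does not say so), and the function name `p_at_least_one` is made up for illustration. Under that assumption, the chance of at least one success is one minus the chance that every attempt fails, which comes out well short of 100%.

```python
def p_at_least_one(p: float, n: int) -> float:
    """Probability of at least one success in n trials.

    Assumes the trials are independent and each succeeds with probability p.
    """
    # All n trials fail with probability (1 - p) ** n; take the complement.
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    # A 10% chance tried 10 times gives roughly 65%, not 100%.
    print(f"{p_at_least_one(0.1, 10):.4f}")
```

Simply adding the probabilities (10 x 10% = 100%) counts the overlapping cases where more than one attempt succeeds several times over, which is why the sum overshoots.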