LessWrong 2.0 Reader

[link] Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Zac Hatfield-Dodds (zac-hatfield-dodds) · 2023-10-05T21:01:39.767Z · comments (21)
Book Review: Going Infinite
Zvi · 2023-10-24T15:00:02.251Z · comments (109)
Alignment Implications of LLM Successes: a Debate in One Act
Zack_M_Davis · 2023-10-21T15:22:23.053Z · comments (50)
Announcing MIRI’s new CEO and leadership team
Gretta Duleba (gretta-duleba) · 2023-10-10T19:22:11.821Z · comments (52)
Thoughts on responsible scaling policies and regulation
paulfchristiano · 2023-10-24T22:21:18.341Z · comments (33)
We're Not Ready: thoughts on "pausing" and responsible scaling policies
HoldenKarnofsky · 2023-10-27T15:19:33.757Z · comments (33)
Labs should be explicit about why they are building AGI
peterbarnett · 2023-10-17T21:09:20.711Z · comments (16)
Announcing Timaeus
Jesse Hoogland (jhoogland) · 2023-10-22T11:59:03.938Z · comments (15)
AI as a science, and three obstacles to alignment strategies
So8res · 2023-10-25T21:00:16.003Z · comments (79)
Architects of Our Own Demise: We Should Stop Developing AI
Roko · 2023-10-26T00:36:05.126Z · comments (74)
[link] President Biden Issues Executive Order on Safe, Secure, and Trustworthy Artificial Intelligence
Tristan Williams (tristan-williams) · 2023-10-30T11:15:38.422Z · comments (39)
Thomas Kwa's MIRI research experience
Thomas Kwa (thomas-kwa) · 2023-10-02T16:42:37.886Z · comments (52)
RSPs are pauses done right
evhub · 2023-10-14T04:06:02.709Z · comments (70)
Evaluating the historical value misspecification argument
Matthew Barnett (matthew-barnett) · 2023-10-05T18:34:15.695Z · comments (140)
Holly Elmore and Rob Miles dialogue on AI Safety Advocacy
jacobjacob · 2023-10-20T21:04:32.645Z · comments (30)
Announcing Dialogues
Ben Pace (Benito) · 2023-10-07T02:57:39.005Z · comments (51)
LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B
Simon Lermen (dalasnoin) · 2023-10-12T19:58:02.119Z · comments (29)
[link] Will no one rid me of this turbulent pest?
Metacelsus · 2023-10-14T15:27:21.497Z · comments (23)
[link] Comp Sci in 2027 (Short story by Eliezer Yudkowsky)
sudo · 2023-10-29T23:09:56.730Z · comments (22)
Comparing Anthropic's Dictionary Learning to Ours
Robert_AIZI · 2023-10-07T23:30:32.402Z · comments (8)
At 87, Pearl is still able to change his mind
rotatingpaguro · 2023-10-18T04:46:29.339Z · comments (15)
Response to Quintin Pope's Evolution Provides No Evidence For the Sharp Left Turn
Zvi · 2023-10-05T11:39:02.393Z · comments (29)
Graphical tensor notation for interpretability
Jordan Taylor (Nadroj) · 2023-10-04T08:04:33.341Z · comments (11)
Don't Dismiss Simple Alignment Approaches
Chris_Leong · 2023-10-07T00:35:26.789Z · comments (9)
The 99% principle for personal problems
Kaj_Sotala · 2023-10-02T08:20:07.379Z · comments (20)
Goodhart's Law in Reinforcement Learning
jacek (jacek-karwowski) · 2023-10-16T00:54:11.669Z · comments (22)
Stampy's AI Safety Info soft launch
steven0461 · 2023-10-05T22:13:04.632Z · comments (9)
Revealing Intentionality In Language Models Through AdaVAE Guided Sampling
jdp · 2023-10-20T07:32:28.749Z · comments (14)
unRLHF - Efficiently undoing LLM safeguards
Pranav Gade (pranav-gade) · 2023-10-12T19:58:08.811Z · comments (15)
I Would Have Solved Alignment, But I Was Worried That Would Advance Timelines
307th · 2023-10-20T16:37:46.541Z · comments (32)
[link] Responsible Scaling Policies Are Risk Management Done Wrong
simeon_c (WayZ) · 2023-10-25T23:46:34.247Z · comments (33)
[link] A new intro to Quantum Physics, with the math fixed
titotal (lombertini) · 2023-10-29T15:11:27.168Z · comments (22)
[link] The Witching Hour
Richard_Ngo (ricraz) · 2023-10-10T00:19:37.786Z · comments (0)
Apply for MATS Winter 2023-24!
Rocket (utilistrutil) · 2023-10-21T02:27:34.350Z · comments (6)
Charbel-Raphaël and Lucius discuss Interpretability
Mateusz Bagiński (mateusz-baginski) · 2023-10-30T05:50:34.589Z · comments (7)
TOMORROW: the largest AI Safety protest ever!
Holly_Elmore · 2023-10-20T18:15:18.276Z · comments (25)
Programmatic backdoors: DNNs can use SGD to run arbitrary stateful computation
Fabien Roger (Fabien) · 2023-10-23T16:37:45.611Z · comments (3)
What's up with "Responsible Scaling Policies"?
habryka (habryka4) · 2023-10-29T04:17:07.839Z · comments (8)
Truthseeking when your disagreements lie in moral philosophy
Elizabeth (pktechgirl) · 2023-10-10T00:00:04.130Z · comments (4)
What's Hard About The Shutdown Problem
johnswentworth · 2023-10-20T21:13:27.624Z · comments (31)
I don’t find the lie detection results that surprising (by an author of the paper)
JanB (JanBrauner) · 2023-10-04T17:10:51.262Z · comments (8)
[question] Lying to chess players for alignment
Zane · 2023-10-25T17:47:15.033Z · answers+comments (54)
Value systematization: how values become coherent (and misaligned)
Richard_Ngo (ricraz) · 2023-10-27T19:06:26.928Z · comments (47)
Symbol/Referent Confusions in Language Model Alignment Experiments
johnswentworth · 2023-10-26T19:49:00.718Z · comments (44)
Trying to understand John Wentworth's research agenda
johnswentworth · 2023-10-20T00:05:40.929Z · comments (11)
[link] Linkpost: They Studied Dishonesty. Was Their Work a Lie?
Linch · 2023-10-02T08:10:51.857Z · comments (12)
Open Source Replication & Commentary on Anthropic's Dictionary Learning Paper
Neel Nanda (neel-nanda-1) · 2023-10-23T22:38:33.951Z · comments (12)
[link] Linkpost: A Post Mortem on the Gino Case
Linch · 2023-10-24T06:50:42.896Z · comments (7)
[link] Techno-humanism is techno-optimism for the 21st century
Richard_Ngo (ricraz) · 2023-10-27T18:37:39.776Z · comments (5)
Improving the Welfare of AIs: A Nearcasted Proposal
ryan_greenblatt · 2023-10-30T14:51:35.901Z · comments (5)