LessWrong 2.0 Reader


← previous page (newer posts) · next page (older posts) →

ARENA 2.0 - Impact Report
CallumMcDougall (TheMcDouglas) · 2023-09-26T17:13:19.952Z · comments (5)

Mechanistic Interpretability Reading group
1stuserhere (firstuser-here) · 2023-09-26T16:26:44.757Z · comments (0)

Announcing the CNN Interpretability Competition
scasper · 2023-09-26T16:21:50.276Z · comments (0)

Making AIs less likely to be spiteful
Nicolas Macé (NicolasMace) · 2023-09-26T14:12:06.202Z · comments (2)

[link] [Linkpost] Mark Zuckerberg confronted about Meta's Llama 2 AI's ability to give users detailed guidance on making anthrax - Business Insider
mic (michael-chen) · 2023-09-26T12:05:57.396Z · comments (11)

Enforcing Far-Future Contracts for Governments
FCCC · 2023-09-26T04:26:46.442Z · comments (49)

Carioca Petrov Day
Giskard (tiago-macedo) · 2023-09-26T00:30:36.906Z · comments (0)

[question] A few Alignment questions: utility optimizers, SLT, sharp left turn and identifiability
Igor Timofeev (igor-timofeev-1) · 2023-09-26T00:27:23.229Z · answers+comments (1)

Impact stories for model internals: an exercise for interpretability researchers
jenny · 2023-09-25T23:15:29.189Z · comments (3)

[link] Autonomic Sanity
Sable · 2023-09-25T22:37:07.262Z · comments (9)

[question] What is wrong with this "utility switch button problem" approach?
Donald Hobson (donald-hobson) · 2023-09-25T21:36:47.166Z · answers+comments (3)

You should just smile at strangers a lot
chaosmage · 2023-09-25T20:12:56.907Z · comments (10)

[link] The King and the Golem
Richard_Ngo (ricraz) · 2023-09-25T19:51:22.980Z · comments (15)

[link] Public Opinion on AI Safety: AIMS 2023 and 2021 Summary
Jacy Reese Anthis (Jacy Reese) · 2023-09-25T18:55:41.532Z · comments (2)

Welcome to Apply: The 2024 Vitalik Buterin Fellowships in AI Existential Safety by FLI!
Zhijing Jin · 2023-09-25T18:42:13.320Z · comments (2)

Evaluating hidden directions on the utility dataset: classification, steering and removal
Annah (annah) · 2023-09-25T17:19:13.988Z · comments (3)

Linkpost: A model of biases as arising from meta-beliefs
JuanGarcia · 2023-09-25T17:14:55.538Z · comments (0)

[question] What causes a decision theory to be used?
Dagon · 2023-09-25T16:33:36.161Z · answers+comments (2)

[link] Understanding strategic deception and deceptive alignment
Marius Hobbhahn (marius-hobbhahn) · 2023-09-25T16:27:47.357Z · comments (16)

[link] The Merits of Contrarianism & Why I hate Chatbots. [My Experience with the Ideological Turing Test @ a Less Wrong meetup]
Amina V. (aminah-vinson) · 2023-09-25T16:13:04.113Z · comments (1)

Inside Views, Impostor Syndrome, and the Great LARP
johnswentworth · 2023-09-25T16:08:17.040Z · comments (53)

“X distracts from Y” as a thinly-disguised fight over group status / politics
Steven Byrnes (steve2152) · 2023-09-25T15:18:18.644Z · comments (14)

[link] Amazon to invest up to $4 billion in Anthropic
Davis_Kingsley · 2023-09-25T14:55:35.983Z · comments (8)

Should Effective Altruists be Valuists instead of utilitarians?
spencerg · 2023-09-25T14:03:10.958Z · comments (3)

Feedly Breaks MathML
jefftk (jkaufman) · 2023-09-25T13:40:05.759Z · comments (3)

[question] How have you become more hard-working?
Chi Nguyen · 2023-09-25T12:37:39.860Z · answers+comments (40)

Automating Intelligence: A Cursory Glance at How AutoML Brings Precision to AI Development
[deleted] · 2023-09-25T09:39:31.338Z · comments (0)

[link] Categorization Hell
UtilityMonster (Matt Goldwater) · 2023-09-24T18:18:03.136Z · comments (0)

Interpreting OpenAI's Whisper
EllenaR · 2023-09-24T17:53:44.955Z · comments (10)

Contradiction Appeal Bias
onur · 2023-09-24T17:03:58.724Z · comments (2)

RAIN: Your Language Models Can Align Themselves without Finetuning - Microsoft Research 2023 - Reduces the adversarial prompt attack success rate from 94% to 19%!
Singularian2501 (maik-zywitza) · 2023-09-24T16:48:18.360Z · comments (0)

Honor System for Vaccination?
jefftk (jkaufman) · 2023-09-24T11:50:05.809Z · comments (22)

Far-Future Commitments as a Policy Consensus Strategy
FCCC · 2023-09-24T06:34:55.505Z · comments (40)

Five neglected work areas that could reduce AI risk
[deleted] · 2023-09-24T02:03:29.829Z · comments (5)

[question] Are the other Rationality: A-Z sequences coming out as books?
caffeinated_dissonance (alex-goldstein) · 2023-09-24T00:38:51.939Z · answers+comments (2)

The Dick Kick'em Paradox
Augs SMSHacks (augs-smshacks) · 2023-09-23T22:22:06.827Z · comments (21)

I designed an AI safety course (for a philosophy department)
Eleni Angelou (ea-1) · 2023-09-23T22:03:00.036Z · comments (15)

[link] Paper: LLMs trained on “A is B” fail to learn “B is A”
lberglund (brglnd) · 2023-09-23T19:55:53.427Z · comments (73)

Sparse Coding, for Mechanistic Interpretability and Activation Engineering
David Udell · 2023-09-23T19:16:31.772Z · comments (7)

[question] Places to meet interesting middle-aged men?
anon_girl · 2023-09-23T19:06:48.829Z · answers+comments (7)

Taking features out of superposition with sparse autoencoders more quickly with informed initialization
Pierre Peigné (pierre-peigne) · 2023-09-23T16:21:42.799Z · comments (8)

A quick remark on so-called “hallucinations” in LLMs and humans
Bill Benzon (bill-benzon) · 2023-09-23T12:17:26.600Z · comments (4)

Hand-writing MathML
jefftk (jkaufman) · 2023-09-23T11:20:07.870Z · comments (40)

Musk, Starlink, and Crimea
NicholasKross · 2023-09-23T02:35:02.623Z · comments (0)

[link] [Linkpost/Video] All The Times We Nearly Blew Up The World
Jacob G-W (g-w1) · 2023-09-23T01:18:03.008Z · comments (1)

Luck based medicine: inositol for anxiety and brain fog
Elizabeth (pktechgirl) · 2023-09-22T20:10:07.117Z · comments (5)

If influence functions are not approximating leave-one-out, how are they supposed to help?
Fabien Roger (Fabien) · 2023-09-22T14:23:45.847Z · comments (4)

Modeling p(doom) with TrojanGDP
K. Liam Smith (Liam Smith) · 2023-09-22T14:19:31.437Z · comments (2)

Let's talk about Impostor syndrome in AI safety
Igor Ivanov (igor-ivanov) · 2023-09-22T13:51:18.482Z · comments (4)

Fund Transit With Development
jefftk (jkaufman) · 2023-09-22T11:10:05.645Z · comments (22)



Recent comments

cata on LessOnline Festival Updates Thread

That just sounds great, thanks.

niplav on LessOnline Festival Updates Thread

That's unfortunate that you are less likely to come, and I'm glad to get the feedback. I could primarily reply with reasons why I think it was the right call (e.g. helpful for getting the event off the ground, helpful for pinpointing the sort of ideas+writing the event is celebrating, I think it's prosocial for me to be open about info like this generally, etc) but I don't think that engages with the fact that it left you personally less likely to come. I still overall think if the event sounds like a good time to you (e.g. interesting conversations with people you'd like to talk to and/or exciting activities) and it's worth the cost to you then I hope you come :-)

Maybe to clarify my comment: I was merely describing my (non-endorsed^[1]) observed emotional content wrt the festival, and my intention with the comment was not to wag my finger at you guys in the manner of "you didn't invite me".

I wonder whether other people have a similar emotional reaction.

I appreciate Lightcone being open with the information around free invitations though! I think I'd have bought a ticket anyway if I had time around that weekend, and I think I'd probably have a blast if I attended.

Btw: What's the chance of a 2nd LessOnline?

[1] I think my reaction is super bound up in icky status-grabbing/status-desiring/inner-ring-infiltrating parts of my psyche which I'm not happy with.

nevin-wetherill on Why I'm not doing PauseAI

I'm pretty sure GPT-N won't be able to do it, assuming they follow the same paradigm.

I am curious if you would like to expand on this intuition? I do not share it, and it seems like one potential crux.

I do not share this intuition. I would hope that if I say a handful of words about synthetic data, that will be sufficient to move your imagination into a less certain condition regarding this assertion. I am tempted to try something else first.

Is this actually important to your argument? I do not see how it would end up factoring into this problem, except by more quickly obviating the advances made with understanding and steering LLM behavior. What difference does it make to the question of "stop" if instead of LLMs in a GPT wrapper, the thing that can in fact solve that task in blender is some RNN-generating/refining action-token optimizer?

"LLMs can't do X" doesn't mean X is going to take another 50 years. The field is red hot right now. In many ways new approaches to architecture are vastly easier to iterate on than new bio-sciences work, and those move blazingly fast compared to things like high energy/nuclear/particle physics experiments - and even those sometimes outpace regulatory bodies' abilities to assess and ensure safety. The first nuclear pile got built under some bleachers on a campus.

Even if you're fully in Gary Marcus's camp on criticism of the capabilities of LLMs, his prescriptions for fixing it don't rule out another approach qualitatively similar to transformers that isn't any better for making alignment easy. There's a gap in abstract conceptualization here, where we can - apparently - make things which represent useful algorithms while not having a solid grasp on the mechanics and abstract properties of those algorithms. The upshot of pausing is that we enter into a period of time where our mastery becomes deeper and broader while the challenges we are using it to address remain crisp, constrained, and well within a highly conservative safety-margin.

How is it obvious that we are far away in time? Certain emergency options like centralized compute resources under international monitoring are going to be on long critical paths, and if someone has a brilliant idea for [Self-Censored, To Avoid Being Called Dumb & Having That Be Actually True] and that thing destroys the world before you have all AI training happening in monitored data centers with some totally info-screened black-box fail-safes - then you end up not having a ton of "opportunity cost" compared to the counterfactual where you prevented the world getting paper-clipped because you were willing, in that counterfactual, to ever tell anyone "no, stop" with the force of law behind it.

Seriously,

by stopping AI progress, we lose all the good stuff that AI would lead to

... that's one side of the cost-benefit analysis over counterfactuals. Hesitance over losing even many billions of dollars in profits should not stop us from preventing the end of the world.

"The average return from the urn is irrelevant if you're not allowed to play anymore!" (quote @ 1:08:10, paraphrasing Nassim Taleb)

not having any reference AI to base our safety work on

Seems like another possible crux. This seems to imply that either there has been literally no progress on real alignment up to this point, or you are making a claim about the marginal returns on alignment work before having scary-good systems.

Like, the world I think I see is one where alignment has been sorely underfunded, but even prior to the ML revolution there was good alignment de-confusion work that got done. Having the entire conceptual framing of "alignment" and resources like Arbital's catalogue pre-2022 and "Concrete Problems in AI Safety" and a bunch of other things all seem like incremental progress towards making a world in which one could attempt to build an AI framework -> AGI instantiation -> ASI direct-causal-descendant and have that endeavor not essentially multiply human values by 0 on almost every dimension in the long run.

Why can't we continue this after liquid nitrogen gets poured onto ML until the whole thing freezes and shatters into people bickering about lost investments? Would we expect a better ratio of good/bad outcomes on our lottery prize wheel in 50 years after we've solved the "AI Pause Button Problem" and "Generalized Frameworks for Robust Corrigibility" and "Otherizing/Satisficing/Safe Maximization"? There seems to be a lot of blueprinting, rocket equation, Newtonian mechanics, astrophysics type work we can do even if people can't make 10 billion dollars 5 years from now selling GPT-6 powered products.

It's not that easy for an unassisted AI to do harm - especially existentially significant harm.

I am somewhat baffled by this intuition.

I suspect what's going on here is that the more harm something is proposed to be capable of, the less likely people think that it is.

Say you're driving fast down a highway, what do you think a split second after seeing a garbage truck pull out in front of you while you are traveling towards it with >150km/hr relative velocity? Say your brain could generate in that moment a totally reflectively coherent probability distribution over expected outcomes. Does the distribution go from the most probability mass in scenarios with the least harm to the least probability mass in scenarios with the most harm? "Ah, it's fine," you think, "it would be weird if this killed me instantly, less weird if I merely had a spinal injury, and even less weird if I simply broke my nose and bruised my sternum."

The underlying mechanism - the actual causal processes involved in determining the future arrangements of atoms or the amount of reality fluid in possible Everett Branch futures grouped by similarity in features - that's what you have to pay attention to. What you find difficult to plan for, or what you observe humans having difficulty planning for, does not mean you can map that same difficulty curve onto AI. AlphaZero did not experience the process of getting better at the games it played in the same way humanity experienced that process. It did not have to spend 26 IRL years learning the game painstakingly from traditions established over hundreds of IRL years - it did not have to struggle to sleep well and eat healthy and remain clean from vices and motivated in order to stay on task and perform at its peak capacity. It didn't even need to solve the problem perfectly - like representing "life and death" robustly - in order to actually beat the top humans and most (or all, modulo the controversy over Stockfish being tested in a suboptimal condition) of the top human-made engines.

It doesn't seem trivial for a certain value of the word "trivial." Still, I don't see how this consideration gives anyone much confidence in it qualitatively being "really tough" the way getting a rocket carrying humans to Mars is tough - where you don't one day just get the right lines of code into a machine and suddenly the cipher descrambles in 30 seconds when before it wouldn't happen no matter how many random humans you had try to guess the code or how many hours other naively written programs spent attempting to brute-force it.

Sometimes you just hit enter, kick a snowball at the top of a mountain, and 1s and 0s click away, and an avalanche comes down in a rush upon the schoolhouse 2 km below your skiing trail. The badness of the outcome didn't matter one bit to its probability of occurring in those real world conditions in which it occurred. The outcome depended merely on the actual properties of the physical universe, and what effects descend from which causes. See Beyond The Reach of God for an excellent extended meditation on this reality.

unexpectedvalues on Some Experiments I'd Like Someone To Try With An Amnestic

Yeah, that's my best guess. I have other memories from that period (which was late into the hour), so I think it was the drug wearing off, rather than learning effects.

chris_leong on Does reducing the amount of RL for a given capability level make AI safer?

Oh, this is a fascinating perspective.

So most uses of RL already just use a small bit of RL.

So if the goal was "only use a little bit of RL", that's already happening.

Hmm... I still wonder if using even less RL would be safer still.

michaeldickens on Which skincare products are evidence-based?

What do you think is the strongest evidence on sunscreen? I've read mixed things on its effectiveness.

carl-feynman on yanni's Shortform

This question is two steps removed from reality. Here’s what I mean by that. Putting brackets around each of the two steps:

what is the threshold that needs meeting [for the majority of people in the EA community] [to say something like] "it would be better if EAs didn't work at OpenAI"?

Without these steps, the question becomes

What is the threshold that needs meeting before it would be better if people didn’t work at OpenAI?

Personally, I find that a more interesting question. Is there a reason why the question is phrased at two removes like that? Or am I missing the point?

d0themath on D0TheMath's Shortform

Does the possibility of China or Russia being able to steal advanced AI from labs increase or decrease the chances of great power conflict?

An argument against: It counter-intuitively decreases the chances. Why? For the same reason that a functioning US ICBM defense system would be a destabilizing influence on the MAD equilibrium. In the ICBM defense circumstance, after the shield is put up America's enemies would have no credible threat of retaliation if the US were to launch a first strike. Therefore, there would be no reason (geopolitically) for America not to launch a first strike, and there would be quite the reason to launch one: namely, the shield definitely works for the present crop of ICBMs, but may not work for future ICBMs. Therefore America's enemies will assume that after the shield is put up, America will launch a first strike, and will seek to gain the advantage while they still have a chance by launching a pre-emptive first strike.

The same logic works in reverse. If Russia were building an ICBM defense shield, and would likely complete it within the year, we would feel very scared about what would happen after that shield is up.

And the same logic works for other irrecoverably large technological leaps in war. If the US is on the brink of developing highly militaristically capable AIs, China will fear what the US will do with them (imagine if the tables were turned, would you feel safe with Anthropic & OpenAI in China, and DeepMind in Russia?), so if they don't get their own versions they'll feel mounting pressure to secure their geopolitical objectives while they still can, or otherwise make themselves less subject to the threat of AI (would you not wish the US would sabotage the Chinese Anthropic & OpenAI by whatever means if China seemed on the brink?). The faster the development, the quicker the pressure will mount, and the more sloppy & rash China's responses will be. If it's easy for China to copy our AI technology, then the pressure mounts much more slowly.

bogdan-ionut-cirstea on Bogdan Ionut Cirstea's Shortform

Also, TC0 is very much limited, see e.g. this presentation.

bogdan-ionut-cirstea on Bogdan Ionut Cirstea's Shortform

I think I remember William Merrill (in a video) pointing out that the rational inputs assumption seems very unrealistic (would require infinite memory); and, from what I remember, https://arxiv.org/abs/2404.15758 and related papers made a different assumption about the number of bits of memory per parameter and per input.