LessWrong 2.0 Reader

Can Current LLMs be Trusted To Produce Paperclips Safely?
Rohit Chatterjee (rohit-c) · 2024-08-19T17:17:07.530Z · comments (0)

It is time to start war gaming for AGI
yanni kyriacos (yanni) · 2024-10-17T05:14:17.932Z · comments (1)

Interest poll: A time-waster blocker for desktop Linux programs
nahoj · 2024-08-22T20:44:04.479Z · comments (5)

Madrid - ACX Meetups Everywhere Fall 2024
Pablo Villalobos (pvs) · 2024-08-05T18:36:55.136Z · comments (0)

Bellevue-Redmond USA - ACX Meetups Everywhere Fall 2024
Cedar (xida-ren) · 2024-08-29T18:43:57.014Z · comments (8)

Jailbreaking ChatGPT and Claude using Web API Context Injection
Jaehyuk Lim (jason-l) · 2024-10-21T21:34:37.579Z · comments (0)

Likelihood calculation with duobels
Martin Gerdes (martin-gerdes) · 2024-10-01T16:21:01.268Z · comments (0)

[link] A Logical Proof for the Emergence and Substrate Independence of Sentience
rife (edgar-muniz) · 2024-10-24T21:08:09.398Z · comments (31)

[question] Is there a known method to find others who came across the same potential infohazard without spoiling it to the public?
hive · 2024-10-17T10:47:05.099Z · answers+comments (6)

[question] How do you follow AI (safety) news?
PeterH · 2024-09-24T13:58:48.916Z · answers+comments (2)

[link] Predictions as Public Works Project — What Metaculus Is Building Next
ChristianWilliams · 2024-10-22T16:35:13.999Z · comments (0)

Building Safer AI from the Ground Up: Steering Model Behavior via Pre-Training Data Curation
Antonio Clarke (antonio-clarke) · 2024-09-29T18:48:23.308Z · comments (0)

For Limited Superintelligences, Epistemic Exclusion is Harder than Robustness to Logical Exploitation
Lorec · 2024-09-15T20:49:06.370Z · comments (9)

[question] Calibration training for 'percentile rankings'?
david reinstein (david-reinstein) · 2024-09-14T21:51:55.705Z · answers+comments (0)

[link] Thoughts On Democracy
Zero Contradictions · 2024-08-04T06:02:07.601Z · comments (0)

Collapsing “Collapsing the Belief/Knowledge Distinction”
Jeremias (jeremias-sur) · 2024-09-20T16:11:33.558Z · comments (0)

On Measuring Intellectual Performance - personal experience and several thoughts
Alexander Gufan (alexander-gufan) · 2024-09-20T17:21:19.747Z · comments (2)

MIT FutureTech are hiring for a Technical Associate role
peterslattery · 2024-09-09T20:16:49.299Z · comments (0)

[link] Levers for Biological Progress - A Response to "Machines of Loving Grace"
Niko_McCarty (niko-2) · 2024-11-01T16:35:08.221Z · comments (0)

[question] Is it Legal to Maintain Turing Tests using Data Poisoning, and would it work?
Double · 2024-09-05T00:35:39.504Z · answers+comments (9)

Dallas USA - ACX Meetups Everywhere Fall 2024
ethanmorse · 2024-08-29T18:43:37.972Z · comments (0)

San Francisco ACX Meetup “First Saturday”
Nate Sternberg (nate-sternberg) · 2024-09-29T03:13:34.615Z · comments (0)

St. Paul USA - ACX Meetups Everywhere Fall 2024
25Hour (aaron-kaufman) · 2024-08-29T18:42:21.899Z · comments (3)

Cambridge USA - ACX Meetups Everywhere Fall 2024
Screwtape · 2024-08-29T18:42:06.849Z · comments (0)

San Jose USA - ACX Meetups Everywhere Fall 2024
David Friedman (david-friedman) · 2024-08-29T18:40:36.215Z · comments (0)

San Francisco USA - ACX Meetups Everywhere Fall 2024
Andrew Gaul (andrew gaul) · 2024-08-29T18:40:30.097Z · comments (0)

Berkeley USA - ACX Meetups Everywhere Fall 2024
Screwtape · 2024-08-29T18:39:50.532Z · comments (0)

Huntsville USA - ACX Meetups Everywhere Fall 2024
blackstampede · 2024-08-29T18:39:37.288Z · comments (0)

London United Kingdom - ACX Meetups Everywhere Fall 2024
Edward Saperia (edward saperia) · 2024-08-29T18:38:55.958Z · comments (0)

Near-death experiences
Declan Molony (declan-molony) · 2024-10-08T06:34:04.107Z · comments (1)

Hamiltonian Dynamics in AI: A Novel Approach to Optimizing Reasoning in Language Models
Javier Marin Valenzuela (javier-marin-valenzuela) · 2024-10-09T19:14:56.162Z · comments (0)

Hamburg Germany - ACX Meetups Everywhere Fall 2024
Gunnar (gunnar ) · 2024-08-29T18:37:11.622Z · comments (0)

AI Compute governance: Verifying AI chip location
Farhan · 2024-10-12T17:36:45.942Z · comments (0)

Personal Philosophy
Xor · 2024-10-13T03:01:59.324Z · comments (0)

Copenhagen Denmark - ACX Meetups Everywhere Fall 2024
SoerenE · 2024-08-29T18:36:15.414Z · comments (0)

Prague Czech Republic - ACX Meetups Everywhere Fall 2024
Jiří Nádvorník (jiri-nadvornik) · 2024-08-29T18:36:11.861Z · comments (0)

Auckland New Zealand - ACX Meetups Everywhere Fall 2024
Mark Gilmour (mark-gilmour) · 2024-08-29T18:35:31.852Z · comments (0)

On the Practical Applications of Interpretability
Nick Jiang (nick-jiang) · 2024-10-15T17:18:25.280Z · comments (0)

Bellevue Meetup
Cedar (xida-ren) · 2024-10-16T01:07:58.761Z · comments (0)

[question] EndeavorOTC legit?
FinalFormal2 · 2024-10-17T01:33:12.606Z · answers+comments (0)

Cape Town South Africa - ACX Meetups Everywhere Fall 2024
moyamo · 2024-08-29T18:28:24.579Z · comments (0)

Leverage points for a pause
Remmelt (remmelt-ellen) · 2024-08-28T09:21:17.593Z · comments (0)

[link] Podcast discussing Hanson's Cultural Drift Argument
vaishnav92 · 2024-10-20T17:58:41.416Z · comments (0)

Transformers Explained (Again)
RohanS · 2024-10-22T04:06:33.646Z · comments (0)

[question] Should LW suggest standard metaprompts?
Dagon · 2024-08-21T16:41:07.757Z · answers+comments (6)

Vilnius – ACX Meetups Everywhere Fall 2024
NoUsernameSelected · 2024-08-19T17:38:12.378Z · comments (1)

Playing Minecraft with a Superintelligence
Johannes C. Mayer (johannes-c-mayer) · 2024-08-17T22:47:42.767Z · comments (0)

Alignment from equivariance
hamishtodd1 · 2024-08-13T21:09:11.849Z · comments (0)

Interview with Bill O’Rourke - Russian Corruption, Putin, Applied Ethics, and More
JohnGreer · 2024-10-27T17:11:28.891Z · comments (0)

Your memory eventually drives confidence in each hypothesis to 1 or 0
Crazy philosopher (commissar Yarrick) · 2024-10-28T09:00:27.084Z · comments (6)

Recent comments

tapatakt on I turned decision theory problems into memes about trolleys

I really like it! One remark, though: the two upper tracks must be swapped; otherwise it's possible to precommit by staying in place and not running to the lever.

anthonyc on Both-Sidesism—Where Fair & Balanced Goes Wrong

TBH one of the things I always wonder about is not so much the "sidesism" as the "both." How are people deciding what should count as a side, and why there should be two? And when should something no longer count as a side?

I mean, I get that in practice there's nothing this self-reflective going on, and it's all decided by inertia, FPTP voting, and revenue. Still, I would naively have expected more people on the audience side to realize that:

this didn’t seem to fit into the Headmaster’s list; and it occurred to Hermione that there might be a lot more viewpoints on the subject than just four.

tailcalled on What can we learn from insecure domains?

No, it was a lot of words describing why your strategy of modelling stuff as more/less "dangerous" and then trying to calibrate how scared to be of "dangerous" stuff doesn't work.

The better strategy, if you want to pursue this general line of argument, is to make the strongest argument you can for what makes e.g. Bitcoin so dangerous and how horrible the consequences will be. Then since your sense of danger overestimates how dangerous Bitcoin will be, you can go in and empirically investigate where your intuition was wrong by seeing what predictions of your intuitive argument failed and what obstacles caused them to fail.

logan-zoellner on What can we learn from insecure domains?

That was a lot of words to say "I don't think anything can be learned here".

Personally, I think something can be learned here.

tailcalled on What can we learn from insecure domains?

You shouldn't use "dangerous" or "bad" as a latent variable because it promotes splitting. MAD and Bitcoin have fundamentally different operating principles (e.g. nuclear fission vs cryptographic pyramid schemes), and these principles lead to a mosaic of different attributes. If you ignore the operating principles and project down to a bad/good axis, then you can form some heuristics about what to seek out or avoid, but you face severe model misspecification, violating principles like realizability which are required for Bayesian inference to get reasonable results (e.g. converge rather than oscillate, and be well-calibrated rather than massively overconfident).

Once you understand the essence of what makes a domain seem dangerous to you, you can debug by looking at what obstacles this essence faced that stopped it from flowing into whatever horrors you were worried about, and then try to think through why you didn't realize those obstacles ahead of time. As you learn more about the factors relevant in those cases, maybe you will learn something that generalizes across cases, but most realistically what you learn will be about the problems with common sense.

anthonyc on Science advances one funeral at a time

I think the issue here is not so much the disagreement or criticism as it is the mockery and ostracism. Unlike in, say, venture capital, there's much less opportunity in science for someone to try something different and exciting, get enough funding to see if it really works out, and then, if it doesn't but they were doing a good enough job trying, still be part of the community and get funding to try something else. (Yes, I know it doesn't always work that way in venture capital either, but I think the odds are much better than in science.)

vanessa-kosoy on Complete Feedback

I feel that this post would benefit from having the math spelled out. How is inserting a trader a way to do feedback? Can you phrase classical RL like this?

logan-zoellner on What can we learn from insecure domains?

MAD is obviously governed by completely different principles than crypto is

Maybe this is obvious to you. It is not obvious to me. I am genuinely confused about what is going on here. I see what seems to be a pattern: dangerous domain -> basically okay. And I want to know what's going on.

anthonyc on Two arguments against longtermist thought experiments

I appreciate the discussion, but I can't help but be distracted by the specifics of the example scenario. In this case, it just seems obvious to me that the correct answer is to bury the waste and then invest in developing better processing solutions. There's no such thing as waste that can't be safely processed, even in principle, with a century of lead time to prepare. When I read the first few sentences, I actually thought the counterargument was going to be about uncertainty in long term impact projections.

arthur-conmy on IAPS: Mapping Technical Safety Research at AI Companies

Here are the other GDM mech interp papers missed:
We have some blog posts of comparable standard to the Anthropic circuit updates listed:
- https://www.alignmentforum.org/posts/C5KAZQib3bzzpeyrg/full-post-progress-update-1-from-the-gdm-mech-interp-team
- https://www.alignmentforum.org/posts/iGuwZTHWb6DFY3sKB/fact-finding-attempting-to-reverse-engineer-factual-recall
You use a very wide scope for the "enhancing human feedback" category (basically any post-training paper that mentions 'align'-ing anything). So I will use a wide scope for what counts as mech interp and also include:
- https://arxiv.org/abs/2401.06102
- https://arxiv.org/abs/2304.14767
- There are a few other papers from the PAIR group as well as Mor Geva and also Been Kim, but mostly with Google Research affiliations so it seems fine to not include these as IIRC you weren't counting pre-GDM merger Google Research/Brain work