LessWrong 2.0 Reader


Aversion Factoring
CFAR!Duncan (CFAR 2017) · 2022-07-07T16:09:11.392Z · comments (1)
A Pattern Language For Rationality
Vaniver · 2022-07-05T19:08:49.783Z · comments (14)
Which values are stable under ontology shifts?
Richard_Ngo (ricraz) · 2022-07-23T02:40:04.344Z · comments (48)
[link] A time-invariant version of Laplace's rule
Jsevillamol · 2022-07-15T19:28:15.877Z · comments (13)
Principles of Privacy for Alignment Research
johnswentworth · 2022-07-27T19:53:28.209Z · comments (30)
[link] NeurIPS ML Safety Workshop 2022
Dan H (dan-hendrycks) · 2022-07-26T15:28:52.441Z · comments (2)
Cognitive Risks of Adolescent Binge Drinking
Elizabeth (pktechgirl) · 2022-07-20T21:10:01.513Z · comments (12)
Avoid the abbreviation "FLOPs" – use "FLOP" or "FLOP/s" instead
Daniel_Eth · 2022-07-10T10:44:38.046Z · comments (13)
Abstracting The Hardness of Alignment: Unbounded Atomic Optimization
adamShimi · 2022-07-29T18:59:49.460Z · comments (3)
My vision of a good future, part I
Jeffrey Ladish (jeff-ladish) · 2022-07-06T01:23:01.074Z · comments (18)
Curating "The Epistemic Sequences" (list v.0.1)
Andrew_Critch · 2022-07-23T22:17:08.544Z · comments (12)
Applications are open for CFAR workshops in Prague this fall!
John Steidley (JohnSteidley) · 2022-07-19T18:29:19.172Z · comments (3)
Taste & Shaping
CFAR!Duncan (CFAR 2017) · 2022-07-10T05:50:14.416Z · comments (1)
What's next for instrumental rationality?
Andrew_Critch · 2022-07-23T22:55:06.185Z · comments (7)
Response to Blake Richards: AGI, generality, alignment, & loss functions
Steven Byrnes (steve2152) · 2022-07-12T13:56:00.885Z · comments (9)
Introducing the Fund for Alignment Research (We're Hiring!)
AdamGleave · 2022-07-06T02:07:47.965Z · comments (0)
[link] My Most Likely Reason to Die Young is AI X-Risk
AISafetyIsNotLongtermist · 2022-07-04T17:08:27.209Z · comments (24)
Conditioning Generative Models for Alignment
Jozdien · 2022-07-18T07:11:46.369Z · comments (8)
When Giving People Money Doesn’t Help
Zvi · 2022-07-07T13:00:00.879Z · comments (12)
A Bias Against Altruism
Lone Pine (conor-sullivan) · 2022-07-23T20:44:59.964Z · comments (30)
[link] Deep learning curriculum for large language model alignment
Jacob_Hilton · 2022-07-13T21:58:33.452Z · comments (3)
The Reader's Guide to Optimal Monetary Policy
Ege Erdil (ege-erdil) · 2022-07-25T15:10:51.010Z · comments (10)
Double Crux
CFAR!Duncan (CFAR 2017) · 2022-07-24T06:34:15.305Z · comments (9)
Deception?! I ain’t got time for that!
Paul Colognese (paul-colognese) · 2022-07-18T00:06:15.274Z · comments (5)
[AN #172] Sorry for the long hiatus!
Rohin Shah (rohinmshah) · 2022-07-05T06:20:03.943Z · comments (0)
Don't take the organizational chart literally
lc · 2022-07-21T00:56:28.561Z · comments (21)
Making decisions using multiple worldviews
Richard_Ngo (ricraz) · 2022-07-13T19:15:02.621Z · comments (10)
[link] Acceptability Verification: A Research Agenda
David Udell · 2022-07-12T20:11:34.986Z · comments (0)
Race Along Rashomon Ridge
Stephen Fowler (LosPolloFowler) · 2022-07-07T03:20:59.701Z · comments (15)
Outer vs inner misalignment: three framings
Richard_Ngo (ricraz) · 2022-07-06T19:46:50.902Z · comments (5)
Report from a civilizational observer on Earth
owencb · 2022-07-09T17:26:09.223Z · comments (12)
Comfort Zone Exploration
CFAR!Duncan (CFAR 2017) · 2022-07-15T21:18:14.033Z · comments (2)
Announcing the AI Safety Field Building Hub, a new effort to provide AISFB projects, mentorship, and funding
Vael Gates · 2022-07-28T21:29:52.424Z · comments (3)
Goal Alignment Is Robust To the Sharp Left Turn
Thane Ruthenis · 2022-07-13T20:23:58.962Z · comments (16)
Potato diet: A post mortem and an answer to SMTM's article
Épiphanie Gédéon (joy_void_joy) · 2022-07-14T23:18:26.691Z · comments (34)
The Alignment Problem
lsusr · 2022-07-11T03:03:03.271Z · comments (18)
Babysitting as Parenting Trial?
jefftk (jkaufman) · 2022-07-07T13:20:04.129Z · comments (19)
[link] The Most Important Century: The Animation
Writer · 2022-07-24T20:58:55.869Z · comments (2)
Eavesdropping on Aliens: A Data Decoding Challenge
anonymousaisafety · 2022-07-24T04:35:40.880Z · comments (9)
Tarnished Guy who Puts a Num on it
Jacob Falkovich (Jacobian) · 2022-07-06T18:05:59.168Z · comments (11)
Artificial Sandwiching: When can we test scalable alignment protocols without humans?
Sam Bowman (sbowman) · 2022-07-13T21:14:08.145Z · comments (6)
Safety considerations for online generative modeling
Sam Marks (samuel-marks) · 2022-07-07T18:31:19.316Z · comments (9)
The curious case of Pretty Good human inner/outer alignment
PavleMiha · 2022-07-05T19:04:49.434Z · comments (45)
Systemization
CFAR!Duncan (CFAR 2017) · 2022-07-11T18:39:04.750Z · comments (5)
[link] Existential Risk Analysis in Empirical Research Papers
Dan H (dan-hendrycks) · 2022-07-02T00:09:49.399Z · comments (0)
Bucket Errors
CFAR!Duncan (CFAR 2017) · 2022-07-29T18:50:48.549Z · comments (7)
[link] QNR Prospects
PeterMcCluskey · 2022-07-16T02:03:37.258Z · comments (3)
Covid 7/14/22: BA.2.75 Plus Tax
Zvi · 2022-07-14T14:40:00.587Z · comments (9)
Mosaic and Palimpsests: Two Shapes of Research
adamShimi · 2022-07-12T09:05:28.984Z · comments (3)
Predicting Parental Emotional Changes?
jefftk (jkaufman) · 2022-07-06T13:50:04.387Z · comments (11)