LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

~80 Interesting Questions about Foundation Model Agent Safety
RohanS · 2024-10-28T16:37:04.713Z · comments (4)

Work with me on agent foundations: independent fellowship
Alex_Altair · 2024-09-21T13:59:16.706Z · comments (5)

[link] What Ketamine Therapy Is Like
Sable · 2024-11-11T11:09:08.602Z · comments (5)

We ran an AI safety conference in Tokyo. It went really well. Come next year!
Blaine (blaine-rogers) · 2024-07-17T06:55:39.620Z · comments (1)

The Gemini Incident Continues
Zvi · 2024-02-27T16:00:05.648Z · comments (6)

We Don't Know Our Own Values, but Reward Bridges The Is-Ought Gap
johnswentworth · 2024-09-19T22:22:05.307Z · comments (47)

AI #54: Clauding Along
Zvi · 2024-03-07T16:00:05.066Z · comments (11)

[link] Loneliness and suicide mitigation for students using GPT3-enabled chatbots (survey of Replika users in Nature)
Kaj_Sotala · 2024-01-23T14:05:40.986Z · comments (2)

[link] AI Rights for Human Safety
Simon Goldstein (simon-goldstein) · 2024-08-01T23:01:07.252Z · comments (6)

A starting point for making sense of task structure (in machine learning)
Kaarel (kh) · 2024-02-24T01:51:49.227Z · comments (2)

Things Solenoid Narrates
Solenoid_Entity · 2024-04-12T23:57:16.169Z · comments (2)

AI #72: Denying the Future
Zvi · 2024-07-11T15:00:05.865Z · comments (8)

AI #80: Never Have I Ever
Zvi · 2024-09-10T17:50:08.074Z · comments (20)

[link] How people stopped dying from diarrhea so much (& other life-saving decisions)
Writer · 2024-03-16T16:00:47.830Z · comments (0)

[link] Rational Animations' intro to mechanistic interpretability
Writer · 2024-06-14T16:10:57.015Z · comments (1)

[link] AlphaGeometry: An Olympiad-level AI system for geometry
alyssavance · 2024-01-17T17:17:30.913Z · comments (9)

Startup Roundup #2
Zvi · 2024-08-06T13:30:06.554Z · comments (0)

Monthly Roundup #18: May 2024
Zvi · 2024-05-13T12:30:04.863Z · comments (10)

Dating Roundup #3: Third Time’s the Charm
Zvi · 2024-05-08T13:30:03.232Z · comments (27)

Principled Satisficing To Avoid Goodhart
JenniferRM · 2024-08-16T19:05:27.204Z · comments (2)

[link] Against Student Debt Cancellation From All Sides of the Political Compass
Maxwell Tabarrok (maxwell-tabarrok) · 2024-05-13T14:55:57.525Z · comments (16)

Apply to LASR Labs: a London-based technical AI safety research programme
Erin Robertson · 2024-04-09T17:34:06.847Z · comments (1)

Higher-Order Forecasts
ozziegooen · 2024-05-22T21:49:42.802Z · comments (1)

In defense of technological unemployment as the main AI concern
tailcalled · 2024-08-27T17:58:01.992Z · comments (36)

AI #60: Oh the Humanity
Zvi · 2024-04-18T14:10:02.281Z · comments (7)

Announcing Atlas Computing
miyazono · 2024-04-11T15:56:31.241Z · comments (4)

Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural
Rubi J. Hudson (Rubi) · 2024-07-16T22:44:17.128Z · comments (27)

[link] Open Sourcing Metaculus
ChristianWilliams · 2024-07-02T22:30:01.339Z · comments (0)

What does davidad want from «boundaries»?
Chipmonk · 2024-02-06T17:45:42.348Z · comments (1)

D&D.Sci Long War: Defender of Data-mocracy Evaluation & Ruleset
aphyer · 2024-05-14T03:35:10.586Z · comments (3)

On the Contrary, Steelmanning Is Normal; ITT-Passing Is Niche
Zack_M_Davis · 2024-01-09T23:12:20.349Z · comments (31)

Case Study: Interpreting, Manipulating, and Controlling CLIP With Sparse Autoencoders
Gytis Daujotas (gytis-daujotas) · 2024-08-01T21:08:38.800Z · comments (6)

[link] Level up your spreadsheeting
angelinahli · 2024-05-25T14:57:19.730Z · comments (11)

[question] "Deception Genre" What Books are like Project Lawful?
Double · 2024-08-28T17:19:52.172Z · answers+comments (20)

Start an Upper-Room UV Installation Company?
jefftk (jkaufman) · 2024-10-19T02:00:10.691Z · comments (9)

Economics Roundup #3
Zvi · 2024-09-10T13:50:06.955Z · comments (9)

On Trust
johnswentworth · 2023-12-06T19:19:07.680Z · comments (26)

An Introduction to AI Sandbagging
Teun van der Weij (teun-van-der-weij) · 2024-04-26T13:40:00.126Z · comments (10)

AI #38: Let’s Make a Deal
Zvi · 2023-11-16T19:50:05.442Z · comments (2)

[link] LLM Evaluators Recognize and Favor Their Own Generations
Arjun Panickssery (arjun-panickssery) · 2024-04-17T21:09:12.007Z · comments (1)

[link] Fluent dreaming for language models (AI interpretability method)
tbenthompson (ben-thompson) · 2024-02-06T06:02:59.296Z · comments (4)

Userscript to always show LW comments in context vs at the top
Vlad Sitalo (harcisis) · 2023-11-21T17:53:30.418Z · comments (8)

[link] Non-alignment project ideas for making transformative AI go well
Lukas Finnveden (Lanrian) · 2024-01-04T07:23:13.658Z · comments (1)

D&D.Sci Long War: Defender of Data-mocracy
aphyer · 2024-04-26T22:30:15.780Z · comments (20)

Laying the Foundations for Vision and Multimodal Mechanistic Interpretability & Open Problems
Sonia Joseph (redhat) · 2024-03-13T17:09:17.027Z · comments (13)

When Does Altruism Strengthen Altruism?
jefftk (jkaufman) · 2024-01-21T18:50:05.424Z · comments (2)

Back to Basics: Truth is Unitary
lsusr · 2024-03-29T21:10:33.399Z · comments (13)

New intro textbook on AIXI
Alex_Altair · 2024-05-11T18:18:50.945Z · comments (8)

Auditing failures vs concentrated failures
ryan_greenblatt · 2023-12-11T02:47:35.703Z · comments (0)

ProLU: A Nonlinearity for Sparse Autoencoders
Glen Taggart · 2024-04-23T14:09:21.592Z · comments (4)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

esben-kran on College technical AI safety hackathon retrospective - Georgia Tech

Super cool work Yixiong - we were impressed by your professionalism in this process despite working within another group's whims on this one. Some other observations from our side that may be relevant for other folks hosting hackathons:
- Prepare starter materials: For example, for some of our early interpretability hackathons, we built a full resource base (Github) with videos, Colabs, and much more (some of it with Neel Nanda, big appreciation for his efforts in making interp more available). Our philosophy for the starter materials are: "If a participant can make a submission-worthy project by maximum cloning your repo and typing two commands or simply walk through a Google Colab, this is the ideal starter code." This means that with only small adjustments, they'll be able to make an original project. We rarely if ever see this exploited, i.e. "template code as submission" because they're able to copy-paste things around for a really strong research project.
- Make sure what they should submit is super clear: Making a really nice template goes a long way to make a submission super clear for participants. An example can be seen in our MASEC hackathon: Docs and page. If someone can just receive your submission template and know everything they need to know to submit a great project, that is really good since they'll be spending most of their time inside of that document.
- Make sure judging criteria are really good: People will use your judging criteria to determine what to prioritize in their project. This is extremely valuable for you to get right. For example, we usually use a variation on the three criteria: 1) Topic advancement, 2) AI safety impact, and 3) quality / reproducibility. A recent example was the Agent Security Hackathon:

> 1. Agent safety: Does the project move the field of agent safety forward? After reading this, do we know more about how to detect dangerous agents, protect against dangerous agents, or build safer agents than before?
> 2. AI safety: Does the project solve a concrete problem in AI safety? If this project is fully realized, would we expect the world with superintelligence to be a safer (even marginally) than yesterday?
> 3. Methodology: Is the project well-executed and is the code available so we can review it? Do we expect the results to generalize beyond the specific case(s) presented in the submission?

- Make the resources and ideas available early: As Yixiong mentions, it's really valuable for people not to be confused. If they know exactly what report format they'll submit, which idea they'll work on, and who they'll work with, this is a great way to ensure that the 2-3 days of hacking are an incredibly efficient use of their time.
- Matching people by ideas trumps by background: We've tried various ways to match individuals who don't have teams. The absolute best system we've found is to get people to brainstorm before the hackathon, share their ideas, and organize teams online. We also host team matching sessions which consist of fun-fact-intros and otherwise just discusses specific research ideas.
- Don't make it longer than a weekend: If you host a hackathon and make it longer than a weekend, most people who cannot attend outside that weekend will avoid participating because they'll feel that the ones who can participate more than the weekend can spend their weekdays to win the grand prize. Additionally, a very counter-intuitive thing happens where if you give people three weeks, they'll actually spend much less time on it than if you just give them a weekend. This can depend on the prizes or outcome rewards, of course, but is a really predictable effect, in our experience.
- Don't make it shorter than two days: Depending on your goal, one day will never be enough to create an original project. Our aims are original pilot research papers that can stand on their own and the few one-day events we've hosted have never worked very well, except for brainstorming. Often, participants won't even have any functional code or any ideas on the Sunday morning of the event but by the submission deadline have a really high quality project that wins the top prize. This seems to happen due to this very concrete exploration of ideas that happens in the IDE and on the internet where some are discarded and nothing promising comes up before 11am Sunday.

And as Yixiong mentions, we have more resources on this along with an official chapter network (besides volunteer locations) at https://www.apartresearch.com/sprints/locations. You're welcome to get in touch if you're interested in hosting at sprints@apartresearch.com.

COI: One of our researchers hosted a cyber-evals workshop at Yixiong's AI safety track.

elityre on Lao Mein's Shortform

You maybe right. Maybe the top talent wouldn't have gotten on board with that mission, and so it wouldn't have gotten top talent.

I bet Illya would have been in for that mission, and I think a surprisingly large number of other top researchers might have been in for it as well. Obviously we'll never know.

And I think if the founders are committed to a mission, and they reaffirm their commitment in every meeting, they can go surprisingly far in making in the culture of an org.

martin-randall on My AI Model Delta Compared To Yudkowsky

The concept of marriage depends on my internals in that a different human might disagree about whether a couple is married, based on the relative weight they place on religious, legal, traditional, and common law conceptions of marriage. For example, after a Catholic annulment and a legal divorce, a Catholic priest might say that two people were never married, whereas I would say that they were. Similarly, I might say that two men are married to each other, and someone else might say that this is impossible. How quickly those arguments have faded away! I don't think someone would use the same example ten years ago.

abramdemski on o1 is a bad idea

Assuming the goals are done over say 1-10 year timescales, or maybe even just 1 year timescales with no reward-shaping/giving feedback for intermediate rewards at all, I do think that the system won't work well enough to be relevant, since it requires way too much time training, and plausibly way too much compute depending on how sparse the feedback actually is.

Ah, I wasn't thinking "sparse" here meant anywhere near that sparse. I thought your dense-vs-sparse was doing something like contrasting RLHF (very dense, basically no instrumental convergence) with chess (very sparse, plenty of instrumental convergence).

I still think o1 is moving towards chess on this spectrum.

martin-randall on My AI Model Delta Compared To Yudkowsky

A potential big Model Delta in this conversation is between Yudkowsky-2022 and Yudkowsky-2024. From List of Lethalities [LW · GW]:

The AI does not think like you do, the AI doesn't have thoughts built up from the same concepts you use, it is utterly alien on a staggering scale. Nobody knows what the hell GPT-3 is thinking, not only because the matrices are opaque, but because the stuff within that opaque container is, very likely, incredibly alien - nothing that would translate well into comprehensible human thinking, even if we could see past the giant wall of floating-point numbers to what lay behind.

Vs the parent comment:

I think that the AI's internal ontology is liable to have some noticeable alignments to human ontology w/r/t the purely predictive aspects of the natural world; it wouldn't surprise me to find distinct thoughts in there about electrons. As the internal ontology goes to be more about affordances and actions, I expect to find increasing disalignment. As the internal ontology takes on any reflective aspects, parts of the representation that mix with facts about the AI's internals, I expect to find much larger differences -- not just that the AI has a different concept boundary around "easy to understand", say, but that it maybe doesn't have any such internal notion as "easy to understand" at all, because easiness isn't in the environment and the AI doesn't have any such thing as "effort". Maybe it's got categories around yieldingness to seven different categories of methods, and/or some general notion of "can predict at all / can't predict at all", but no general notion that maps onto human "easy to understand" -- though "easy to understand" is plausibly general-enough that I wouldn't be unsurprised to find a mapping after all.

Yudkowsky is "not particularly happy" with List of Lethalities, and this comment was made a day after the opening post, so neither quote should be considered a perfect expression of Yudkowsky's belief. In particular the second quote is more epistemically modest, which might be because it is part of a conversation rather than a self-described "individual rant". Still, the differences are stark. Is the AI utterly, incredibly alien "on a staggering scale", or does the AI have "noticeable alignments to human ontology"? Are the differences pervasive with "nothing that would translate well", or does it depend on whether the concepts are "purely predictive", about "affordances and actions", or have "reflective aspects"?

The second quote is also less lethal. Human-to-human comparisons seem instructive. A deaf human will have thoughts about electrons, but their internal ontology around affordances and actions will be less aligned. Someone like Eliezer Yudkwosky has the skill of noticing when a concept definition has a step where its boundary depends on your own internals rather than pure facts about the environment, whereas I can't do that because I project the category boundary onto the environment. Someone with dissociative identities may not have a general notion that maps onto my "myself". Someone who is enlightened may not have a general notion that maps onto my "I want". And so forth.

Regardless, different ontologies is still a clear risk factor. The second quote still modestly allows the possibility of a mind so utterly alien that it doesn't have thoughts about electrons. And there are 42 other lethalities in the list. Security mindset says that risk factors can combine in unexpected ways and kill you.

I'm not sure if this is an update from Yudkowsky-2022 to Yudkowsky-2024. I might expect an update to be flagged as such (eg "I now think that..." instead of "I think that..."). But Yudkowsky said elsewhere that he has made some positive updates. I'm curious if this is one of them.

abramdemski on 5 ways to improve CoT faithfulness

Async is fine with me. Please feel encouraged to start it.

daniel-kokotajlo on 5 ways to improve CoT faithfulness

Potentially... would it be async or would we need to coordinate on a time?

abramdemski on 5 ways to improve CoT faithfulness

My current epistemic state is pretty "I'm not sure at all, it's a complicated subject" fwiw. I'll try to generate some responses anyway.

make the system overall much harder to shape in ways you want (because it is actively hiding things from you so your RL is less accurate)
I thought the point was that we wouldn't train against the CoT basically at all with the baseline proposal? (Similar to o1.)

Sure. Therefore, you can't check for deceptive behavior using the CoT when you're trying to train against deceptive behavior. Therefore, Hubinger et al's conclusions about how deceptive behavior makes the system harder to train still hold even if you sorta get a transparency tool out of this.

You can still inspect the underlying CoT to know if you have big issues.

If other companies go in this direction like OpenAI has, then they'll probably hide the CoT from the end-users for similar reasons, so you can't say that the upside is that you, the end user, get to see whether the model is deliberately deceiving you with respect to specific questions.

So you probably mean the upside is that the companies can see this info in the CoT, to check whether models are generally behaving in a deceptive way (similar to what we saw in the o1 report).

Sure, but you're making a commitment not to use this in training, so what are you doing with this information? If the model is deceptive, you just throw it away (or, more realistically, make excuses about how the deception you observed isn't so bad [LW · GW]). But your version of the proposal deliberately trains the model to be deceptive. So yeah, you can look, but what's the point? You're checking whether a coin you rigged to land on heads in fact landed on heads.

(My guess is that you have this concern with the baseline protocol and thus don't like moving further in this direction.)

Yeah, you're right about that. But there is a difference. I can think of the original proposal as a "canary in a coal mine": you hope that the model doesn't learn deceptive behavior (you plan to chuck it out if you detect any), but you still follow Daniel's division of labor & refuse to do any alignment training on the CoT so that you have a chance of detecting deception if it occurs.

On the other hand, your version of Daniel's protocol loses this property. Metaphorically, it's like killing the canary before you go into the coal mine. If the canary was supposed to be an early warning to get out of there, you've completely defeated the point, and for what? (This is the part I'm unclear about, and why I asked my original question to you, rather than Daniel.)

Deceptive behavior is hard to remove once it creeps in. This is nasty stuff. A technique for getting deceptive behavior to reveal itself in the CoT would be wonderfully useful if you could then use that in the training signal to remove the deceptive behavior, but the proposal here relies on not doing that.

Basically, if there is a basin of attraction for corrigibility [AF · GW], it seems to me that deceptive behavior is the opposite of that. It puts you outside the basin of attraction. It means you're on-coarse for Evan Hubinger's story about how malign inner optimizers develop. It means you're on-coarse for the Bad Ending rather than the Good Ending. And for what?

My take (very much siding with Evan Hubinger here) is that the target should be keeping well away from deceptive behavior throughout training, so that there's never that kernel of skill that could grow into a malign inner optimizer.

shankar-sivarajan on Making a conservative case for alignment

Or just capitalize the "C".

elityre on Lao Mein's Shortform

Also, Sam Altman is a pretty impressive guy. I wonder what would have happened if he had decided to try to stop humanity from building AGI, instead of trying to be the one to do it instead of google.