LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

[link] "Can AI Scaling Continue Through 2030?", Epoch AI (yes)
gwern · 2024-08-24T01:40:32.929Z · comments (4)

Circuits in Superposition: Compressing many small neural networks into one
Lucius Bushnaq (Lblack) · 2024-10-14T13:06:14.596Z · comments (7)

How I started believing religion might actually matter for rationality and moral philosophy
zhukeepa · 2024-08-23T17:40:47.341Z · comments (41)

BIG-Bench Canary Contamination in GPT-4
Jozdien · 2024-10-22T15:40:48.166Z · comments (13)

[link] Miles Brundage resigned from OpenAI, and his AGI readiness team was disbanded
garrison · 2024-10-23T23:40:57.180Z · comments (1)

A bird's eye view of ARC's research
Jacob_Hilton · 2024-10-23T15:50:06.123Z · comments (12)

[link] My Number 1 Epistemology Book Recommendation: Inventing Temperature
adamShimi · 2024-09-08T14:30:40.456Z · comments (18)

Why I funded PIBBSS
Ryan Kidd (ryankidd44) · 2024-09-15T19:56:33.018Z · comments (21)

Should CA, TX, OK, and LA merge into a giant swing state, just for elections?
Thomas Kwa (thomas-kwa) · 2024-11-06T23:01:48.992Z · comments (35)

[link] A primer on the current state of longevity research
Abhishaike Mahajan (abhishaike-mahajan) · 2024-08-22T17:14:57.990Z · comments (6)

Please stop using mediocre AI art in your posts
Raemon · 2024-08-25T00:13:52.890Z · comments (24)

[link] Perplexity wins my AI race
Elizabeth (pktechgirl) · 2024-08-24T19:20:10.859Z · comments (12)

Backdoors as an analogy for deceptive alignment
Jacob_Hilton · 2024-09-06T15:30:06.172Z · comments (2)

I turned decision theory problems into memes about trolleys
Tapatakt · 2024-10-30T20:13:29.589Z · comments (20)

What happens if you present 500 people with an argument that AI is risky?
KatjaGrace · 2024-09-04T16:40:03.562Z · comments (7)

LLM Applications I Want To See
sarahconstantin · 2024-08-19T21:10:03.101Z · comments (5)

LLMs can learn about themselves by introspection
Felix J Binder (fjb) · 2024-10-18T16:12:51.231Z · comments (38)

Refactoring cryonics as structural brain preservation
Andy_McKenzie · 2024-09-11T18:36:30.285Z · comments (14)

Scissors Statements for President?
AnnaSalamon · 2024-11-06T10:38:21.230Z · comments (31)

[link] Advice for journalists
Nathan Young · 2024-10-07T16:46:40.929Z · comments (53)

Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren't scheming
Buck · 2024-10-10T13:36:53.810Z · comments (4)

Why comparative advantage does not help horses
Sherrinford · 2024-09-30T22:27:57.450Z · comments (10)

Neutrality
sarahconstantin · 2024-11-13T23:10:05.469Z · comments (0)

The case for unlearning that removes information from LLM weights
Fabien Roger (Fabien) · 2024-10-14T14:08:04.775Z · comments (14)

[link] the Giga Press was a mistake
bhauth · 2024-08-21T04:51:24.150Z · comments (26)

2024 Petrov Day Retrospective
Ben Pace (Benito) · 2024-09-28T21:30:14.952Z · comments (25)

[link] Sabotage Evaluations for Frontier Models
David Duvenaud (david-duvenaud) · 2024-10-18T22:33:14.320Z · comments (18)

[link] Anthropic: Three Sketches of ASL-4 Safety Case Components
Zach Stein-Perlman · 2024-11-06T16:00:06.940Z · comments (32)

[link] Finishing The SB-1047 Documentary In 6 Weeks
Michaël Trazzi (mtrazzi) · 2024-10-28T20:17:47.465Z · comments (5)

SB 1047: Final Takes and Also AB 3211
Zvi · 2024-08-27T22:10:07.647Z · comments (11)

You can, in fact, bamboozle an unaligned AI into sparing your life
David Matolcsi (matolcsid) · 2024-09-29T16:59:43.942Z · comments (171)

[question] Am I confused about the "malign universal prior" argument?
nostalgebraist · 2024-08-27T23:17:22.779Z · answers+comments (33)

Information vs Assurance
johnswentworth · 2024-10-20T23:16:25.762Z · comments (7)

Bigger Livers?
sarahconstantin · 2024-11-08T21:50:09.814Z · comments (10)

Defining alignment research
Richard_Ngo (ricraz) · 2024-08-19T20:42:29.279Z · comments (23)

Science advances one funeral at a time
Cameron Berg (cameron-berg) · 2024-11-01T23:06:19.381Z · comments (9)

Making a conservative case for alignment
Cameron Berg (cameron-berg) · 2024-11-15T18:55:40.864Z · comments (7)

Three Notions of "Power"
johnswentworth · 2024-10-30T06:10:08.326Z · comments (43)

Singular learning theory: exercises
Zach Furman (zfurman) · 2024-08-30T20:00:03.785Z · comments (5)

[link] Executable philosophy as a failed totalizing meta-worldview
jessicata (jessica.liu.taylor) · 2024-09-04T22:50:18.294Z · comments (40)

Solving adversarial attacks in computer vision as a baby version of general AI alignment
Stanislav Fort (stanislavfort) · 2024-08-29T17:17:47.136Z · comments (8)

[link] Self-Help Corner: Loop Detection
adamShimi · 2024-10-02T08:33:23.487Z · comments (6)

Research update: Towards a Law of Iterated Expectations for Heuristic Estimators
Eric Neyman (UnexpectedValues) · 2024-10-07T19:29:29.033Z · comments (2)

LLMs Look Increasingly Like General Reasoners
eggsyntax · 2024-11-08T23:47:28.886Z · comments (40)

GPT-o1
Zvi · 2024-09-16T13:40:06.236Z · comments (34)

There is a globe in your LLM
jacob_drori (jacobcd52) · 2024-10-08T00:43:40.300Z · comments (4)

[Intuitive self-models] 1. Preliminaries
Steven Byrnes (steve2152) · 2024-09-19T13:45:27.976Z · comments (20)

[question] What are the best arguments for/against AIs being "slightly 'nice'"?
Raemon · 2024-09-24T02:00:19.605Z · answers+comments (51)

5 homegrown EA projects, seeking small donors
Austin Chen (austin-chen) · 2024-10-28T23:24:25.745Z · comments (4)

Newsom Vetoes SB 1047
Zvi · 2024-10-01T12:20:06.127Z · comments (6)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

lukehmiles on lukehmiles's Shortform

I do think I could put a good team together and make decent contributions quickly

lukehmiles on lukehmiles's Shortform

I can only find capabilities jobs right now. I would be interested in starting a tiny applied research org or something. How hard is it to get funding for that? I don't have a strong relevant public record, but I did quite a lot of work at METR and elsewhere.

ryan_greenblatt on 5 ways to improve CoT faithfulness

On the other hand, your version of Daniel's protocol loses this property.

I'm talking about training only the Face, not training the policy (shoggoth) at all with the proposal I'm imagining.

And, these should clearly be separate models such that the training of one doesn't generalize to the other.

So, making the Face more deceptive doesn't kill the canary? The question is whether we end up with a highly misaligned policy (and this is clear from the chain of thought such that we can discover this). I think Daniel was expecting in the baseline proposal that you end up with a Face that does at least some censoring/mild deception due to training incentives (I think we already see a bit of this with o1).

Maybe your concern is that it is plausible that the policy won't be misaligned, but the Face will be and this will screw us over. This seems plausible and is worth worrying about (with both proposals). However, I thought that a core part of the hope is that the Face isn't smart enough to cleverly scheme (unless this schemeing happens in the CoT itself). And indeed, if the Face can cleverly scheme without CoT, what is stopping the policy from doing this too?

(It might also be possible to reduce the extent to which the initial Face has control over the world. E.g., you can initialize one Face to use for the first part of training (including the proposal I was mentioning) and then train another later on to imitate the earlier Face on a subpart of the distribution you inspect with the hope that it doesn't inherit learned misalignment. You could also deploy a different Face than the one you use for training if you aren't doing online training.)

yix on College technical AI safety hackathon retrospective - Georgia Tech

Thanks again Esben for collaborating with us! Can confidently say that the above is super valuable advice for any AI safety hackathon organizers, they're consistent with our experiences.

In the context of a college campus hackathon, I'd especially stress focus on preparing starter materials and making submission requirements clear early on!

nosignalnonoise on Heresies in the Shadow of the Sequences

I like what you're doing, but I feel like the heresies you propose are too tame.

Here are some more radical heresies to consider:

Most people are far more bottlenecked on some combination of akrasia and prospective memory, not on the accuracy of their models of the world. Rationalists in particular would be better off devoting effort to actually doing the obvious things than to understanding the world better.
Self deception is very instrumentally useful a large fraction of real world situations we find ourselves in, and we should use more of it.
1. Mormons seem to be especially good at coordinating on good lifestyle choices, so we should all consider becoming Mormon.
Among groups of 10+ people, it's usually more useful to get everyone all working on implementing the same plan than it is to come up with the best plan.
Intelligence (of the sort measured by exams and IQ tests) is only moderately important to success.

elityre on Lao Mein's Shortform

But he helped found OpenAI, and recently founded another AI company.

I think Elon's strategy of "telling the world not to build AGI, and then going to start another AGI company himself" is much less dumb / ethical fraught [LW(p) · GW(p)], than people often credit.

myles-h on What are Emotions?

"Values" happen to be a thing possessed by thinking entities

What happens then when a non-thinking thing feels happy? Is that happiness valued? To whom? Or do you think this is impossible?

I can imagine it possible for a fetus in the womb without any thoughts, sense of self, or an ability to move, to still be capable of feeling happiness. Now try to imagine a hypothetical person with a severe mental disability preventing them having any cohesive thoughts, sense of self, or an ability to move. Could they still feel happiness? What happens when the dopamine receptors get triggered?

It is my hypothesis that the mechanism by which emotions are felt does not require a "thinking" agent. This could be false and I now see how this is an assumption which many of my arguments rely on. Thank you for catching that.

It just seems so clear to me. When I feel pain or pleasure, I don't need to "think" about it for the emotion to be felt. I just immediately feel the pain or pleasure.

Anyway, if you assume that it is possible for a non-thinker to still be a feeler, then there is nothing logically inconceivable about a hypothetical happy rock. Then if you also say that happiness is good, and that good implies value, one must ask, who or what is valuing the happiness? The rock? The universe?

Ok maybe not "the universe" as to mean the collection of all objects within the universe. I'm more trying to say "the fabric of reality". Like there must be some physical process by which happiness is valued. Maybe a dimension by which emotional value is expressed?

I also suspect that some of the things you're calling "material terminal values" are actually better modeled as instrumental

You are partly correct about this. When I said I terminally value the making of kinetic sculptures, I was definitely making a simplification. I don't value the making of all kinetic sculptures, and I also value the making of things which aren't kinetic sculptures. I don't, however, do it because I think it is "fun". I can't formally define what the actual material terminal goal is but it is something more along the lines of, "something that is challenging, and requires a certain kind of problem solving, where the solution is beautiful in some way".

Anyway, it is often the case that the making of kinetic sculptures fits this description.

It is not true that I "simply enjoy the process of building them". Whatever the actual definition of my goal is, I don't want it because it is an instrumental goal to some emotion. This precisely what I am defining a material terminal goal to be. Any terminal goal which is not an emotion.

I also think you're calling something universal to humans when it really isn't.

I should have clarified this better. I am not saying the intensity or valence direction of emotions is universal. I am simply saying that the emotions, in general, are universally valued. Thank you for correcting me on the way masochists work. I didn't realize they were "genuinely wired differently". I just assumed they had some conflicting goal which made pain worth it. This doesn't break my argument however. I would say that the masochist is not feeling pain at that point. They would be feeling some other emotion for emotions are defined by the chemical and neural processes which make them happen. Similar to how my happiness and your happiness are not the same, but they are close enough to be grouped into a word. The key piece though is that regardless, as tslarm [LW · GW] says, "emotions are accompanied by (or identical with, depending on definitions) valenced qualia". They always have some value.

I agree that there are good reasons to value the feelings of others. I'm not sure the Ship of Theseus argument is one of them, really, but I'm also not sure I fully understood your point there.

Ahhh, yeah sorry that wasn't the clearest, I was making the point that one should value the emotions of more than just other humans. Like pigs, cats, dogs, or feely blobs.

elityre on Reformative Hypocrisy, and Paying Close Enough Attention to Selectively Reward It.

Thinking about this post for a bit shifted my view of Elon Musk a bit. He gets flack for calling for an AI pause, and then going and starting an AGI lab, and I now think that's unfair.

I think his overall strategic takes are harmful, but I do credit him with being basically the only would-be AGI-builder who seems to me to be engaged in a reformative hypocrisy strategy. For one thing, it sounds like he went out of his way to try to get AI regulated (talking to congress, talking to the governors), and supported SB-1047.

I think it's actually not that unreasonable to shout "Yo! This is dangerous! This should be regulated, and controlled democratically!", see that that's not happening, and then go and try do it in a way that you think is better.

That seems like possibly an example of "follower-conditional leadership." Taking real action to shift to the better equilibrium, failing, and then going back to the dominant strategy given the inadequate equilibrium that exists.

Obviously he has different beliefs than I do, and than my culture does, about what is required for a good outcome. I think he's still causing vast harms, but I think he doesn't deserve the eye-roll for founding another AGI lab after calling for everyone to stop.

esben-kran on College technical AI safety hackathon retrospective - Georgia Tech

Super cool work Yixiong - we were impressed by your professionalism in this process despite working within another group's whims on this one. Some other observations from our side that may be relevant for other folks hosting hackathons:
- Prepare starter materials: For example, for some of our early interpretability hackathons, we built a full resource base (Github) with videos, Colabs, and much more (some of it with Neel Nanda, big appreciation for his efforts in making interp more available). Our philosophy for the starter materials are: "If a participant can make a submission-worthy project by maximum cloning your repo and typing two commands or simply walk through a Google Colab, this is the ideal starter code." This means that with only small adjustments, they'll be able to make an original project. We rarely if ever see this exploited, i.e. "template code as submission" because they're able to copy-paste things around for a really strong research project.
- Make sure what they should submit is super clear: Making a really nice template goes a long way to make a submission super clear for participants. An example can be seen in our MASEC hackathon: Docs and page. If someone can just receive your submission template and know everything they need to know to submit a great project, that is really good since they'll be spending most of their time inside of that document.
- Make sure judging criteria are really good: People will use your judging criteria to determine what to prioritize in their project. This is extremely valuable for you to get right. For example, we usually use a variation on the three criteria: 1) Topic advancement, 2) AI safety impact, and 3) quality / reproducibility. A recent example was the Agent Security Hackathon:

> 1. Agent safety: Does the project move the field of agent safety forward? After reading this, do we know more about how to detect dangerous agents, protect against dangerous agents, or build safer agents than before?
> 2. AI safety: Does the project solve a concrete problem in AI safety? If this project is fully realized, would we expect the world with superintelligence to be a safer (even marginally) than yesterday?
> 3. Methodology: Is the project well-executed and is the code available so we can review it? Do we expect the results to generalize beyond the specific case(s) presented in the submission?

- Make the resources and ideas available early: As Yixiong mentions, it's really valuable for people not to be confused. If they know exactly what report format they'll submit, which idea they'll work on, and who they'll work with, this is a great way to ensure that the 2-3 days of hacking are an incredibly efficient use of their time.
- Matching people by ideas trumps by background: We've tried various ways to match individuals who don't have teams. The absolute best system we've found is to get people to brainstorm before the hackathon, share their ideas, and organize teams online. We also host team matching sessions which consist of fun-fact-intros and otherwise just discusses specific research ideas.
- Don't make it longer than a weekend: If you host a hackathon and make it longer than a weekend, most people who cannot attend outside that weekend will avoid participating because they'll feel that the ones who can participate more than the weekend can spend their weekdays to win the grand prize. Additionally, a very counter-intuitive thing happens where if you give people three weeks, they'll actually spend much less time on it than if you just give them a weekend. This can depend on the prizes or outcome rewards, of course, but is a really predictable effect, in our experience.
- Don't make it shorter than two days: Depending on your goal, one day will never be enough to create an original project. Our aims are original pilot research papers that can stand on their own and the few one-day events we've hosted have never worked very well, except for brainstorming. Often, participants won't even have any functional code or any ideas on the Sunday morning of the event but by the submission deadline have a really high quality project that wins the top prize. This seems to happen due to this very concrete exploration of ideas that happens in the IDE and on the internet where some are discarded and nothing promising comes up before 11am Sunday.

And as Yixiong mentions, we have more resources on this along with an official chapter network (besides volunteer locations) at https://www.apartresearch.com/sprints/locations. You're welcome to get in touch if you're interested in hosting at sprints@apartresearch.com.

COI: One of our researchers hosted a cyber-evals workshop at Yixiong's AI safety track.

elityre on Lao Mein's Shortform

You maybe right. Maybe the top talent wouldn't have gotten on board with that mission, and so it wouldn't have gotten top talent.

I bet Illya would have been in for that mission, and I think a surprisingly large number of other top researchers might have been in for it as well. Obviously we'll never know.

And I think if the founders are committed to a mission, and they reaffirm their commitment in every meeting, they can go surprisingly far in making in the culture of an org.