LessWrong 2.0 Reader


One person's worth of mental energy for AI doom aversion jobs. What should I do?
Lorec · 2024-08-26T01:29:01.700Z · comments (16)

Deception and Jailbreak Sequence: 2. Iterative Refinement Stages of Jailbreaks in LLM
Winnie Yang (winnie-yang) · 2024-08-28T08:41:38.967Z · comments (2)

[link] Is Redistributive Taxation Justifiable? Part 1: Do the Rich Deserve their Wealth?
Alexander de Vries (alexander-de-vries) · 2024-09-05T10:23:08.958Z · comments (20)

[question] What are some good ways to form opinions on controversial subjects in the current and upcoming era?
notfnofn · 2024-10-27T14:33:53.960Z · answers+comments (21)

[link] Spherical cow
dkl9 · 2024-11-11T03:10:27.788Z · comments (0)

Sequence overview: Welfare and moral weights
MichaelStJules · 2024-08-15T04:22:32.567Z · comments (0)

[link] Checking public figures on whether they "answered the question" quick analysis from Harris/Trump debate, and a proposal
david reinstein (david-reinstein) · 2024-09-11T20:25:27.845Z · comments (4)

Of Birds and Bees
RussellThor · 2024-09-30T10:52:15.069Z · comments (9)

Fake Blog Posts as a Problem Solving Device
silentbob · 2024-08-31T09:22:54.513Z · comments (0)

[link] [Linkpost] Hawkish nationalism vs international AI power and benefit sharing
jakub_krys (kryjak) · 2024-10-18T18:13:19.425Z · comments (5)

Denver USA - ACX Meetups Everywhere Fall 2024
Eneasz · 2024-08-29T18:40:53.332Z · comments (0)

[question] Does a time-reversible physical law/Cellular Automaton always imply the First Law of Thermodynamics?
Noosphere89 (sharmake-farah) · 2024-08-30T15:12:28.823Z · answers+comments (11)

[question] What makes one a "rationalist"?
mathyouf · 2024-10-08T20:25:21.812Z · answers+comments (5)

The Personal Implications of AGI Realism
xizneb · 2024-10-20T16:43:37.870Z · comments (7)

[question] somebody explain the word "epistemic" to me
KvmanThinking (avery-liu) · 2024-10-28T16:40:24.275Z · answers+comments (8)

[link] Validating / finding alignment-relevant concepts using neural data
Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-09-20T21:12:49.267Z · comments (0)

Quantitative Trading Bootcamp [Nov 6-10]
Ricki Heicklen (bayesshammai) · 2024-10-28T18:39:58.480Z · comments (0)

[question] If I ask an LLM to think step by step, how big are the steps?
ryan_b · 2024-09-13T20:30:50.558Z · answers+comments (1)

Piling bounded arguments
momom2 (amaury-lorin) · 2024-09-19T22:27:41.534Z · comments (0)

Enhancing Mathematical Modeling with LLMs: Goals, Challenges, and Evaluations
ozziegooen · 2024-10-28T21:44:42.352Z · comments (0)

[link] October 2024 Progress in Guaranteed Safe AI
Quinn (quinn-dougherty) · 2024-10-28T23:34:51.689Z · comments (0)

[link] Taking nonlogical concepts seriously
Kris Brown (kris-brown) · 2024-10-15T18:16:01.226Z · comments (5)

[link] Boons and banes
dkl9 · 2024-09-23T06:18:38.335Z · comments (0)

A Brief Explanation of AI Control
Aaron_Scher · 2024-10-22T07:00:56.954Z · comments (1)

A brief theory of why we think things are good or bad
David Johnston (david-johnston) · 2024-10-20T20:31:26.309Z · comments (10)

Join my new subscriber chat
sarahconstantin · 2024-11-06T02:30:11.059Z · comments (0)

Moral Trade, Impact Distributions and Large Worlds
Larks · 2024-09-20T03:45:56.273Z · comments (0)

[link] Thinking LLMs: General Instruction Following with Thought Generation
Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-10-15T09:21:22.583Z · comments (0)

[link] Consciousness As Recursive Reflections
Gunnar_Zarncke · 2024-10-05T20:00:53.053Z · comments (3)

[link] Contra Yudkowsky on 2-4-6 Game Difficulty Explanations
Josh Hickman (josh-hickman) · 2024-09-08T16:13:33.187Z · comments (1)

A gentle introduction to sparse autoencoders
Nick Jiang (nick-jiang) · 2024-09-02T18:11:47.086Z · comments (0)

2025 Q1 Pivotal Research Fellowship (Technical & Policy)
Tobias H (clearthis) · 2024-11-12T10:56:24.858Z · comments (0)

[link] Redundant Attention Heads in Large Language Models For In Context Learning
skunnavakkam · 2024-09-01T20:08:48.963Z · comments (0)

[link] Optimising under arbitrarily many constraint equations
dkl9 · 2024-09-12T14:59:28.475Z · comments (0)

Increasing the Span of the Set of Ideas
Jeffrey Heninger (jeffrey-heninger) · 2024-09-13T15:52:39.132Z · comments (1)

Halifax Canada - ACX Meetups Everywhere Fall 2024
interstice · 2024-08-29T18:39:12.490Z · comments (0)

Forever Leaders
Justice Howard (justice-howard) · 2024-09-14T20:55:39.095Z · comments (9)

Introducing Kairos: a new AI safety fieldbuilding organization (the new home for SPAR and FSP)
agucova · 2024-10-25T21:59:08.782Z · comments (0)

[question] How to cite LessWrong as an academic source?
PhilosophicalSoul (LiamLaw) · 2024-11-06T08:28:26.309Z · answers+comments (6)

[link] SCP Foundation - Anti memetic Division Hub
landscape_kiwi · 2024-09-15T13:40:52.691Z · comments (1)

[question] why won't this alignment plan work?
KvmanThinking (avery-liu) · 2024-10-10T15:44:59.450Z · answers+comments (7)

Exploring Shard-like Behavior: Empirical Insights into Contextual Decision-Making in RL Agents
Alejandro Aristizabal (alejandro-aristizabal) · 2024-09-29T00:32:42.161Z · comments (0)

Thirty random thoughts about AI alignment
Lysandre Terrisse · 2024-09-15T16:24:10.572Z · comments (1)

Understanding Hidden Computations in Chain-of-Thought Reasoning
rokosbasilisk · 2024-08-24T16:35:03.907Z · comments (1)

Thoughts on Evo-Bio Math and Mesa-Optimization: Maybe We Need To Think Harder About "Relative" Fitness?
Lorec · 2024-09-28T14:07:42.412Z · comments (6)

[link] Metaculus's 'Minitaculus' Experiments — Collaborate With Us
ChristianWilliams · 2024-08-26T20:44:32.125Z · comments (0)

[question] Can subjunctive dependence emerge from a simplicity prior?
Daniel C (harper-owen) · 2024-09-16T12:39:35.543Z · answers+comments (0)

'Chat with impactful research & evaluations' (Unjournal NotebookLMs)
david reinstein (david-reinstein) · 2024-09-28T00:32:16.845Z · comments (0)

Avoiding jailbreaks by discouraging their representation in activation space
Guido Bergman · 2024-09-27T17:49:20.785Z · comments (2)

Food, Prison & Exotic Animals: Sparse Autoencoders Detect 6.5x Performing Youtube Thumbnails
Louka Ewington-Pitsos (louka-ewington-pitsos) · 2024-09-17T03:52:43.269Z · comments (2)


Recent comments

habryka4 on lukehmiles's Shortform

What do you mean by "applied research org"? Like, applied alignment research?

lukehmiles on lukehmiles's Shortform

I do think I could put a good team together and make decent contributions quickly

lukehmiles on lukehmiles's Shortform

I can only find capabilities jobs right now. I would be interested in starting a tiny applied research org or something. How hard is it to get funding for that? I don't have a strong relevant public record, but I did quite a lot of work at METR and elsewhere.

ryan_greenblatt on 5 ways to improve CoT faithfulness

On the other hand, your version of Daniel's protocol loses this property.

I'm talking about training only the Face, not training the policy (shoggoth) at all with the proposal I'm imagining.

And, these should clearly be separate models such that the training of one doesn't generalize to the other.

So, making the Face more deceptive doesn't kill the canary? The question is whether we end up with a highly misaligned policy (and this is clear from the chain of thought such that we can discover this). I think Daniel was expecting in the baseline proposal that you end up with a Face that does at least some censoring/mild deception due to training incentives (I think we already see a bit of this with o1).

Maybe your concern is that it is plausible that the policy won't be misaligned, but the Face will be and this will screw us over. This seems plausible and is worth worrying about (with both proposals). However, I thought that a core part of the hope is that the Face isn't smart enough to cleverly scheme (unless this scheming happens in the CoT itself). And indeed, if the Face can cleverly scheme without CoT, what is stopping the policy from doing this too?

(It might also be possible to reduce the extent to which the initial Face has control over the world. E.g., you can initialize one Face to use for the first part of training (including the proposal I was mentioning) and then train another later on to imitate the earlier Face on a subpart of the distribution you inspect with the hope that it doesn't inherit learned misalignment. You could also deploy a different Face than the one you use for training if you aren't doing online training.)

yix on College technical AI safety hackathon retrospective - Georgia Tech

Thanks again Esben for collaborating with us! I can confidently say that the above is super valuable advice for any AI safety hackathon organizer; it is consistent with our experience.

In the context of a college campus hackathon, I'd especially stress focus on preparing starter materials and making submission requirements clear early on!

nosignalnonoise on Heresies in the Shadow of the Sequences

I like what you're doing, but I feel like the heresies you propose are too tame.

Here are some more radical heresies to consider:

- Most people are far more bottlenecked on some combination of akrasia and prospective memory than on the accuracy of their models of the world. Rationalists in particular would be better off devoting effort to actually doing the obvious things than to understanding the world better.
- Self deception is very instrumentally useful in a large fraction of the real world situations we find ourselves in, and we should use more of it.
- Mormons seem to be especially good at coordinating on good lifestyle choices, so we should all consider becoming Mormon.
- Among groups of 10+ people, it's usually more useful to get everyone working on implementing the same plan than it is to come up with the best plan.
- Intelligence (of the sort measured by exams and IQ tests) is only moderately important to success.

elityre on Lao Mein's Shortform

But he helped found OpenAI, and recently founded another AI company.

I think Elon's strategy of "telling the world not to build AGI, and then going to start another AGI company himself" is much less dumb / ethically fraught than people often credit.

myles-h on What are Emotions?

> "Values" happen to be a thing possessed by thinking entities

What happens then when a non-thinking thing feels happy? Is that happiness valued? To whom? Or do you think this is impossible?

I can imagine it possible for a fetus in the womb without any thoughts, sense of self, or an ability to move, to still be capable of feeling happiness. Now try to imagine a hypothetical person with a severe mental disability preventing them having any cohesive thoughts, sense of self, or an ability to move. Could they still feel happiness? What happens when the dopamine receptors get triggered?

It is my hypothesis that the mechanism by which emotions are felt does not require a "thinking" agent. This could be false and I now see how this is an assumption which many of my arguments rely on. Thank you for catching that.

It just seems so clear to me. When I feel pain or pleasure, I don't need to "think" about it for the emotion to be felt. I just immediately feel the pain or pleasure.

Anyway, if you assume that it is possible for a non-thinker to still be a feeler, then there is nothing logically inconceivable about a hypothetical happy rock. Then if you also say that happiness is good, and that good implies value, one must ask, who or what is valuing the happiness? The rock? The universe?

Ok maybe not "the universe" as to mean the collection of all objects within the universe. I'm more trying to say "the fabric of reality". Like there must be some physical process by which happiness is valued. Maybe a dimension by which emotional value is expressed?

> I also suspect that some of the things you're calling "material terminal values" are actually better modeled as instrumental

You are partly correct about this. When I said I terminally value the making of kinetic sculptures, I was definitely making a simplification. I don't value the making of all kinetic sculptures, and I also value the making of things which aren't kinetic sculptures. I don't, however, do it because I think it is "fun". I can't formally define what the actual material terminal goal is but it is something more along the lines of, "something that is challenging, and requires a certain kind of problem solving, where the solution is beautiful in some way".

Anyway, it is often the case that the making of kinetic sculptures fits this description.

It is not true that I "simply enjoy the process of building them". Whatever the actual definition of my goal is, I don't want it because it is an instrumental goal to some emotion. This is precisely what I am defining a material terminal goal to be: any terminal goal which is not an emotion.

> I also think you're calling something universal to humans when it really isn't.

I should have clarified this better. I am not saying the intensity or valence direction of emotions is universal. I am simply saying that emotions, in general, are universally valued. Thank you for correcting me on the way masochists work. I didn't realize they were "genuinely wired differently". I just assumed they had some conflicting goal which made pain worth it. This doesn't break my argument, however. I would say that the masochist is not feeling pain at that point. They would be feeling some other emotion, for emotions are defined by the chemical and neural processes which make them happen. Similar to how my happiness and your happiness are not the same, but they are close enough to be grouped into a word. The key piece though is that regardless, as tslarm says, "emotions are accompanied by (or identical with, depending on definitions) valenced qualia". They always have some value.

> I agree that there are good reasons to value the feelings of others. I'm not sure the Ship of Theseus argument is one of them, really, but I'm also not sure I fully understood your point there.

Ahhh, yeah sorry that wasn't the clearest, I was making the point that one should value the emotions of more than just other humans. Like pigs, cats, dogs, or feely blobs.

elityre on Reformative Hypocrisy, and Paying Close Enough Attention to Selectively Reward It.

Thinking about this post for a bit shifted my view of Elon Musk somewhat. He gets flak for calling for an AI pause, and then going and starting an AGI lab, and I now think that's unfair.

I think his overall strategic takes are harmful, but I do credit him with being basically the only would-be AGI-builder who seems to me to be engaged in a reformative hypocrisy strategy. For one thing, it sounds like he went out of his way to try to get AI regulated (talking to Congress, talking to the governors), and supported SB-1047.

I think it's actually not that unreasonable to shout "Yo! This is dangerous! This should be regulated, and controlled democratically!", see that that's not happening, and then go and try to do it in a way that you think is better.

That seems like possibly an example of "follower-conditional leadership." Taking real action to shift to the better equilibrium, failing, and then going back to the dominant strategy given the inadequate equilibrium that exists.

Obviously he has different beliefs than I do, and than my culture does, about what is required for a good outcome. I think he's still causing vast harms, but I think he doesn't deserve the eye-roll for founding another AGI lab after calling for everyone to stop.

esben-kran on College technical AI safety hackathon retrospective - Georgia Tech

Super cool work Yixiong - we were impressed by your professionalism in this process despite working within another group's whims on this one. Some other observations from our side that may be relevant for other folks hosting hackathons:
- Prepare starter materials: For example, for some of our early interpretability hackathons, we built a full resource base (GitHub) with videos, Colabs, and much more (some of it with Neel Nanda, big appreciation for his efforts in making interp more available). Our philosophy for the starter materials is: "If a participant can make a submission-worthy project by, at most, cloning your repo and typing two commands, or by simply walking through a Google Colab, this is the ideal starter code." This means that with only small adjustments, they'll be able to make an original project. We rarely if ever see this exploited, i.e. "template code as submission", because they're able to copy-paste things around for a really strong research project.
- Make sure what they should submit is super clear: Making a really nice template goes a long way to make a submission super clear for participants. An example can be seen in our MASEC hackathon: Docs and page. If someone can just receive your submission template and know everything they need to know to submit a great project, that is really good since they'll be spending most of their time inside of that document.
- Make sure judging criteria are really good: People will use your judging criteria to determine what to prioritize in their project. This is extremely valuable for you to get right. For example, we usually use a variation on the three criteria: 1) Topic advancement, 2) AI safety impact, and 3) quality / reproducibility. A recent example was the Agent Security Hackathon:

> 1. Agent safety: Does the project move the field of agent safety forward? After reading this, do we know more about how to detect dangerous agents, protect against dangerous agents, or build safer agents than before?
> 2. AI safety: Does the project solve a concrete problem in AI safety? If this project is fully realized, would we expect the world with superintelligence to be safer (even marginally) than yesterday?
> 3. Methodology: Is the project well-executed and is the code available so we can review it? Do we expect the results to generalize beyond the specific case(s) presented in the submission?

- Make the resources and ideas available early: As Yixiong mentions, it's really valuable for people not to be confused. If they know exactly what report format they'll submit, which idea they'll work on, and who they'll work with, this is a great way to ensure that the 2-3 days of hacking are an incredibly efficient use of their time.
- Matching people by ideas trumps matching by background: We've tried various ways to match individuals who don't have teams. The absolute best system we've found is to get people to brainstorm before the hackathon, share their ideas, and organize teams online. We also host team matching sessions which consist of fun-fact intros and otherwise just discuss specific research ideas.
- Don't make it longer than a weekend: If you host a hackathon and make it longer than a weekend, most people who cannot attend outside that weekend will avoid participating, because they'll feel that those who can participate beyond the weekend can use their weekdays to win the grand prize. Additionally, a very counter-intuitive thing happens where if you give people three weeks, they'll actually spend much less time on it than if you just give them a weekend. This can depend on the prizes or outcome rewards, of course, but is a really predictable effect, in our experience.
- Don't make it shorter than two days: Depending on your goal, one day will never be enough to create an original project. Our aims are original pilot research papers that can stand on their own and the few one-day events we've hosted have never worked very well, except for brainstorming. Often, participants won't even have any functional code or any ideas on the Sunday morning of the event but by the submission deadline have a really high quality project that wins the top prize. This seems to happen due to this very concrete exploration of ideas that happens in the IDE and on the internet where some are discarded and nothing promising comes up before 11am Sunday.

And as Yixiong mentions, we have more resources on this along with an official chapter network (besides volunteer locations) at https://www.apartresearch.com/sprints/locations. You're welcome to get in touch if you're interested in hosting at sprints@apartresearch.com.

COI: One of our researchers hosted a cyber-evals workshop at Yixiong's AI safety track.