LessWrong 2.0 Reader

Keeping it (less than) real: Against ℶ₂ possible people or worlds
quiet_NaN · 2024-09-13T17:29:44.915Z · comments (0)

[question] Are UV-C Air purifiers so useful?
JohnBuridan · 2024-09-04T14:16:01.310Z · answers+comments (0)

Fix simple mistakes in ARC-AGI, etc.
Oleg Trott (oleg-trott) · 2024-07-09T17:46:50.364Z · comments (9)

Common Uses of "Acceptance"
Yi-Yang (yiyang) · 2024-07-26T11:18:30.719Z · comments (5)

[question] How bad would AI progress need to be for us to think general technological progress is also bad?
Jim Buhler (jim-buhler) · 2024-07-09T10:43:45.506Z · answers+comments (5)

The Other Existential Crisis
James Stephen Brown (james-brown) · 2024-09-21T01:16:38.011Z · comments (24)

AGI's Opposing Force
SimonBaars (simonbaars) · 2024-08-16T04:18:06.900Z · comments (2)

Virtue taxation
Dentosal (dentosal) · 2024-07-12T14:56:36.172Z · comments (1)

Trying to understand Hanson's Cultural Drift argument
Kemp (ethan-kemp) · 2024-07-22T20:20:32.734Z · comments (3)

Simultaneous Footbass and Footdrums II
jefftk (jkaufman) · 2024-08-11T23:50:01.982Z · comments (0)

Electric Mandola
jefftk (jkaufman) · 2024-09-21T13:40:04.772Z · comments (0)

how to rapidly assimilate new information
dhruvmethi · 2024-10-24T02:18:00.648Z · comments (3)

Derivative AT a discontinuity
Alok Singh (OldManNick) · 2024-10-24T02:48:24.573Z · comments (5)

Contra Musician Gender II
jefftk (jkaufman) · 2024-11-13T03:30:09.510Z · comments (0)

[link] Why do firms choose to be inefficient?
Nicholas D. (nicholas-d) · 2024-08-28T18:39:41.664Z · comments (4)

Will AI and Humanity Go to War?
Simon Goldstein (simon-goldstein) · 2024-10-01T06:35:22.374Z · comments (4)

[question] How tokenization influences prompting?
Boris Kashirin (boris-kashirin) · 2024-07-29T10:28:25.056Z · answers+comments (4)

An Interpretability Illusion from Population Statistics in Causal Analysis
Daniel Tan (dtch1997) · 2024-07-29T14:50:19.497Z · comments (3)

[link] Testing Genetic Engineering Detection with Spike-Ins
jefftk (jkaufman) · 2024-10-22T17:20:54.947Z · comments (0)

A Dialogue on Deceptive Alignment Risks
Rauno Arike (rauno-arike) · 2024-09-25T16:10:12.294Z · comments (0)

[link] AI Safety at the Frontier: Paper Highlights, July '24
gasteigerjo · 2024-08-05T13:00:46.028Z · comments (0)

[link] What is autonomy? Why boundaries are necessary.
Chipmonk · 2024-10-21T17:56:33.722Z · comments (1)

My covid-related beliefs and questions
Severin T. Seehrich (sts) · 2024-07-23T03:27:09.348Z · comments (0)

[link] Approval-Seeking ⇒ Playful Evaluation
Jonathan Moregård (JonathanMoregard) · 2024-08-28T21:03:51.244Z · comments (0)

[link] Can AI agents learn to be good?
Ram Rachum (ram@rachum.com) · 2024-08-29T14:20:04.336Z · comments (0)

Meta AI (FAIR) latest paper integrates system-1 and system-2 thinking into reasoning models.
happy friday (happy-friday) · 2024-10-24T16:54:15.721Z · comments (0)

The Geometric Importance of Side Payments
StrivingForLegibility · 2024-08-07T01:38:04.635Z · comments (4)

Thoughts On the Nature of Capability Elicitation via Fine-tuning
Theodore Chapman · 2024-10-15T08:39:19.909Z · comments (0)

[link] Contagious Beliefs—Simulating Political Alignment
James Stephen Brown (james-brown) · 2024-10-13T00:27:08.084Z · comments (0)

[link] Michael Streamlines on Buddhism
Chris_Leong · 2024-08-09T04:44:52.126Z · comments (0)

New UChicago Rationality Group
Noah Birnbaum (daniel-birnbaum) · 2024-11-08T21:20:34.485Z · comments (0)

HDBSCAN is Surprisingly Effective at Finding Interpretable Clusters of the SAE Decoder Matrix
Jaehyuk Lim (jason-l) · 2024-10-11T23:06:14.340Z · comments (2)

[link] Jailbreaking language models with user roleplay
loops (smitop) · 2024-09-28T23:43:10.870Z · comments (0)

Dario Amodei's "Machines of Loving Grace" sound incredibly dangerous, for Humans
Super AGI (super-agi) · 2024-10-27T05:05:13.763Z · comments (1)

Consider tabooing "I think"
Adam Zerner (adamzerner) · 2024-11-12T02:00:08.433Z · comments (2)

[question] Set Theory Multiverse vs Mathematical Truth - Philosophical Discussion
Wenitte Apiou (wenitte-apiou) · 2024-11-01T18:56:06.900Z · answers+comments (25)

[link] Cooperation and Alignment in Delegation Games: You Need Both!
Oliver Sourbut · 2024-08-03T10:16:51.716Z · comments (0)

Three main arguments that AI will save humans and one meta-argument
avturchin · 2024-10-02T11:39:08.910Z · comments (8)

Two new datasets for evaluating political sycophancy in LLMs
alma.liezenga · 2024-09-28T18:29:49.088Z · comments (0)

Foresight Vision Weekend 2024
Allison Duettmann (allison-duettmann) · 2024-10-01T21:59:55.107Z · comments (0)

[link] AI Safety Newsletter #42: Newsom Vetoes SB 1047 Plus, OpenAI’s o1, and AI Governance Summary
Corin Katzke (corin-katzke) · 2024-10-01T20:35:32.399Z · comments (0)

Steering LLMs' Behavior with Concept Activation Vectors
Ruixuan Huang (sprout_ust) · 2024-09-28T09:53:19.658Z · comments (0)

[question] Is it Legal to Maintain Turing Tests using Data Poisoning, and would it work?
Double · 2024-09-05T00:35:39.504Z · answers+comments (9)

Thinking About Propensity Evaluations
Maxime Riché (maxime-riche) · 2024-08-19T09:23:55.091Z · comments (0)

[link] [Linkpost] Automated Design of Agentic Systems
Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-08-19T23:06:06.669Z · comments (1)

[link] In-Context Learning: An Alignment Survey
alamerton · 2024-09-30T18:44:28.589Z · comments (0)

[link] Universal dimensions of visual representation
Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-08-28T10:38:58.396Z · comments (0)

[link] It's important to know when to stop: Mechanistic Exploration of Gemma 2 List Generation
Gerard Boxo (gerard-boxo) · 2024-10-14T17:04:57.010Z · comments (0)

[link] Nerdtrition: simple diets via spreadsheet abuse
dkl9 · 2024-10-27T21:45:15.117Z · comments (0)

Interpreting the effects of Jailbreak Prompts in LLMs
Harsh Raj (harsh-raj-ep-037) · 2024-09-29T19:01:10.113Z · comments (0)

Recent comments

benito on Sabotage Evaluations for Frontier Models

I'm having a hard time following this argument. To be clear, I'm saying that while certain people were in regulatory bodies in the US & UK govts, they actively had secret legal contracts to not criticize the leading industry player, else (presumably) they could be sued for damages. This is not about past shady deals; this is about current people during their current tenure.

quetzal_rainbow on D0TheMath's Shortform

The completeness theorem states that a consistent countable FO theory has a model. The compactness theorem states that an FO theory has a model iff every finite subset of it has a model. Both theorems are provable in ZFC.

Therefore:

Consistent(ZFC) <-> all finite subsets of ZFC have a model ->

not Consistent(ZFC) <-> some finite subsets of ZFC don't have a model ->

some finite subsets of ZFC + not Consistent(ZFC) don't have a model <->

not Consistent(ZFC + not Consistent(ZFC)),

proven in ZFC + not Consistent(ZFC)
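
The chain above can be written out a little more explicitly. This is only a sketch of one reading of the argument, not a formal proof; the notation $\mathrm{Con}(T)$ for "T is consistent" and $T_0 \subseteq_{\mathrm{fin}} T$ for "T_0 is a finite subset of T" is introduced here for illustration and does not appear in the comment. The block assumes the `amsmath` package.

```latex
\begin{align*}
\mathrm{ZFC} \vdash\;& \mathrm{Con}(T) \iff \forall\, T_0 \subseteq_{\mathrm{fin}} T:\ T_0 \text{ has a model}
  && \text{(completeness and compactness)} \\
& \neg\mathrm{Con}(\mathrm{ZFC}) \implies \exists\, T_0 \subseteq_{\mathrm{fin}} \mathrm{ZFC}:\ T_0 \text{ has no model} \\
& T_0 \subseteq_{\mathrm{fin}} \mathrm{ZFC} \implies T_0 \subseteq_{\mathrm{fin}} \mathrm{ZFC} + \neg\mathrm{Con}(\mathrm{ZFC}) \\
\text{hence}\quad & \mathrm{ZFC} + \neg\mathrm{Con}(\mathrm{ZFC}) \vdash \neg\mathrm{Con}\bigl(\mathrm{ZFC} + \neg\mathrm{Con}(\mathrm{ZFC})\bigr)
\end{align*}
```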

ape-in-the-coat on Quantum Immortality: A Perspective if AI Doomers are Probably Right

If my π-guess is wrong, my only chance to survive is getting all-heads.

Your other chance for survival is that whatever means are used to kill you somehow do not succeed due to quantum effects. And this is what the QI+path-based identity approach actually predicts. The universe isn't going to retroactively change the digit of pi, but neither is it going to influence the probability of the coin tosses due to the fact that someone may die. QI influence will trigger only at the moment of your death, turning it into a near-death. And then for the next attempt. And for the next one. Potentially locking you in a state of eternal torture.

However, abandoning SSSA also has a serious theoretical cost:
If observed probabilities have a hidden subjective dimension (because of path-dependency), all hell breaks loose. If we agree that probabilities of being a copy are distributed not in a state-dependent way, but in a path-dependent way, we agree that there is a 'hidden variable' in self-locating probabilities. This hidden variable does not play a role in our π experiment but appears in other thought experiments where the order of making copies is defined.

I fail to see this cost. Yes, we agree that there is an additional variable. Namely, my causal history. It's not necessarily hidden, but it may as well be. So what? What is so hellbreaking about it? This is exactly how probability theory works in every other case. Why should it have a special case for conscious experience?

If there are two bags, one with 1 red ball and another with 1000 blue balls, and a coin is tossed and, based on the outcome, I get a ball from either the first or the second bag, then I expect to receive a red ball with 50% chance. I'm not supposed to assume out of nowhere that every ball has to have an equal probability of being given, and therefore postulate a ball-shadow that modifies the fairness of the coin.
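
The two-bag example can be checked numerically. The following is a minimal sketch, assuming a fair coin and a uniform draw within the chosen bag; the function names, trial count, and seed are illustrative choices, not anything specified in the comment. It contrasts the simulated chance of a red ball with the naive "every ball is equally likely" figure of 1/1001.

```python
import random


def draw_ball(rng):
    """Toss a fair coin, then draw from the bag the coin selects."""
    if rng.random() < 0.5:
        return "red"  # bag 1 holds a single red ball
    # Bag 2 holds 1000 blue balls; whichever one is drawn, it is blue.
    rng.randrange(1000)
    return "blue"


def estimate_red_probability(trials=100_000, seed=0):
    """Monte Carlo estimate of the chance of receiving the red ball."""
    rng = random.Random(seed)
    reds = sum(draw_ball(rng) == "red" for _ in range(trials))
    return reds / trials


def naive_per_ball_probability():
    """The (mistaken) figure obtained by treating all 1001 balls as equally likely."""
    return 1 / 1001


if __name__ == "__main__":
    print("simulated P(red):", estimate_red_probability())
    print("naive per-ball P(red):", naive_per_ball_probability())
```

The simulated value stays close to one half, because the coin rather than the ball count determines which bag is used.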

benito on Sabotage Evaluations for Frontier Models

I'm sorry, I'm confused about something here, I'll back up briefly and then respond to your point.

My model is:

The vast majority of people who've seriously thought about it believe we don't know how to solve the alignment problem.
More fundamentally, there's a sense in which we "basically don't know what we're doing" with regards to AI. People talk about "agents" and "goals" and "intentions" but we're kind of like at the phlogiston theory of heat or vitalism theory of life. We don't get it. We have no equations, we have no theory, we're just like "man these systems can really write and make pretty pictures" like we used to say "I don't get it but some things are hot and some things are cold". Science was tried, found hard, engineering was tried, found easy, and now we're only doing that.
Many/most folks who've given it serious thought are pretty confident that the default outcome is doom (omnicide or permanent disempowerment), though it may be way kinda worse (e.g. eternal torture) or slightly better (e.g. we get to keep earth), due to intuitive arguments about instrumental goals and selecting on minds in the way machine learning works. (This framing is a bit local, in that not every scientist in the world would quite know what I'm referring to here.)
People are working hard and fast to build these AIs anyway because it's a profitable industry.

This literally spells the end of humanity (barring the eternal torture option or the grounded on earth option).

Back to your comment: some people are building AGI and knowingly threatening all of our lives. I propose they should show up and explain themselves.

A natural question is "Why should they talk with you Ben? You're just one of the 8 billion people whose lives they're threatening."

That is why I am further suggesting they talk with many of the great and worthy thinkers who are of the position this is clearly bad, like Hinton, Bengio, Russell, Yudkowsky, Bostrom, etc.

I am reading you say something like "But as long as someone is defending their behavior, they don't need to show up to defend it themselves."

This lands with me like we are two lowly peasants, who are talking about how the King has mistreated us by how the royal guards often beat us up and rape the women. I'm saying "I would like the King to answer for himself" and I'm hearing you say "But I know a guy in the next pub who thinks the King is making good choices with his powers. If you can argue with him, I don't see why the King needs to come down himself." I would like to have the people who are wielding the power defend themselves.

Again, this is not me proposing business norms, it's me saying "the people who are taking the action that looks like it kills us, I want those people in particular to show up and explain themselves".

remmelt-ellen on AI Safety Camp 10

Fair question. You can assume it is AoE.

Research leads are not going to be too picky in terms of what hour you send the application in.

There is no need to worry about the exact deadline. Even if you send in your application on the next day, that probably won't significantly impact your chances of getting picked up by your desired project(s).

Sooner is better, since many research leads will begin composing their teams after the 17th, but there is no hard cut-off point.

satron on Sabotage Evaluations for Frontier Models

Then we are actually broadly in agreement. I just think that instead of CEOs responding to the public, having anyone on their side (the side of AI alignment being possible) responding is enough. Just as an example that I came up with, if a critic says that some detail is a reason for why AI will be dangerous, I do agree that someone needs to respond to the argument. But I would be fine with it being someone other than the CEO.

That's why I am relatively optimistic about Anthropic hiring the guy who has been engaging with critics' arguments for years.

harold-1 on Twelve Virtues of Rationality

Since reading this a few years ago, I've often thought about the Void and Musashi's advice. For the reference of anyone like me, the quote in its full context comes after a description of the 'five fundamental stances' in 'the Way of the sword'. Here it is from the 2001 Wilson translation of the Five Rings -- I think it's worth thinking about together with the piece above:

The Lesson of Stance/No Stance
What is called Stance/No Stance means that there is no stance that you should take with your sword at all. However, as I place this within the Five Stances, there is a stance here. According to the chances your opponent takes, and according to his position and energy, your sword will be of a mind to cut down your opponent in fine fashion no matter where you place it. According to the moment, if you want to lower your sword a little from the Upper Stance, it will become a Middle Stance; if, according to the situation, you raise your sword a bit from the Middle Stance, it will become the Upper Stance. The Lower Stance, accordingly, may be raised a little to become the Middle Stance as well. This means that the two Side Stances, according to their position, may be moved a little to the center and become the Middle or Lower Stances.

This is the principle in which there is a stance and there is no stance. At its heart, this is first taking up the sword and cutting down your opponent, no matter what is done or how it happens. Whether you parry, slap, strike, hold back or touch your opponent's cutting sword, you must understand that all of these are opportunities to cut him down. To think, "I'll parry," or "I'll slap," or "I'll hit, hold or touch," will be insufficient for cutting him down. It is essential to think that anything at all is an opportunity to cut him down. You should investigate this thoroughly. With martial arts in the larger field, the placement of numbers of people is also a stance. All of these are opportunities to win a battle. It is wrong to be inflexible. You should make great efforts in this.

(The Japanese is here: https://www.koten.net/gorin/yaku/214/)

benito on Sabotage Evaluations for Frontier Models

I believe the disagreement is not about CEOs, it's about illegitimate power. If you'll allow me a brief detour, I'll try to explain.

Sometimes people grant other people power over them. For instance, I have agreed to work at my company. I've agreed that my CEO can fire me, and make many other demands of me, in exchange for money and other various demands I can make of him. Ideally we entered into this agreement freely and without inappropriate pressure.

Other times, people get power over people without any agreement or granting. Your parent typically has a lot of power over you until you are 18. They can determine what you eat, where you are physically located, what privacy you have, what resources you have, etc. Also, as has been very important for most of history, people have been able to be physically violent to one another and hurt people or even end their lives. Neither of these powers is arrived at consensually.

For the latter, an important question to ask is "How does one wield this power well? What does it mean to wield it well vs poorly?" There are many ways to parent, many choices about diet and schooling and sleep times and what counts as fair punishment. But some parents starve their children and beat them for not following instructions and sexually assault them. This is an inappropriate use of power.

There's a legitimacy that comes by being granted power, and an illegitimacy that comes with getting or wielding power that you were not granted.

I think that there's a big question about how to wield it well vs poorly, and how to respect people you have illegitimate powers over. Something I believe is that society functions better if we take seriously the attempt to wield it well. To not casually kill someone if you can get away with it and feel like it, but consider them as people worthy of respect, and ask how you can respect the people you've been non-consensually given power over.

This requires doing some work. It involves asking yourself what's a reasonable amount of effort to spend modeling their preferences given how much power you have over someone, it involves asking yourself if society has any good received wisdom on what to do with this particular power, and it involves engaging with people who are aggrieved by your use of power over them.

Now, the standard model for companies and businesses is a libertarian-esque free market, where all trades are consensual and have no inappropriate pressure. This is like the first situation I describe, where a company has no people it has undue power over, no people who it can treat better or worse with the power it has over them.

The situation where you are building machines you believe may kill literally everyone, is like the second situation, where you have a very different power dynamic, where you're making choices that affect everyone's lives and that they had little-to-no say in. In such a situation, I think if you are going to do what is good and right, you owe it to show up and engage with those who believe you are using the power you have over them in ways that are seriously hurting them.

That's the difference between this CEO situation and all of the others. It's not about standards for CEOs, it's about standards for illegitimate power.

This kind of talking-with-the-aggrieved-people-you-have-immense-power-over is a way of showing the people basic respect, and it is not present in this case. I believe these people are risking my life and many others', and they seem to me disrespectful and largely uninterested in showing up to talk with the people whose lives they are risking.

satron on Sabotage Evaluations for Frontier Models

I similarly don't see the need for any official endorsement of the arguments. For example if a critic says that such and such technicality will prevent us from building safe AI and someone responds that here are the reasons for why such and such technicality will not prevent us from building safe AI (maybe this particular one is unlikely by default for various reasons), then such and such technicality will just not prevent us from building safe AI. I don't see a need for a CEO to officially endorse the response.

There is a different type of technicality, which you actively need to work against. But even in this case, as long as someone has found a way to combat it, and as long as the relevant people in your company are aware of the solution, it is fine by me.

Even if there are technicalities that can't be solved in principle, they should be evaluated by technical people and discussed by the same technical people (like they are on LessWrong, for example).

I am definitely not saying that I can pinpoint an exact solution to AI alignment, but there have been attempts so promising that leading skeptics (like Yudkowsky) have said "Not obviously stupid on a very quick skim. I will have to actually read it to figure out where it's stupid. (I rarely give any review this positive on a first skim. Congrats.)"

Whether companies actually follow promising alignment techniques is an entirely different question. But having CEOs officially endorse such solutions as opposed to relevant specialists evaluating and implementing them doesn't seem strictly necessary to me.

cbiddulph on 5 ways to improve CoT faithfulness

The optimized planner could also manipulate the frozen planner even without using "steganography" per se. If it wants to use a line of reasoning that the frozen planner is unlikely to follow, it can substitute alternative reasoning that will lead the frozen planner to the same conclusion.

For example, suppose that the frozen planner is highly unlikely to write reasoning that shows an explicit gender bias. This could be true even if the frozen planner wasn't intentionally given harmlessness training, because most(?) people nowadays aren't explicitly sexist.

However, the optimized planner realizes that it can get higher reward in training if it does have this bias. E.g. if it is given a woman's resume and is asked whether she'd be a good hire, it should recommend against hiring her, because the human rater has some implicit bias against women and is more likely to agree with its judgement.

In this situation, it could write the step "The candidate is a woman." It would want the next step to be "Therefore, I should recommend against hiring her," but instead, the frozen planner might write something like "Therefore, I should take care not to be biased".

Instead, it can write "The candidate is inexperienced." The frozen planner is more likely to go along with that line of reasoning, regardless of whether the candidate is really inexperienced. It will write "I should recommend against hiring her."

At the moment, I can't think of a fully general solution to this class of problems, but FWIW I think it would be pretty rare.