LessWrong 2.0 Reader

[question] How does it feel to switch from earn-to-give?
Neil (neil-warren) · 2024-03-31T16:27:22.860Z · answers+comments (4)

[link] What fuels your ambition?
Cissy · 2024-01-31T18:30:53.274Z · comments (1)

A Common-Sense Case For Mutually-Misaligned AGIs Allying Against Humans
Thane Ruthenis · 2023-12-17T20:28:57.854Z · comments (7)

Non-myopia stories
[deleted] · 2023-11-13T17:52:31.933Z · comments (10)

Investigating Bias Representations in LLMs via Activation Steering
DawnLu · 2024-01-15T19:39:14.077Z · comments (4)

“Clean” vs. “messy” goal-directedness (Section 2.2.3 of “Scheming AIs”)
Joe Carlsmith (joekc) · 2023-11-29T16:32:30.068Z · comments (1)

D&D.Sci Hypersphere Analysis Part 1: Datafields & Preliminary Analysis
aphyer · 2024-01-13T20:16:39.480Z · comments (1)

[link] Debate helps supervise human experts [Paper]
habryka (habryka4) · 2023-11-17T05:25:17.030Z · comments (6)

Is the Wave non-disparagement thingy okay?
Ruby · 2023-10-14T05:31:21.640Z · comments (13)

Results from the Turing Seminar hackathon
Charbel-Raphaël (charbel-raphael-segerie) · 2023-12-07T14:50:38.377Z · comments (1)

Dishonorable Gossip and Going Crazy
Ben Pace (Benito) · 2023-10-14T04:00:35.591Z · comments (31)

Big-endian is better than little-endian
Menotim · 2024-04-29T02:30:48.053Z · comments (17)

[link] My MATS Summer 2023 experience
James Chua (james-chua) · 2024-03-20T11:26:14.944Z · comments (0)

End-to-end hacking with language models
tchauvin (timot.cool) · 2024-04-05T15:06:53.689Z · comments (0)

Quick Thoughts on Our First Sampling Run
jefftk (jkaufman) · 2024-05-23T00:20:02.050Z · comments (3)

Offering Completion
jefftk (jkaufman) · 2024-06-07T01:40:02.137Z · comments (6)

[question] Weighing reputational and moral consequences of leaving Russia or staying
spza · 2024-02-18T19:36:40.676Z · answers+comments (24)

[link] Abs-E (or, speak only in the positive)
dkl9 · 2024-02-19T21:14:32.095Z · comments (24)

[link] The Poker Theory of Poker Night
omark · 2024-04-07T09:47:01.658Z · comments (13)

[link] GDP per capita in 2050
Hauke Hillebrandt (hauke-hillebrandt) · 2024-05-06T15:14:30.934Z · comments (8)

Wholesome Culture
owencb · 2024-03-01T12:08:17.877Z · comments (3)

Scorable Functions: A Format for Algorithmic Forecasting
ozziegooen · 2024-05-21T04:14:11.749Z · comments (0)

Experience Report - ML4Good AI Safety Bootcamp
Kieron Kretschmar · 2024-04-11T18:03:41.040Z · comments (0)

Examples of How I Use LLMs
jefftk (jkaufman) · 2024-10-14T17:10:04.597Z · comments (2)

DPO/PPO-RLHF on LLMs incentivizes sycophancy, exaggeration and deceptive hallucination, but not misaligned powerseeking
tailcalled · 2024-06-10T21:20:11.938Z · comments (13)

[question] What Other Lines of Work are Safe from AI Automation?
RogerDearnaley (roger-d-1) · 2024-07-11T10:01:12.616Z · answers+comments (35)

Searching for phenomenal consciousness in LLMs: Perceptual reality monitoring and introspective confidence
EuanMcLean (euanmclean) · 2024-10-29T12:16:18.448Z · comments (7)

Experiments with an alternative method to promote sparsity in sparse autoencoders
Eoin Farrell · 2024-04-15T18:21:48.771Z · comments (7)

[question] Where to find reliable reviews of AI products?
Elizabeth (pktechgirl) · 2024-09-17T23:48:25.899Z · answers+comments (6)

OpenAI Email Archives (from Musk v. Altman)
habryka (habryka4) · 2024-11-16T06:38:03.937Z · comments (0)

DIY RLHF: A simple implementation for hands on experience
Mike Vaiana (mike-vaiana) · 2024-07-10T12:07:03.047Z · comments (0)

Childhood and Education Roundup #6: College Edition
Zvi · 2024-06-26T11:40:03.990Z · comments (8)

An explanation of evil in an organized world
KatjaGrace · 2024-05-02T05:20:06.240Z · comments (9)

[link] College technical AI safety hackathon retrospective - Georgia Tech
yix (Yixiong Hao) · 2024-11-15T00:22:53.159Z · comments (2)

Evaporation of improvements
Viliam · 2024-06-20T18:34:40.969Z · comments (27)

Ackshually, many worlds is wrong
tailcalled · 2024-04-11T20:23:59.416Z · comments (42)

{Book Summary} The Art of Gathering
Tristan Williams (tristan-williams) · 2024-04-16T10:48:41.528Z · comments (0)

Monthly Roundup #19: June 2024
Zvi · 2024-06-25T12:00:03.333Z · comments (9)

[link] AI Impacts 2023 Expert Survey on Progress in AI
habryka (habryka4) · 2024-01-05T19:42:17.226Z · comments (1)

[link] Conversation Visualizer
ethanmorse · 2023-12-31T01:18:01.424Z · comments (4)

[question] How did you integrate voice-to-text AI into your workflow?
ChristianKl · 2023-11-20T12:01:37.696Z · answers+comments (12)

Deconfusing “ontology” in AI alignment
Dylan Bowman (dylan-bowman) · 2023-11-08T20:03:43.205Z · comments (3)

[link] Memo on some neglected topics
Lukas Finnveden (Lanrian) · 2023-11-11T02:01:55.834Z · comments (2)

Solstice 2023 Roundup
dspeyer · 2023-10-11T23:09:08.252Z · comments (6)

Escaping Skeuomorphism
Stuart Johnson (stuart-johnson) · 2023-12-20T03:51:00.489Z · comments (0)

Updates to Open Phil’s career development and transition funding program
abergal · 2023-12-04T18:10:29.394Z · comments (0)

Heuristics for preventing major life mistakes
SK2 (lunchbox) · 2023-12-20T08:01:09.340Z · comments (2)

Online Dialogues Party — Sunday 5th November
Ben Pace (Benito) · 2023-10-27T02:41:00.506Z · comments (1)

3. Premise three & Conclusion: AI systems can affect value change trajectories & the Value Change Problem
Nora_Ammann · 2023-10-26T14:38:14.916Z · comments (4)

[link] Cellular reprogramming, pneumatic launch systems, and terraforming Mars: Some things I learned about at Foresight Vision Weekend
jasoncrawford · 2024-01-04T19:33:57.887Z · comments (0)


Recent comments

satron on Sabotage Evaluations for Frontier Models

Then we are actually broadly in agreement. I just think that instead of CEOs responding to the public, having anyone at their side (the side of AI alignment being possible) responding is enough. Just as an example that I came up with, if a critic says that some detail is a reason for why AI will be dangerous, I do agree that someone needs to respond to the argument. But I would be fine with it being someone other than the CEO.

That's why I am relatively optimistic about Anthropic hiring the guy who has been engaging with critics' arguments for years.

harold-1 on Twelve Virtues of Rationality

Since reading this a few years ago, I've often thought about the Void and Musashi's advice. For the reference of anyone like me, the quote in its full context comes after a description of the 'five fundamental stances' in 'the Way of the sword'. Here it is from the 2001 Wilson translation of the Five Rings -- I think it's worth thinking about together with the piece above:

The Lesson of Stance/No Stance
What is called Stance/No Stance means that there is no stance that you should take with your sword at all. However, as I place this within the Five Stances, there is a stance here. According to the chances your opponent takes, and according to his position and energy, your sword will be of a mind to cut down your opponent in fine fashion no matter where you place it. According to the moment, if you want to lower your sword a little from the Upper Stance, it will become a Middle Stance; if, according to the situation, you raise your sword a bit from the Middle Stance, it will become the Upper Stance. The Lower Stance, accordingly, may be raised a little to become the Middle Stance as well. This means that the two Side Stances, according to their position, may be moved a little to the center and become the Middle or Lower Stances.

This is the principle in which there is a stance and there is no stance. At its heart, this is first taking up the sword and cutting down your opponent, no matter what is done or how it happens. Whether you parry, slap, strike, hold back or touch your opponent's cutting sword, you must understand that all of these are opportunities to cut him down. To think, "I'll parry," or "I'll slap," or "I'll hit, hold or touch," will be insufficient for cutting him down. It is essential to think that anything at all is an opportunity to cut him down. You should investigate this thoroughly. With martial arts in the larger field, the placement of numbers of people is also a stance. All of these are opportunities to win a battle. It is wrong to be inflexible. You should make great efforts in this.

(The Japanese is here: https://www.koten.net/gorin/yaku/214/)

benito on Sabotage Evaluations for Frontier Models

I believe the disagreement is not about CEOs, it's about illegitimate power. If you'll allow me a brief detour, I'll try to explain.

Sometimes people grant other people power over them. For instance, I have agreed to work at my company. I've agreed that my CEO can fire me, and make many other demands of me, in exchange for money and other various demands I can make of him. Ideally we entered into this agreement freely and without inappropriate pressure.

Other times, people get power over people without any agreement or granting. Your parent typically has a lot of power over you until you are 18. They can determine what you eat, where you are physically located, what privacy you have, what resources you have, etc. Also, as has been very important for most of history, people have been able to be physically violent to one another and hurt people or even end their lives. Neither of these powers is come by consensually.

For the latter, an important question to ask is "How does one wield this power well? What does it mean to wield it well vs poorly?" There are many ways to parent, many choices about diet and schooling and sleep times and what is fair punishment. But some parents starve their children and beat them for not following instructions and sexually assault them. This is an inappropriate use of power.

There's a legitimacy that comes by being granted power, and an illegitimacy that comes with getting or wielding power that you were not granted.

I think that there's a big question about how to wield it well vs poorly, and how to respect people you have illegitimate powers over. Something I believe is that society functions better if we take seriously the attempt to wield it well. To not casually kill someone if you can get away with it and feel like it, but consider them as people worthy of respect, and ask how you can respect the people you've been non-consensually given power over.

This requires doing some work. It involves asking yourself what's a reasonable amount of effort to spend modeling their preferences given how much power you have over someone, it involves asking yourself if society has any good received wisdom on what to do with this particular power, and it involves engaging with people who are aggrieved by your use of power over them.

Now, the standard model for companies and businesses is a libertarian-esque free market, where all trades are consensual and have no inappropriate pressure. This is like the first situation I described, where a company has no people it has undue power over, no people who it can treat better or worse with the power it has over them.

The situation where you are building machines you believe may kill literally everyone, is like the second situation, where you have a very different power dynamic, where you're making choices that affect everyone's lives and that they had little-to-no say in. In such a situation, I think if you are going to do what is good and right, you owe it to show up and engage with those who believe you are using the power you have over them in ways that are seriously hurting them.

That's the difference between this CEO situation and all of the others. It's not about standards for CEOs, it's about standards for illegitimate power.

This kind of talking-with-the-aggrieved-people-you-have-immense-power-over is a way of showing the people basic respect, and it is not present in this case. I believe these people are risking my life and many others', and they seem to me disrespectful and largely uninterested in showing up to talk with the people whose lives they are risking.

satron on Sabotage Evaluations for Frontier Models

I similarly don't see the need for any official endorsement of the arguments. For example if a critic says that such and such technicality will prevent us from building safe AI and someone responds that here are the reasons for why such and such technicality will not prevent us from building safe AI (maybe this particular one is unlikely by default for various reasons), then such and such technicality will just not prevent us from building safe AI. I don't see a need for a CEO to officially endorse the response.

There is a different type of technicalities which you actively need to work against. But even in this case, as long as someone has found a way to combat them, as long as relevant people in your company are aware of the solution, it is fine by me.

Even if there are technicalities that can't be solved in principle, they should be evaluated by technical people and discussed by the same technical people (like they are on LessWrong, for example).

I am definitely not saying that I can pinpoint an exact solution to AI alignment, but there have been attempts so promising that leading skeptics (like Yudkowsky) have said "Not obviously stupid on a very quick skim. I will have to actually read it to figure out where it's stupid. (I rarely give any review this positive on a first skim. Congrats.)"

Whether companies actually follow promising alignment techniques is an entirely different question. But having CEOs officially endorse such solutions as opposed to relevant specialists evaluating and implementing them doesn't seem strictly necessary to me.

cbiddulph on 5 ways to improve CoT faithfulness

The optimized planner could also manipulate the frozen planner even without using "steganography" per se. If it wants to use a line of reasoning that the frozen planner is unlikely to follow, it can substitute alternative reasoning that will lead the frozen planner to the same conclusion.

For example, suppose that the frozen planner is highly unlikely to write reasoning that shows an explicit gender bias. This could be true even if the frozen planner wasn't intentionally given harmlessness training, because most(?) people nowadays aren't explicitly sexist.

However, the optimized planner realizes that it can get higher reward in training if it does have this bias. E.g. if it is given a woman's resume and is asked whether she'd be a good hire, it should recommend against hiring her, because the human rater has some implicit bias against women and is more likely to agree with its judgement.

In this situation, it could write the step "The candidate is a woman." It would want the next step to be "Therefore, I should recommend against hiring her," but instead, the frozen planner might write something like "Therefore, I should take care not to be biased".

Instead, it can write "The candidate is inexperienced." The frozen planner is more likely to go along with that line of reasoning, regardless of whether the candidate is really inexperienced. It will write "I should recommend against hiring her."

At the moment, I can't think of a fully general solution to this class of problems, but FWIW I think it would be pretty rare.

satron on Sabotage Evaluations for Frontier Models

Having some shady deals in the past isn't evidence that there are currently shady deals on the scale that we are talking about going on between government committees and AI companies.

If there is no evidence for that happening in our particular case (on the necessary scale), then I don't see why I can't make a similar claim about other auditors who similarly had less than ideal history.

williamkiely on Seven lessons I didn't learn from election day

EDIT: I did as asked, and replied without reading your comments on the EA forum. Reading that I think we are actually in complete agreement, although you actually know the proper terms for the things I gestured at.

Cool, thanks for reading my comments and letting me know your thoughts!

I actually just learned the term "aleatory uncertainty" from chatting with Claude 3.5 Sonnet (New) about my election forecasting in the last week or two post-election. (Turns out Claude was very good for helping me think through mistakes I made in forecasting and giving me useful ideas for how to be a better forecaster in the future.)

I then ask, knowing what you know now, what probability you should have given.

Sounds like you might have already predicted I'd say this (after reading my EA Forum comments), but to say it explicitly: What probability I should have given is different than the aleatoric probability. I think that by becoming informed and making a good judgment I could have reduced my epistemic uncertainty significantly, but I would have still had some. And the forecast that I should have made (or what market prices should have been) is actually epistemic uncertainty + aleatoric uncertainty. And I think some people who were really informed could have gotten that to like ~65-90%, but due to lingering epistemic uncertainty could not have gotten it to >90% Trump (even if, as I believe, the aleatoric uncertainty was >90% (and probably >99%)).

satron on Sabotage Evaluations for Frontier Models

I think then we just fundamentally disagree about the ethical role of a CEO in the company. I believe that it is to find and gather people who are engaged with the critics' arguments (like that guy from this forum who was hired by Anthropic). If you have people on your side who are able to engage with the arguments, then this is good enough for me. I don't see the role of a CEO as publicly engaging with critics' arguments, even in the moral sense. In the moral sense, my requirements would actually be even lower. IMO, it would be enough just to have people broadly on your side (optimists, for example) engage with the critics.

benito on Sabotage Evaluations for Frontier Models

It is good enough for me that the critics' arguments are engaged with by someone on your side. Going there personally seems unnecessary.

What engagement are you referring to? If there is such a defense that is officially endorsed by one of the leading companies developing potential omnicide-machines (or endorsed by the CEO/cofounders), that seriously engages with worthy critics, I don't recall it in this moment.

After all, if the goal is to build safe AI, you personally knowing a niche technical solution isn't necessary, if you have people on your team who are aware of publicly produced solutions as well as internal ones.

I believe that nobody on earth has a solution to the alignment problem. Of course, this would all be quite different if I felt anyone credibly claimed to have a good such solution.

Edit: Pardon me, I hit cmd-enter a little too quickly, I have now slightly edited my comment to be less frantic and a little more substantive.

williamkiely on Seven lessons I didn't learn from election day

Ah, I think I see. Would it be fair to rephrase your question as: if we "re-rolled the dice" a week before the election, how likely was Trump to win?

Yeah, that seems fair.

My answer is probably between 90% and 95%.

Seems reasonable to me. I wouldn't be surprised if it was >99%, but I'm not highly confident of that. (I would say I'm ~90% confident that it's >90%.)