LessWrong 2.0 Reader

[link] Introducing METR's Autonomy Evaluation Resources
Megan Kinniment (megan-kinniment) · 2024-03-15T23:16:59.696Z · comments (0)

Review: Conor Moreton's "Civilization & Cooperation"
Duncan Sabien (Deactivated) (Duncan_Sabien) · 2024-05-26T19:32:43.131Z · comments (8)

Dragon Agnosticism
jefftk (jkaufman) · 2024-08-01T17:00:06.434Z · comments (60)

story-based decision-making
bhauth · 2024-02-07T02:35:27.286Z · comments (11)

[link] New report: Safety Cases for AI
joshc (joshua-clymer) · 2024-03-20T16:45:27.984Z · comments (14)

Based Beff Jezos and the Accelerationists
Zvi · 2023-12-06T16:00:08.380Z · comments (29)

AI #73: Openly Evil AI
Zvi · 2024-07-18T14:40:05.770Z · comments (20)

Public Call for Interest in Mathematical Alignment
Davidmanheim · 2023-11-22T13:22:09.558Z · comments (9)

[link] Large Language Models can Strategically Deceive their Users when Put Under Pressure.
ReaderM · 2023-11-15T16:36:04.446Z · comments (8)

Stagewise Development in Neural Networks
Jesse Hoogland (jhoogland) · 2024-03-20T19:54:06.181Z · comments (1)

Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers
hugofry · 2024-04-29T20:57:35.127Z · comments (8)

Partial value takeover without world takeover
KatjaGrace · 2024-04-05T06:20:03.961Z · comments (23)

[link] Debating with More Persuasive LLMs Leads to More Truthful Answers
Akbir Khan (akbir-khan) · 2024-02-07T21:28:10.694Z · comments (14)

Singular learning theory: exercises
Zach Furman (zfurman) · 2024-08-30T20:00:03.785Z · comments (5)

On the abolition of man
Joe Carlsmith (joekc) · 2024-01-18T18:17:06.201Z · comments (18)

[link] Executable philosophy as a failed totalizing meta-worldview
jessicata (jessica.liu.taylor) · 2024-09-04T22:50:18.294Z · comments (40)

Three Notions of "Power"
johnswentworth · 2024-10-30T06:10:08.326Z · comments (43)

Teaching CS During Take-Off
andrew carle (andrew-carle) · 2024-05-14T22:45:39.447Z · comments (13)

Covert Malicious Finetuning
Tony Wang (tw) · 2024-07-02T02:41:51.698Z · comments (4)

Bigger Livers?
sarahconstantin · 2024-11-08T21:50:09.814Z · comments (10)

[link] Re: Anthropic's suggested SB-1047 amendments
RobertM (T3t) · 2024-07-27T22:32:39.447Z · comments (13)

How well do truth probes generalise?
mishajw · 2024-02-24T14:12:19.729Z · comments (11)

[link] More Hyphenation
Arjun Panickssery (arjun-panickssery) · 2024-02-07T19:43:29.086Z · comments (19)

[link] Self-Help Corner: Loop Detection
adamShimi · 2024-10-02T08:33:23.487Z · comments (6)

[link] Detecting Genetically Engineered Viruses With Metagenomic Sequencing
jefftk (jkaufman) · 2024-06-27T14:01:34.868Z · comments (10)

I'm a bit skeptical of AlphaFold 3
Oleg Trott (oleg-trott) · 2024-06-25T00:04:41.274Z · comments (14)

Solving adversarial attacks in computer vision as a baby version of general AI alignment
Stanislav Fort (stanislavfort) · 2024-08-29T17:17:47.136Z · comments (8)

Research update: Towards a Law of Iterated Expectations for Heuristic Estimators
Eric Neyman (UnexpectedValues) · 2024-10-07T19:29:29.033Z · comments (2)

We might be missing some key feature of AI takeoff; it'll probably seem like "we could've seen this coming"
Lukas_Gloor · 2024-05-09T15:43:11.490Z · comments (36)

Natural Latents: The Concepts
johnswentworth · 2024-03-20T18:21:19.878Z · comments (18)

GPT-o1
Zvi · 2024-09-16T13:40:06.236Z · comments (34)

The Aspiring Rationalist Congregation
maia · 2024-01-10T22:52:54.298Z · comments (23)

[Intuitive self-models] 1. Preliminaries
Steven Byrnes (steve2152) · 2024-09-19T13:45:27.976Z · comments (20)

OpenAI: Helen Toner Speaks
Zvi · 2024-05-30T21:10:02.938Z · comments (8)

There is a globe in your LLM
jacob_drori (jacobcd52) · 2024-10-08T00:43:40.300Z · comments (4)

A Crisper Explanation of Simulacrum Levels
Thane Ruthenis · 2023-12-23T22:13:52.286Z · comments (13)

Apply to be a Safety Engineer at Lockheed Martin!
yanni kyriacos (yanni) · 2024-03-31T21:02:08.499Z · comments (3)

[question] What are the best arguments for/against AIs being "slightly 'nice'"?
Raemon · 2024-09-24T02:00:19.605Z · answers+comments (51)

Reflections on Less Online
Error · 2024-07-07T03:49:44.534Z · comments (15)

Addressing Feature Suppression in SAEs
Benjamin Wright (Benw8888) · 2024-02-16T18:32:51.927Z · comments (3)

Fluent, Cruxy Predictions
Raemon · 2024-07-10T18:00:06.424Z · comments (14)

[Valence series] 2. Valence & Normativity
Steven Byrnes (steve2152) · 2023-12-07T16:43:49.919Z · comments (5)

LLMs Look Increasingly Like General Reasoners
eggsyntax · 2024-11-08T23:47:28.886Z · comments (39)

[link] Anxiety vs. Depression
Sable · 2024-03-17T00:15:08.255Z · comments (35)

A simple case for extreme inner misalignment
Richard_Ngo (ricraz) · 2024-07-13T15:40:37.518Z · comments (41)

Rejecting Television
Declan Molony (declan-molony) · 2024-04-23T04:59:50.253Z · comments (10)

5 homegrown EA projects, seeking small donors
Austin Chen (austin-chen) · 2024-10-28T23:24:25.745Z · comments (4)

[link] Environmentalism in the United States Is Unusually Partisan
Jeffrey Heninger (jeffrey-heninger) · 2024-05-13T21:23:10.755Z · comments (26)

Scalable oversight as a quantitative rather than qualitative problem
Buck · 2024-07-06T17:42:41.325Z · comments (11)

Newsom Vetoes SB 1047
Zvi · 2024-10-01T12:20:06.127Z · comments (6)

Recent comments

rain8dome9 on Resolving von Neumann-Morgenstern Inconsistent Preferences

Is this a paper? Has it been published anywhere?

benito on Lighthaven Sequences Reading Group #10 (Tuesday 11/12)

By the way, for my circle tonight, I'd like to do something a little different, involving writing at least as much as talking. If you might like to join me, please bring your laptop.

catherio on evhub's Shortform

COI: I work at Anthropic

I confirmed internally (which felt personally important for me to do) that our partnership with Palantir is still subject to the same terms outlined in the June post "Expanding Access to Claude for Government":

For example, we have crafted a set of contractual exceptions to our general Usage Policy that are carefully calibrated to enable beneficial uses by carefully selected government agencies. These allow Claude to be used for legally authorized foreign intelligence analysis, such as combating human trafficking, identifying covert influence or sabotage campaigns, and providing warning in advance of potential military activities, opening a window for diplomacy to prevent or deter them. All other restrictions in our general Usage Policy, including those concerning disinformation campaigns, the design or use of weapons, censorship, and malicious cyber operations, remain.

The contractual exceptions are explained here (very short, easy to read): https://support.anthropic.com/en/articles/9528712-exceptions-to-our-usage-policy

The core of that page is as follows, emphasis added by me:

For example, with carefully selected government entities, we may allow foreign intelligence analysis in accordance with applicable law. All other use restrictions in our Usage Policy, including those prohibiting use for disinformation campaigns, the design or use of weapons, censorship, domestic surveillance, and malicious cyber operations, remain.

This is all public (in Anthropic's up-to-date support.anthropic.com portal). Additionally, it was disclosed when Anthropic first announced its intentions and approach around government in June.

sarahconstantin on sarahconstantin's Shortform

neutrality (notes towards a blog post): https://roamresearch.com/#/app/srcpublic/page/Ql9YwmLas

"neutrality is impossible" is sort-of-true, actually, but not a reason to give up.
- even a "neutral" college class (let's say a standard algorithms & data structures CS class) is non-neutral relative to certain beliefs
  - some people object to the structure of universities and their classes to begin with;
  - some people may object on philosophical grounds to concepts that are unquestionably "standard" within a field like computer science.
  - some people may think "apolitical" education is itself unacceptable.
    - to consider a certain set of topics "political" and not mention them in the classroom is, implicitly, to believe that it is not urgent to resolve or act on those issues (at least in a classroom context), and therefore it implies some degree of acceptance of the default state of those issues.
  - our "neutral" CS class is implicitly taking a stand on certain things and in conflict with certain conceivable views. but, there's a wide range of views, including (I think) the vast majority of the actual views of relevant parties like students and faculty, that will find nothing to object to in the class.
- we need to think about neutrality in more relative terms:
  - what rule are you using, and what things are you claiming it will be neutral between?
what is neutrality anyway and when/why do you want it?
- neutrality is a type of tactic for establishing cooperation between different entities.
  - one way (not the only way) to get all parties to cooperate willingly is to promise they will be treated equally.
  - this is most important when there is actual uncertainty about the balance of power.
    - eg the Dutch Republic was the first European polity to establish laws of religious tolerance, because it happened to be roughly evenly divided between multiple religions and needed to unite to win its independence.
- a system is neutral towards things when it treats them the same.
  - there are lots of ways to treat things the same:
    - "none of these things belong here"
      - eg no religion in "public" or "secular" spaces
        - is the "public secular space" the street? no-hijab rules?
        - or is it the government? no 10 Commandments in the courthouse?
    - "each of these things should get equal treatment"
      - eg Fairness Doctrine
    - "we will take no sides between these things; how they succeed or fail is up to you"
      - e.g. "marketplace of ideas", "colorblindness"
- one can always ask, about any attempt at procedural neutrality:
  - what things does it promise to be neutral between?
    - are those the right or relevant things to be neutral on?
  - to what degree, and with what certainty, does this procedure produce neutrality?
    - is it robust to being intentionally subverted?
- here and now, what kind of neutrality do we want?
  - thanks to the Internet, we can read and see all sorts of opinions from all over the world. a wider array of worldviews are plausible/relevant/worth-considering than ever before. it's harder to get "on the same page" with people because they may have come from very different informational backgrounds.
  - even tribes are fragmented. even people very similar to one another can struggle to synch up and collaborate, except in lowest-common-denominator ways that aren't very productive.
  - narrowing things down to US politics, no political tribe or ideology is anywhere close to a secure monopoly. nor are "tribes" united internally.
  - we have relied, until now, on a deep reserve of "normality" -- apolitical, even apathetic, Just The Way Things Are. In the US that means, people go to work at their jobs and get paid for it and have fun in their free time. 90's sitcom style.
    - there's still more "normality" out there than culture warriors tend to believe, but it's fragile. As soon as somebody asks "why is this the way things are?" unexamined normality vanishes.
      - to the extent that the "normal" of the recent past was functional, this is a troubling development...but in general the operation of the mind is a good thing!
      - we just have more rapid and broader idea propagation now.
        why did "open borders" and "abolish the police" and "UBI" take off recently? because these are simple ideas with intuitive appeal. some % of people will think "that makes sense, that sounds good" once they hear of them. and now, way more people are hearing those kinds of ideas.
  - when unexamined normality declines, conscious neutrality may become more important.
    - conscious neutrality for the present day needs to be aware of the wide range of what people actually believe today, and avoid the naive Panglossianism of early web 2.0.
      - many people believe things you think are "crazy".
      - "democratization" may lead to the most popular ideas being hateful, trashy, or utterly bonkers.
      - on the other hand, depending on what you're trying to get done, you may very well need to collaborate with allies, or serve populations, whose views are well outside your comfort zone.
    - neutrality has things to offer:
      - a way to build trust with people very different from yourself, without compromising your own convictions;
        "I don't agree with you on A, but you and I both value B, so I promise to do my best at B and we'll leave A out of it altogether"
      - a way to reconstruct some of the best things about our "unexamined normality" and place them on a firmer foundation so they won't disappear as soon as someone asks "why?"
a "system of the world" is the framework of your neutrality: aka it's what you're not neutral about.
- eg:
  - "melting pot" multiculturalism is neutral between cultures, but does believe that they should mostly be cosmetic forms of diversity (national costumes and ethnic foods) while more important things are "universal" and shared.
  - democratic norms are neutral about who will win, but not about whether majority vote should determine the winner.
  - scientific norms are neutral about which disputed claims will turn out to be true, but not on what sorts of processes and properties make claims credible, and not about certain well-established beliefs
- right now our system-of-the-world is weak.
  - a lot of it is literally decided by software affordances. what the app lets you do is what there is.
    - there's a lot that's healthy and praiseworthy about software companies and their culture, especially 10-20 years ago. but they were never prepared for that responsibility!
- a stronger system-of-the-world isn't dogmatism or naivety.
  - were intellectuals of the 20th, the 19th, or the 18th centuries childish because they had more explicit shared assumptions than we do? I don't think so.
    - we may no longer consider some of their frameworks to be true
    - but having a substantive framework at all clearly isn't incompatible with thinking independently, recognizing that people are flawed, or being open to changing your mind.
    - "hedgehogs" or "eternalists" are just people who consider some things definitely true.
      - it doesn't mean they came to those beliefs through "blind faith" or have never questioned them.
      - it also doesn't mean they can't recognize uncertainty about things that aren't foundational beliefs.
    - operating within a strongly-held, assumed-shared worldview can be functional for making collaborative progress, at least when that worldview isn't too incompatible with reality.
  - mathematics was "non-rigorous", by modern standards, until the early 20th century; and much of today's mathematics will be considered "non-rigorous" if machine-verified proofs ever become the norm. but people were still able to do mathematics in centuries past, most of which we still consider true.
    - the fact that you can generate a more general framework, within which the old framework was a special case; or in which the old framework was an unprincipled assumption of the world being "nicely behaved" in some sense; does not mean that the old framework was not fruitful for learning true things.
      - sometimes, taking for granted an assumption that's not literally always true (but is true mostly, more-or-less, or in the practically relevant cases) can even be more fruitful than a more radically skeptical and general view.
- an *intellectual* system-of-the-world is the framework we want to use for the "republic of letters", the sub-community of people who communicate with each other in a single conversational web and value learning and truth.
  - that community expanded with the printing press and again with the internet.
  - it is radically diverse in opinion.
  - it is not literally universal. not everybody likes to read and write; not everybody is curious or creative. a lot of the "most interesting people in the world" influence each other.
    - everybody in the old "blogosphere" was, fundamentally, the same sort of person, despite our constant arguments with each other; and not a common sort of person in the broader population; and we have turned out to be more influential than we have ever been willing to admit.
  - but I do think of it as a pretty big and growing tent, not confined to 300 geniuses or anything like that.
    - "The" conversation -- the world's symbolic information and its technological infrastructure -- is something anybody can contribute to, but of course some contribute more than others.
    - I think the right boundary to draw is around "power users" -- people who participate in that network heavily rather than occasionally.
      - e.g. not all academics are great innovators, but pretty much all of them are "power users" and "active contributors" to the world's informational web.
      - I'm definitely a power user; I expect a lot of my readers are as well.
  - what do we need to not be neutral about in this context? what belongs in an intellectual system-of-the-world?
    - another way of asking this question: about what premises are you willing to say, not just for yourself but for the whole world and for your children's children, "if you don't accept this premise then I don't care to speak to you or hear from you, forever?"
      - clearly that's a high standard!
      - I have many values differences with, say, the author of the Epic of Gilgamesh, but I still want to read it. And I want lots of other people to be able to read it! I do not want the mind that created it to be blotted out of memory.
      - that's the level of minimal shared values we're talking about here. What do we have in common with everyone who has an interest in maintaining and extending humanity's collective record of thought?
    - lack of barriers to entry is not enough.
      - the old Web 2.0 idea was "allow everyone to communicate with everyone else, with equal affordances." This is a kind of "neutrality" -- every user account starts out exactly the same, and anybody can make an account.
        I think that's still an underrated principle. "literally anybody can speak to anybody else who wants to listen" was an invention that created a lot of valuable affordances. we forget how painfully scarce information was when that wasn't true!
      - the problem is that an information system only works when a user can find the information they seek. And in many cases, what the user is seeking is true information.
      - mechanisms intended to make high-quality information (reliable, accurate, credible, complete, etc.) preferentially discoverable are also necessary
        but they shouldn't just recapitulate potentially-biased gatekeeping.
        we want evaluative systems that, at least a priori, an ancient Sumerian could look at and say "yep, sounds fair", even if the Sumerian wouldn't like the "truths" that come out on top in those systems.
        we really can't be parochial here. social media companies "patched" the problem of misinformation with opaque, partisan side-taking, and they suffered for it.
        how "meta" do we have to get about determining what counts as reliable or valid? well, more meta than just picking a winning side in an ongoing political dispute, that's for sure.
        probably also more "meta" than handpicking certain sources as trustworthy, the way Wikipedia does.
- if we want to preserve and extend knowledge, the "republic of letters" needs intentional stewardship of the world's information, including serious attempts at neutrality.
  - perceived bias, of course, turns people away from information sources.
  - nostalgia for unexamined normality -- "just be neutral, y'know, like we were when I was young" -- is not a credible offer to people who have already found your nostalgic "normal" wanting.
  - rigorous neutrality tactics -- "we have so structured this system so that it is impossible for anyone to tamper with it in a biased fashion" -- are better.
    - this points towards protocols.
      - h/t Venkatesh Rao
      - think: zero-knowledge proofs, formal verification, prediction markets, mechanism design, crypto-flavored governance schemes, LLM-enabled argument mapping, AI mechanistic-interpretability and "showing its work", etc
    - getting fancy with the technology here often seems premature when the "public" doesn't even want neutrality; but I don't think it actually is.
      - people don't know they want the things that don't yet exist.
      - the people interested in developing "provably", "rigorously", "demonstrably" impartial systems are exactly the people you want to attract first, because they care the most.
      - getting it right matters.
        a poorly executed attempt either fizzles instantly; or it catches on but its underlying flaws start to make it actively harmful once it's widely culturally influential.
    - OTOH, premature disputes on technology and methods are undesirable.
      - remember there aren't very many of you/us. that is:
        pretty much everybody who wants to build rigorous neutrality, no matter why they want it or how they want to implement it, is a potential ally here.
        the simple fact of wanting to build a "better" world that doesn't yet exist is a commonality, not to be taken for granted. most people don't do this at all.
        the "softer" side, mutual support and collegiality, are especially important to people whose dreams are very far from fruition. people in this situation are unusually prone to both burnout and schism. be warm and encouraging; it helps keep dreams alive.
        also, the whole "neutrality" thing is a sham if we can't even engage with collaborators with different views and cultural styles.
        also, "there aren't very many of us" in the sense that none of these envisioned new products/tools/institutions are really off the ground yet, and the default outcome is that none of them get there.
        you are playing in a sandbox. the goal is to eventually get out of the sandbox.
        you will need to accumulate talent, ideas, resources, and vibe-momentum. right now these are scarce, or scattered; they need to be assembled.
        be realistic about influence.
        count how many people are at the conference or whatever. how many readers. how many users. how many dollars. in absolute terms it probably isn't much. don't get pretentious about a "movement", "community", or "industry" before it's shown appreciable results.
        the "adjacent possible" people to get involved aren't the general public, they're the closest people in your social/communication graph who aren't yet participating. why aren't they part of the thing? (or why don't you feel comfortable going to them?) what would you need to change to satisfy the people you actually know?
        this is a better framing than speculating about mass appeal.

nathan-helm-burger on eggsyntax's Shortform

My current top picks for general reasoning in AI discussion are:

https://arxiv.org/abs/2409.05513

https://m.youtube.com/watch?v=JTU8Ha4Jyfc

sharmake-farah on Current Attitudes Toward AI Provide Little Data Relevant to Attitudes Toward AGI

My main predictions on how the AI debate will go over the next several years, assuming that AI progress continues:

1. A large portion of the public could well be freaked out; my prediction is that 10-50% of people will want to ban AI at any cost.
2. Polarization will happen along pro/anti-AI lines, and more importantly the bipartisan consensus on AI will likely collapse into polarized camps.
3. Republicans will shift into being AI accelerationists, while Democrats will shift more into the AI safety camp.
4. Maybe the AI backlash doesn't occur, or is far weaker than people think once prices collapse for some goods, and maybe the AI unemployment factor turns out to be tolerable for the public.

I don't give the 4th scenario a high chance, but it is worth keeping in mind.

(One of my takeaways from the 2024 election results around the world is that people are fine with lots of unemployment, but hate price increases, and this might apply to AGI too.)

eggsyntax on LLMs Look Increasingly Like General Reasoners

Interesting question! Maybe it would look something like, 'In my experience, the first answer to multiple-choice questions tends to be the correct one, so I'll pick that'?

It does seem plausible on the face of it that the model couldn't provide a faithful CoT on its fine-tuned behavior. But that's my whole point: we can't always count on CoT being faithful, and so we should be cautious about relying on it for safety purposes.

But also @James Chua and others have been doing some really interesting research recently showing that LLMs are better at introspection than I would have expected (eg 'Looking Inward'), and I'm not confident that models couldn't introspect on fine-tuned behavior.

shankar-sivarajan on The Online Sports Gambling Experiment Has Failed

Without looking it up, I'd bet there are plenty of people who get added to this list by mistake, and can't get themselves removed, like the people who got put on the US's no-fly list, or get declared dead.

tailcalled on Blood Is Thicker Than Water 🐬

I had been playing on and off with the idea that an ecological argument would show dolphins to be ultimately fish-like, but with my switch in general approach to things, I think ultimately the "dolphins are not fish" side wins out. Some of the most noteworthy characteristics of dolphins are that they are large, intelligent, social animals which provide parental care for extensive periods of time. There are literally 0 fish species with this combination of traits, whereas the combo obviously screams "mammal!".

I was wondering if it was specific to mammals or if it applied to land vertebrates in general. Other animals that have evolved to live in the ocean include sea turtles and sea snakes, but they are quite land-animal-like to me. Frogs are an interesting edge-case in that at least they have marine-like offspring counts, but they literally have legs so obviously they don't count.

johnswentworth on johnswentworth's Shortform

I am posting this now mostly because I've heard it from multiple sources. I don't know to what extent those sources are themselves correlated (i.e. whether or not the rumor started from one person).