LessWrong 2.0 Reader

I'm creating a deep dive podcast episode about the original Leverage Research - would you like to take part?
spencerg · 2024-09-22T14:03:22.164Z · comments (2)

Distinguish worst-case analysis from instrumental training-gaming
Olli Järviniemi (jarviniemi) · 2024-09-05T19:13:34.443Z · comments (0)

Glitch Token Catalog - (Almost) a Full Clear
Lao Mein (derpherpize) · 2024-09-21T12:22:16.403Z · comments (3)

LASR Labs Spring 2025 applications are open!
Erin Robertson · 2024-10-04T13:44:20.524Z · comments (0)

A New Class of Glitch Tokens - BPE Subtoken Artifacts (BSA)
Lao Mein (derpherpize) · 2024-09-20T13:13:26.181Z · comments (7)

AI Safety Camp 10
Robert Kralisch (nonmali-1) · 2024-10-26T11:08:09.887Z · comments (7)

Exploring SAE features in LLMs with definition trees and token lists
mwatkins · 2024-10-04T22:15:28.108Z · comments (5)

COT Scaling implies slower takeoff speeds
Logan Zoellner (logan-zoellner) · 2024-09-28T16:20:00.320Z · comments (56)

OODA your OODA Loop
Raemon · 2024-10-11T00:50:48.119Z · comments (3)

The murderous shortcut: a toy model of instrumental convergence
Thomas Kwa (thomas-kwa) · 2024-10-02T06:48:06.787Z · comments (0)

Is the Power Grid Sustainable?
jefftk (jkaufman) · 2024-10-26T02:30:06.612Z · comments (37)

[link] Big tech transitions are slow (with implications for AI)
jasoncrawford · 2024-10-24T14:25:06.873Z · comments (16)

Eye contact is effortless when you’re no longer emotionally blocked on it
Chipmonk · 2024-09-27T21:47:01.970Z · comments (24)

Video and transcript of presentation on Otherness and control in the age of AGI
Joe Carlsmith (joekc) · 2024-10-08T22:30:38.054Z · comments (1)

Book Review: On the Edge: The Gamblers
Zvi · 2024-09-24T11:50:06.065Z · comments (1)

Open Problems in AIXI Agent Foundations
Cole Wyeth (Amyr) · 2024-09-12T15:38:59.007Z · comments (2)

Monthly Roundup #22: September 2024
Zvi · 2024-09-17T12:20:08.297Z · comments (10)

[link] On Fables and Nuanced Charts
Niko_McCarty (niko-2) · 2024-09-08T17:09:07.503Z · comments (2)

AI #89: Trump Card
Zvi · 2024-11-07T16:30:05.684Z · comments (12)

The Cognitive Bootcamp Agreement
Raemon · 2024-10-16T23:24:05.509Z · comments (0)

[question] If I have some money, whom should I donate it to in order to reduce expected P(doom) the most?
KvmanThinking (avery-liu) · 2024-10-03T11:31:19.974Z · answers+comments (36)

Live Machinery: An Interface Design Philosophy for Wholesome AI Futures (Workshop @ EA Hotel!)
Sahil · 2024-11-01T17:24:09.957Z · comments (2)

Augmenting Statistical Models with Natural Language Parameters
jsteinhardt · 2024-09-20T18:30:10.816Z · comments (0)

ARENA4.0 Capstone: Hyperparameter tuning for MELBO + replication on Llama-3.2-1b-Instruct
25Hour (aaron-kaufman) · 2024-10-05T11:30:11.953Z · comments (2)

(Maybe) A Bag of Heuristics is All There Is & A Bag of Heuristics is All You Need
Sodium · 2024-10-03T19:11:58.032Z · comments (17)

What AI companies should do: Some rough ideas
Zach Stein-Perlman · 2024-10-21T14:00:10.412Z · comments (10)

[link] Information dark matter
Logan Kieller (logan-kieller) · 2024-10-01T15:05:41.159Z · comments (4)

Empathy/Systemizing Quotient is a poor/biased model for the autism/sex link
tailcalled · 2024-11-04T21:11:57.788Z · comments (0)

My disagreements with "AGI ruin: A List of Lethalities"
Noosphere89 (sharmake-farah) · 2024-09-15T17:22:18.367Z · comments (44)

Proveably Safe Self Driving Cars [Modulo Assumptions]
Davidmanheim · 2024-09-15T13:58:19.472Z · comments (26)

Intent alignment as a stepping-stone to value alignment
Seth Herd · 2024-11-05T20:43:24.950Z · comments (4)

[link] Concrete benefits of making predictions
Jonny Spicer (jonnyspicer) · 2024-10-17T14:23:17.613Z · comments (5)

Housing Roundup #10
Zvi · 2024-10-29T13:50:09.416Z · comments (2)

RLHF is the worst possible thing done when facing the alignment problem
tailcalled · 2024-09-19T18:56:27.676Z · comments (10)

[link] NAO Updates, Fall 2024
jefftk (jkaufman) · 2024-10-18T00:00:04.142Z · comments (2)

A path to human autonomy
Nathan Helm-Burger (nathan-helm-burger) · 2024-10-29T03:02:42.475Z · comments (12)

DunCon @Lighthaven
Duncan Sabien (Deactivated) (Duncan_Sabien) · 2024-09-29T04:56:27.205Z · comments (0)

An argument that consequentialism is incomplete
cousin_it · 2024-10-07T09:45:12.754Z · comments (27)

Resolving von Neumann-Morgenstern Inconsistent Preferences
niplav · 2024-10-22T11:45:20.915Z · comments (3)

[link] What is it like to be psychologically healthy? Podcast ft. DaystarEld
Chipmonk · 2024-10-05T19:14:04.743Z · comments (8)

[question] When is reward ever the optimization target?
Noosphere89 (sharmake-farah) · 2024-10-15T15:09:20.912Z · answers+comments (12)

The slingshot helps with learning
Wilson Wu (wilson-wu) · 2024-10-31T23:18:16.762Z · comments (0)

[question] Feedback request: what am I missing?
Nathan Helm-Burger (nathan-helm-burger) · 2024-11-02T17:38:39.625Z · answers+comments (5)

Book Review: What Even Is Gender?
Joey Marcellino · 2024-09-01T16:09:27.773Z · comments (14)

Apply to MATS 7.0!
Ryan Kidd (ryankidd44) · 2024-09-21T00:23:49.778Z · comments (0)

[link] Stone Age Herbalist's notes on ant warfare and slavery
trevor (TrevorWiesinger) · 2024-11-09T02:40:01.128Z · comments (0)

Meme Talking Points
ymeskhout · 2024-11-06T15:27:54.024Z · comments (0)

Context-dependent consequentialism
Jeremy Gillen (jeremy-gillen) · 2024-11-04T09:29:24.310Z · comments (2)

[question] What's the Deal with Logical Uncertainty?
Ape in the coat · 2024-09-16T08:11:43.588Z · answers+comments (23)

Balancing Label Quantity and Quality for Scalable Elicitation
Alex Mallen (alex-mallen) · 2024-10-24T16:49:00.939Z · comments (1)

Archive

Recent comments

philosofer123 on How to Live Well: My Philosophy of Life

Thank you for reading and commenting.

In the document, I provide arguments for each of my philosophical positions. The recommended readings expand on these arguments and provide responses to potential counterarguments.

I am neither a psychological egoist nor an ethical egoist. On page 5, I list "concern for other sentient beings" as a plausible ultimate motivational consideration. While I go on to recommend aiming for peace of mind, I note that "in situations in which one doubts that aiming for peace of mind would be in accordance with the preponderance of one’s ultimate motivational considerations, one may appeal directly to those considerations." (page 6)

gwern on johnswentworth's Shortform

Also worth noting Dustin Moskovitz was a prominent enough donor this election cycle, for Harris, to get highlighted in news coverage of her donors: https://www.washingtonexaminer.com/news/campaigns/presidential/3179215/kamala-harris-influential-megadonors/ https://www.nytimes.com/2024/10/09/us/politics/harris-billion-dollar-fundraising.html

sarahconstantin on Eli's shortform feed

I agree that more people should be starting revenue-funded/bootstrapped businesses (including ones enabled by software/technology).

The meme is that if you're starting a tech company, it's going to be a VC-funded startup. This is, I think, a meme put out by VCs themselves, including Paul Graham/YCombinator, and it conflates new software projects and businesses generally with a specific kind of business model called the "tech startup".

Not every project worth doing should be a business (some should be hobbies or donation-funded) and not every business worth doing should be a VC-funded startup (some should be bootstrapped and grow from sales revenue.)

The VC startup business model requires rapid growth and expects 30x returns over a roughly 5-10 year time horizon. That simply doesn't include every project worth doing. Some businesses are viable but are simply not likely to grow that much or that fast; some projects shouldn't be expected to be profitable at all and need philanthropic support.

I think the narrative that "tech startups are where innovation happens" is...badly incomplete, but still a hell of a lot more correct than "tech startups are net destructive".

Think about new technologies; then think about where they were developed. That process can sometimes happen end-to-end within a startup, but more often I think innovative startups are founded around IP developed while the founders were in academia; or the startup found a new use for open-source tools or tools developed within big companies. There simply isn't time to solve particularly hard technical problems if you have to get to profitability and 30x growth in 5 years. The startup format is primarily designed for finding product-market fit -- i.e. putting together existing technologies, packaging them as a "product" with a narrative about what and who it's for, and tweaking it until you find a context where people will pay for the product, and then making the whole thing bigger and bigger. You can do that in 5 years. But no, you can't do literally all of society's technological innovation within that narrow context!

(Part of the issue is that we still technically count very big tech companies as "startups" and they certainly qualify as "Silicon Valley", so if you conflate all of "tech" into one big blob it includes the kind of big engineering-heavy companies that have R&D departments with long time horizons. Is OpenAI a "tech startup"? Sure, in that it's a recently founded technology company. But it is under very different financial constraints from a YC startup.)

rain8dome9 on Resolving von Neumann-Morgenstern Inconsistent Preferences

Is this a paper? Has it been published anywhere?

benito on Lighthaven Sequences Reading Group #10 (Tuesday 11/12)

By the way, for my circle tonight, I'd like to do something a little different, involving writing at least as much as talking. If you might like to join me, please bring your laptop.

catherio on evhub's Shortform

COI: I work at Anthropic

I confirmed internally (which felt personally important for me to do) that our partnership with Palantir is still subject to the same terms outlined in the June post "Expanding Access to Claude for Government":

For example, we have crafted a set of contractual exceptions to our general Usage Policy that are carefully calibrated to enable beneficial uses by carefully selected government agencies. These allow Claude to be used for legally authorized foreign intelligence analysis, such as combating human trafficking, identifying covert influence or sabotage campaigns, and providing warning in advance of potential military activities, opening a window for diplomacy to prevent or deter them. All other restrictions in our general Usage Policy, including those concerning disinformation campaigns, the design or use of weapons, censorship, and malicious cyber operations, remain.

The contractual exceptions are explained here (very short, easy to read): https://support.anthropic.com/en/articles/9528712-exceptions-to-our-usage-policy

The core of that page is as follows, emphasis added by me:

For example, with carefully selected government entities, we may allow foreign intelligence analysis in accordance with applicable law. All other use restrictions in our Usage Policy, including those prohibiting use for disinformation campaigns, the design or use of weapons, censorship, domestic surveillance, and malicious cyber operations, remain.

This is all public (in Anthropic's up-to-date support.anthropic.com portal). Additionally it was announced when Anthropic first announced its intentions and approach around government in June.

sarahconstantin on sarahconstantin's Shortform

neutrality (notes towards a blog post): https://roamresearch.com/#/app/srcpublic/page/Ql9YwmLas

"neutrality is impossible" is sort-of-true, actually, but not a reason to give up.
- even a "neutral" college class (let's say a standard algorithms & data structures CS class) is non-neutral relative to certain beliefs
  - some people object to the structure of universities and their classes to begin with;
  - some people may object on philosophical grounds to concepts that are unquestionably "standard" within a field like computer science.
  - some people may think "apolitical" education is itself unacceptable.
    - to consider a certain set of topics "political" and not mention them in the classroom is, implicitly, to believe that it is not urgent to resolve or act on those issues (at least in a classroom context), and therefore it implies some degree of acceptance of the default state of those issues.
  - our "neutral" CS class is implicitly taking a stand on certain things and in conflict with certain conceivable views. but, there's a wide range of views, including (I think) the vast majority of the actual views of relevant parties like students and faculty, that will find nothing to object to in the class.
- we need to think about neutrality in more relative terms:
  - what rule are you using, and what things are you claiming it will be neutral between?
what is neutrality anyway and when/why do you want it?
- neutrality is a type of tactic for establishing cooperation between different entities.
  - one way (not the only way) to get all parties to cooperate willingly is to promise they will be treated equally.
  - this is most important when there is actual uncertainty about the balance of power.
    - eg the Dutch Republic was the first European polity to establish laws of religious tolerance, because it happened to be roughly evenly divided between multiple religions and needed to unite to win its independence.
- a system is neutral towards things when it treats them the same.
  - there are lots of ways to treat things the same:
    - "none of these things belong here"
      - eg no religion in "public" or "secular" spaces
        is the "public secular space" the street? no-hijab rules?
        or is it the government? no 10 Commandments in the courthouse?
    - "each of these things should get equal treatment"
      - eg Fairness Doctrine
    - "we will take no sides between these things; how they succeed or fail is up to you"
      - e.g. "marketplace of ideas", "colorblindness"
- one can always ask, about any attempt at procedural neutrality:
  - what things does it promise to be neutral between?
    - are those the right or relevant things to be neutral on?
  - to what degree, and with what certainty, does this procedure produce neutrality?
    - is it robust to being intentionally subverted?
- here and now, what kind of neutrality do we want?
  - thanks to the Internet, we can read and see all sorts of opinions from all over the world. a wider array of worldviews are plausible/relevant/worth-considering than ever before. it's harder to get "on the same page" with people because they may have come from very different informational backgrounds.
  - even tribes are fragmented. even people very similar to one another can struggle to synch up and collaborate, except in lowest-common-denominator ways that aren't very productive.
  - narrowing things down to US politics, no political tribe or ideology is anywhere close to a secure monopoly. nor are "tribes" united internally.
  - we have relied, until now, on a deep reserve of "normality" -- apolitical, even apathetic, Just The Way Things Are. In the US that means, people go to work at their jobs and get paid for it and have fun in their free time. 90's sitcom style.
    - there's still more "normality" out there than culture warriors tend to believe, but it's fragile. As soon as somebody asks "why is this the way things are?" unexamined normality vanishes.
      - to the extent that the "normal" of the recent past was functional, this is a troubling development...but in general the operation of the mind is a good thing!
      - we just have more rapid and broader idea propagation now.
        why did "open borders" and "abolish the police" and "UBI" take off recently? because these are simple ideas with intuitive appeal. some % of people will think "that makes sense, that sounds good" once they hear of them. and now, way more people are hearing those kinds of ideas.
  - when unexamined normality declines, conscious neutrality may become more important.
    - conscious neutrality for the present day needs to be aware of the wide range of what people actually believe today, and avoid the naive Panglossianism of early web 2.0.
      - many people believe things you think are "crazy".
      - "democratization" may lead to the most popular ideas being hateful, trashy, or utterly bonkers.
      - on the other hand, depending on what you're trying to get done, you may very well need to collaborate with allies, or serve populations, whose views are well outside your comfort zone.
    - neutrality has things to offer:
      - a way to build trust with people very different from yourself, without compromising your own convictions;
        "I don't agree with you on A, but you and I both value B, so I promise to do my best at B and we'll leave A out of it altogether"
      - a way to reconstruct some of the best things about our "unexamined normality" and place them on a firmer foundation so they won't disappear as soon as someone asks "why?"
a "system of the world" is the framework of your neutrality: aka it's what you're not neutral about.
- eg:
  - "melting pot" multiculturalism is neutral between cultures, but does believe that they should mostly be cosmetic forms of diversity (national costumes and ethnic foods) while more important things are "universal" and shared.
  - democratic norms are neutral about who will win, but not about whether majority vote should determine the winner.
  - scientific norms are neutral about which disputed claims will turn out to be true, but not on what sorts of processes and properties make claims credible, and not about certain well-established beliefs
- right now our system-of-the-world is weak.
  - a lot of it is literally decided by software affordances. what the app lets you do is what there is.
    - there's a lot that's healthy and praiseworthy about software companies and their culture, especially 10-20 years ago. but they were never prepared for that responsibility!
- a stronger system-of-the-world isn't dogmatism or naivety.
  - were intellectuals of the 20th, the 19th, or the 18th centuries childish because they had more explicit shared assumptions than we do? I don't think so.
    - we may no longer consider some of their frameworks to be true
    - but having a substantive framework at all clearly isn't incompatible with thinking independently, recognizing that people are flawed, or being open to changing your mind.
    - "hedgehogs" or "eternalists" are just people who consider some things definitely true.
      - it doesn't mean they came to those beliefs through "blind faith" or have never questioned them.
      - it also doesn't mean they can't recognize uncertainty about things that aren't foundational beliefs.
    - operating within a strongly-held, assumed-shared worldview can be functional for making collaborative progress, at least when that worldview isn't too incompatible with reality.
  - mathematics was "non-rigorous", by modern standards, until the early 20th century; and much of today's mathematics will be considered "non-rigorous" if machine-verified proofs ever become the norm. but people were still able to do mathematics in centuries past, most of which we still consider true.
    - the fact that you can generate a more general framework, within which the old framework was a special case; or in which the old framework was an unprincipled assumption of the world being "nicely behaved" in some sense; does not mean that the old framework was not fruitful for learning true things.
      - sometimes, taking for granted an assumption that's not literally always true (but is true mostly, more-or-less, or in the practically relevant cases) can even be more fruitful than a more radically skeptical and general view.
- an *intellectual* system-of-the-world is the framework we want to use for the "republic of letters", the sub-community of people who communicate with each other in a single conversational web and value learning and truth.
  - that community expanded with the printing press and again with the internet.
  - it is radically diverse in opinion.
  - it is not literally universal. not everybody likes to read and write; not everybody is curious or creative. a lot of the "most interesting people in the world" influence each other.
    - everybody in the old "blogosphere" was, fundamentally, the same sort of person, despite our constant arguments with each other; and not a common sort of person in the broader population; and we have turned out to be more influential than we have ever been willing to admit.
  - but I do think of it as a pretty big and growing tent, not confined to 300 geniuses or anything like that.
    - "The" conversation -- the world's symbolic information and its technological infrastructure -- is something anybody can contribute to, but of course some contribute more than others.
    - I think the right boundary to draw is around "power users" -- people who participate in that network heavily rather than occasionally.
      - e.g. not all academics are great innovators, but pretty much all of them are "power users" and "active contributors" to the world's informational web.
      - I'm definitely a power user; I expect a lot of my readers are as well.
  - what do we need to not be neutral about in this context? what belongs in an intellectual system-of-the-world?
    - another way of asking this question: about what premises are you willing to say, not just for yourself but for the whole world and for your children's children, "if you don't accept this premise then I don't care to speak to you or hear from you, forever?"
      - clearly that's a high standard!
      - I have many values differences with, say, the author of the Epic of Gilgamesh, but I still want to read it. And I want lots of other people to be able to read it! I do not want the mind that created it to be blotted out of memory.
      - that's the level of minimal shared values we're talking about here. What do we have in common with everyone who has an interest in maintaining and extending humanity's collective record of thought?
    - lack of barriers to entry is not enough.
      - the old Web 2.0 idea was "allow everyone to communicate with everyone else, with equal affordances." This is a kind of "neutrality" -- every user account starts out exactly the same, and anybody can make an account.
        I think that's still an underrated principle. "literally anybody can speak to anybody else who wants to listen" was an invention that created a lot of valuable affordances. we forget how painfully scarce information was when that wasn't true!
      - the problem is that an information system only works when a user can find the information they seek. And in many cases, what the user is seeking is true information.
      - mechanisms intended to make high quality information (reliable, accurate, credible, complete, etc) preferentially discoverable, are also necessary
        but they shouldn't just recapitulate potentially-biased gatekeeping.
        we want evaluative systems that, at least a priori, an ancient Sumerian could look at and say "yep, sounds fair", even if the Sumerian wouldn't like the "truths" that come out on top in those systems.
        we really can't be parochial here. social media companies "patched" the problem of misinformation with opaque, partisan side-taking, and they suffered for it.
        how "meta" do we have to get about determining what counts as reliable or valid? well, more meta than just picking a winning side in an ongoing political dispute, that's for sure.
        probably also more "meta" than handpicking certain sources as trustworthy, the way Wikipedia does.
- if we want to preserve and extend knowledge, the "republic of letters" needs intentional stewardship of the world's information, including serious attempts at neutrality.
  - perceived bias, of course, turns people away from information sources.
  - nostalgia for unexamined normality -- "just be neutral, y'know, like we were when I was young" -- is not a credible offer to people who have already found your nostalgic "normal" wanting.
  - rigorous neutrality tactics -- "we have so structured this system so that it is impossible for anyone to tamper with it in a biased fashion" -- are better.
    - this points towards protocols.
      - h/t Venkatesh Rao
      - think: zero-knowledge proofs, formal verification, prediction markets, mechanism design, crypto-flavored governance schemes, LLM-enabled argument mapping, AI mechanistic-interpretability and "showing its work", etc
    - getting fancy with the technology here often seems premature when the "public" doesn't even want neutrality; but I don't think it actually is.
      - people don't know they want the things that don't yet exist.
      - the people interested in developing "provably", "rigorously", "demonstrably" impartial systems are exactly the people you want to attract first, because they care the most.
      - getting it right matters.
        a poorly executed attempt either fizzles instantly; or it catches on but its underlying flaws start to make it actively harmful once it's widely culturally influential.
    - OTOH, premature disputes on technology and methods are undesirable.
      - remember there aren't very many of you/us. that is:
        pretty much everybody who wants to build rigorous neutrality, no matter why they want it or how they want to implement it, is a potential ally here.
        the simple fact of wanting to build a "better" world that doesn't yet exist is a commonality, not to be taken for granted. most people don't do this at all.
        the "softer" side, mutual support and collegiality, is especially important to people whose dreams are very far from fruition. people in this situation are unusually prone to both burnout and schism. be warm and encouraging; it helps keep dreams alive.
        also, the whole "neutrality" thing is a sham if we can't even engage with collaborators with different views and cultural styles.
        also, "there aren't very many of us" in the sense that none of these envisioned new products/tools/institutions are really off the ground yet, and the default outcome is that none of them get there.
        you are playing in a sandbox. the goal is to eventually get out of the sandbox.
        you will need to accumulate talent, ideas, resources, and vibe-momentum. right now these are scarce, or scattered; they need to be assembled.
        be realistic about influence.
        count how many people are at the conference or whatever. how many readers. how many users. how many dollars. in absolute terms it probably isn't much. don't get pretentious about a "movement", "community", or "industry" before it's shown appreciable results.
        the "adjacent possible" people to get involved aren't the general public, they're the closest people in your social/communication graph who aren't yet participating. why aren't they part of the thing? (or why don't you feel comfortable going to them?) what would you need to change to satisfy the people you actually know?
        this is a better framing than speculating about mass appeal.

nathan-helm-burger on eggsyntax's Shortform

My current top picks for general reasoning in AI discussion are:

https://arxiv.org/abs/2409.05513

https://m.youtube.com/watch?v=JTU8Ha4Jyfc

sharmake-farah on Current Attitudes Toward AI Provide Little Data Relevant to Attitudes Toward AGI

My main predictions on how the AI debate will go over the next several years, assuming that AI progress continues:

1. A large portion of the public could well be freaked out; my prediction is that 10-50% of people will want to ban AI at any cost.
2. Polarization will happen along pro/anti-AI lines, and more importantly the bipartisan consensus on AI will likely collapse into polarized camps.
3. Republicans will shift into being AI accelerationists, while Democrats will shift more into the AI safety camp.
4. Maybe the AI backlash doesn't occur, or is far weaker than people think once prices collapse for some goods, and maybe the AI unemployment factor turns out to be tolerable for the public.

I don't give the 4th scenario a high chance, but it is worth keeping in mind.

(One of my takeaways from the 2024 election results around the world is that people are fine with lots of unemployment, but hate price increases, and this might apply to AGI too.)

eggsyntax on LLMs Look Increasingly Like General Reasoners

Interesting question! Maybe it would look something like, 'In my experience, the first answer to multiple-choice questions tends to be the correct one, so I'll pick that'?

It does seem plausible on the face of it that the model couldn't provide a faithful CoT on its fine-tuned behavior. But that's my whole point: we can't always count on CoT being faithful, and so we should be cautious about relying on it for safety purposes.

But also @James Chua and others have been doing some really interesting research recently showing that LLMs are better at introspection than I would have expected (eg 'Looking Inward'), and I'm not confident that models couldn't introspect on fine-tuned behavior.