LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

[link] Can a Bayesian Oracle Prevent Harm from an Agent? (Bengio et al. 2024)
mattmacdermott · 2024-09-01T07:46:26.647Z · comments (0)

Distillation of 'Do language models plan for future tokens'
TheManxLoiner · 2024-06-27T20:57:34.351Z · comments (2)

The causal backbone conjecture
tailcalled · 2024-08-17T18:50:14.577Z · comments (0)

[question] Thoughts on Francois Chollet's belief that LLMs are far away from AGI?
O O (o-o) · 2024-06-14T06:32:48.170Z · answers+comments (17)

SAE features for refusal and sycophancy steering vectors
neverix · 2024-10-12T14:54:48.022Z · comments (4)

Sleeping on Stage
jefftk (jkaufman) · 2024-10-22T00:50:07.994Z · comments (3)

[link] my favourite Scott Sumner blog posts
DMMF · 2024-06-11T14:40:43.093Z · comments (0)

Do Sparse Autoencoders (SAEs) transfer across base and finetuned language models?
Taras Kutsyk · 2024-09-29T19:37:30.465Z · comments (8)

SAEs you can See: Applying Sparse Autoencoders to Clustering
Robert_AIZI · 2024-10-28T14:48:16.744Z · comments (0)

Thinking in 2D
sarahconstantin · 2024-10-20T19:30:05.842Z · comments (0)

LessWrong email subscriptions?
Raemon · 2024-08-27T21:59:56.855Z · comments (6)

[link] Care Doesn't Scale
stavros · 2024-10-28T11:57:38.742Z · comments (1)

Talk: AI safety fieldbuilding at MATS
Ryan Kidd (ryankidd44) · 2024-06-23T23:06:37.623Z · comments (2)

Inferential Game: The Foraging (Ex-)Bandit
abstractapplic · 2024-11-11T16:59:42.058Z · comments (4)

Optimizing Repeated Correlations
SatvikBeri · 2024-08-01T17:33:23.823Z · comments (1)

[question] How are you preparing for the possibility of an AI bust?
Nate Showell · 2024-06-23T19:13:45.247Z · answers+comments (16)

Links and brief musings for June
Kaj_Sotala · 2024-07-06T10:10:03.344Z · comments (0)

Standard SAEs Might Be Incoherent: A Choosing Problem & A “Concise” Solution
Kola Ayonrinde (kola-ayonrinde) · 2024-10-30T22:50:45.642Z · comments (0)

[link] Emotional issues often have an immediate payoff
Chipmonk · 2024-06-10T23:39:40.697Z · comments (2)

Just because an LLM said it doesn't mean it's true: an illustrative example
dirk (abandon) · 2024-08-21T21:05:59.691Z · comments (12)

[link] Positive visions for AI
L Rudolf L (LRudL) · 2024-07-23T20:15:26.064Z · comments (4)

[link] A brief history of the automated corporation
owencb · 2024-11-04T14:35:04.906Z · comments (1)

An experiment on hidden cognition
Olli Järviniemi (jarviniemi) · 2024-07-22T03:26:05.564Z · comments (2)

Using an LLM perplexity filter to detect weight exfiltration
Adam Karvonen (karvonenadam) · 2024-07-21T18:18:05.612Z · comments (11)

A suite of Vision Sparse Autoencoders
Louka Ewington-Pitsos (louka-ewington-pitsos) · 2024-10-27T04:05:20.377Z · comments (0)

[link] Conventional footnotes considered harmful
dkl9 · 2024-10-01T14:54:01.732Z · comments (16)

[link] A primer on the next generation of antibodies
Abhishaike Mahajan (abhishaike-mahajan) · 2024-09-01T22:37:59.207Z · comments (0)

Housing Roundup #9: Restricting Supply
Zvi · 2024-07-17T12:50:05.321Z · comments (8)

[link] what becoming more secure did for me
Chipmonk · 2024-08-22T17:44:48.525Z · comments (5)

[link] Announcing Open Philanthropy's AI governance and policy RFP
Julian Hazell (julian-hazell) · 2024-07-17T02:02:39.933Z · comments (0)

The Wisdom of Living for 200 Years
Martin Sustrik (sustrik) · 2024-06-28T04:44:10.609Z · comments (3)

[link] MIRI's July 2024 newsletter
Harlan · 2024-07-15T21:28:17.343Z · comments (2)

[link] Beware the science fiction bias in predictions of the future
Nikita Sokolsky (nikita-sokolsky) · 2024-08-19T05:32:47.372Z · comments (20)

You're Playing a Rough Game
jefftk (jkaufman) · 2024-10-17T19:20:06.251Z · comments (2)

[question] When engaging with a large amount of resources during a literature review, how do you prevent yourself from becoming overwhelmed?
corruptedCatapillar · 2024-11-01T07:29:49.262Z · answers+comments (2)

[link] An Intuitive Explanation of Sparse Autoencoders for Mechanistic Interpretability of LLMs
Adam Karvonen (karvonenadam) · 2024-06-25T15:57:16.872Z · comments (0)

A Triple Decker for Elfland
jefftk (jkaufman) · 2024-10-11T01:50:02.332Z · comments (0)

[link] Death notes - 7 thoughts on death
Nathan Young · 2024-10-28T15:01:13.532Z · comments (1)

[link] Fictional parasites very different from our own
Abhishaike Mahajan (abhishaike-mahajan) · 2024-09-08T14:59:39.080Z · comments (0)

The case for more Alignment Target Analysis (ATA)
Chi Nguyen · 2024-09-20T01:14:41.411Z · comments (13)

Trying to be rational for the wrong reasons
Viliam · 2024-08-20T16:18:06.385Z · comments (8)

[link] UK AISI: Early lessons from evaluating frontier AI systems
Zach Stein-Perlman · 2024-10-25T19:00:21.689Z · comments (0)

Proving the Geometric Utilitarian Theorem
StrivingForLegibility · 2024-08-07T01:39:10.920Z · comments (0)

Abstractions are not Natural
Alfred Harwood · 2024-11-04T11:10:09.023Z · comments (21)

[link] Introduction to Super Powers (for kids!)
Shoshannah Tekofsky (DarkSym) · 2024-09-20T17:17:27.070Z · comments (0)

[question] When can I be numerate?
FinalFormal2 · 2024-09-12T04:05:27.710Z · answers+comments (3)

How to put California and Texas on the campaign trail!
Yair Halberstadt (yair-halberstadt) · 2024-11-06T06:08:25.673Z · comments (4)

The new ruling philosophy regarding AI
Mitchell_Porter · 2024-11-11T13:28:24.476Z · comments (0)

Can Generalized Adversarial Testing Enable More Rigorous LLM Safety Evals?
scasper · 2024-07-30T14:57:06.807Z · comments (0)

AXRP Episode 36 - Adam Shai and Paul Riechers on Computational Mechanics
DanielFilan · 2024-09-29T05:50:02.531Z · comments (0)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

dagon on A Theory of Equilibrium in the Offense-Defense Balance

I think this is the right way to think of most anti-inductive (planner-adversarial or competitive exploitation) situations. Where there are multiple dimensions of assymetric capabilities, any change is likely to shift the equilibrium, but not necessarily by as much as the shift in component.

That said, tipping points are real, and sometimes a component shift can have a BIGGER effect, because it shifts the search to a new local minimum. In most cases, this is not actully entirely due to that component change, but the discovery and reconfiguration is triggered by it. The rise of mass shootings in the US is an example - there are a lot of causes, but the shift happened quite quickly.

Offense-defense is further confused as an example, because there are at least two different equilibria involved. when you say

The offense-defense balance is a concept that compares how easy it is to protect vs conquer or destroy resources.

Conquer control vs retain control is a different thing than destroy vs preserve. Frank Herbert claimed (via fiction) that "The people who can destroy a thing, they control it." but it's actually true in very few cases. The equilibrium of who gets what share of the value from something can shift very separately from the equilibrium of how much total value that thing provides.

sarahconstantin on sarahconstantin's Shortform

links 11/15/2024: https://roamresearch.com/#/app/srcpublic/page/11-15-2024

https://www.reddit.com/r/self/comments/1gleyhg/people_like_me_are_the_reason_trump_won/ a moderate/swing-voter (Obama, Trump, Biden) explains why he voted for Trump this time around:
- he thinks Kamala Harris was an "empty shell" and unlikable and he felt the campaign was manipulative and deceptive.
- he didn't like that she seemed to be a "DEI hire", but doesn't have a problem with black or female candidates generally, it's just that he resents cynical demographic box-checking.
  - this is a coherent POV -- he did vote for Obama, after all. and plenty of people are like "I want the best person regardless of demographics, not a person chosen for their demographics."
    - hm. why doesn't it seem natural to portray Obama as a "DEI hire"? his campaign made a bigger deal about race than Harris's, and he was criticized a lot for inexperience.
      - One guess: it's laughable to think Obama was chosen by anyone besides himself. He was not the Democratic Party's anointed -- that was Hillary. He's clearly an ambitious guy who wanted to be president on his own initiative and beat the odds to get the nomination. He can't be a "DEI hire" because he wasn't a hire at all.
      - another guess: Obama is clearly smart, speaks/writes in complete sentences, and welcomes lots of media attention and talks about his policies, while Harris has a tendency towards word salad, interviews poorly, avoids discussing issues, etc.
      - another guess: everyone seems to reject the idea that people prefer male to female candidates, but I'm still really not sure there isn't a gender effect! This is very vibes-based on my part, and apparently the data goes the other way, so very uncertain here.
https://trevorklee.substack.com/p/if-langurs-can-drink-seawater-can Trevor Klee on adaptations for drinking seawater

habryka4 on Seven lessons I didn't learn from election day

This was a really good analysis of a bunch of election stuff that I hadn't seen presented clearly like this anywhere else. If it wasn't about elections and news I would curate it.

maxwell-peterson on Seven lessons I didn't learn from election day

A good post, of interest to all across the political spectrum, marred by the mistake at the end to become explicitly politically opinionated and say bad things about those who voted differently than OP.

sharmake-farah on Seven lessons I didn't learn from election day

The one thing I'll say on the election is that a lot of people are using Kamala Harris's loss to put in their own reasons for why Kamala Harris lost that are essentially ideological propaganda.

Basically only the story that she was doomed from the start because of global backlash against incumbents for inflation matches the evidence best, and a lot of other theories are very much there for ideological purposes.

ben-lang on Seven lessons I didn't learn from election day

I think this question is maybe logically flawed.

Say I have a shuffled deck of cards. You say the probability that the top card is the Ace of Spades is 1/52. I show you the top card, it is the 5 of diamonds. I then ask, knowing what you know now, what probability you should have given.

I picked a card analogy, and you picked a dice one. I think the card one is better in this case, for weird idiosyncratic reasons I give below that might just be irrelevant to the train of thought you are on.

Cards vs Dice: If we could reset the whole planet to its exact state 1 week before the election then we would I think get the same result (I don't think quantum will mess with us in one week). What if we do a coarser grained reset? So if there was a kettle of water at 90 degrees a week before the election that kettle is reset to contain the same volume of water in the same part of my kitchen, and the water is still 90 degrees, but the individual water molecules have different momenta. For some value of "macro" the world is reset to the same macrostate but not the same microstate, it had 1 week before election day. If we imagine this experiment I still think Trump wins every (or almost every) time, given what we know now. For me to think this kind of thermal-level randomness made a difference in one week it would have to have been much closer.

In my head things that change on the coarse-grained reset feel more like unrolled dice, and things that don't more like facedown cards. Although in detail the distinction is fuzzy: it is based on an arbitrary line between micro an macro, and it is time sensitive, because cards that are going to be shuffled in the future are in the same category as dice.

EDIT: I did as asked, and replied without reading your comments on the EA forum. Reading that I think we are actually in complete agreement, although you actually know the proper terms for the things I gestured at.

abramdemski on o1 is a bad idea

Thanks for this response! I agree with the argument. I'm not sure what it would take to ensure CoT faithfulness, but I agree that it is an important direction to try and take things; perhaps even the most promising direction for near-term frontier-lab safety research (given the incentives pushing those labs in o1-ish directions).

mondsemmel on Lao Mein's Shortform

That might very well help, yes. However, two thoughts, neither at all well thought out:

If the Trump administration does fight OpenAI, let's hope Altman doesn't manage to judo flip the situation like he did with the OpenAI board saga, and somehow magically end up replacing Musk or Trump in the upcoming administration...
Musk's own track record on AI x-risk is not great. I guess he did endorse California's SB 1047, so that's better than OpenAI's current position. But he helped found OpenAI, and recently founded another AI company. There's a scenario where we just trade extinction risk from Altman's OpenAI for extinction risk from Musk's xAI.

notfnofn on D0TheMath's Shortform

For example, if you ask mathematicians whether ZFC + not Consistent(ZFC) is consistent, they will say "no, of course not!"

Certainly not a mathematician with any background in logic.

Similarly, if we have the Peano axioms without induction, mathematicians will say that induction should be there, but in fact you cannot prove this fact from within Peano

What exactly do you mean here? That the Peano axioms minus induction do not adequately characterize the natural numbers because they have nonstandard models? Why would I then be surprised that induction (which does characterize the natural numbers) can't be proven from the remaining axioms?

and given induction mathematicians will say transfinite induction should be there.

Transfinite induction is a consequence of ZF that makes sense in the context in sets. Yes, it can prove additional statements about the natural numbers (e.g. goodstein sequences converge), but why would it be added as an axiom when the natural numbers are already characterized up to isomorphism by the Peano axioms? How would you even add it as an axiom in the language of natural numbers? (that last question is non-rhetorical).

abramdemski on Complete Feedback

How about “purely epistemic” means “updated by self-supervised learning”, i.e. the updates (gradients, trader bankrolls, whatever) are derived from “things being true vs false” as opposed to “things being good vs bad”. Right?

You can define it that way, but then I don't think it's highly relevant for this context.

The story I'm telling here is that partial feedback (typically: learning some sort of input-output relation via some sort of latents) always leaves us with undesired hypotheses which we can't rule out using the restricted feedback mechanism.

Reinforcement Learning cannot rule out the wireheading hypothesis or human-manipulation hypothesis.
Updating on things being true or false cannot rule out agentic hypotheses (the inner optimizer problem).

Any sufficiently rich hypotheses space has agentic policies, which can't be ruled out by the feedback. "Purely epistemic" in your sense filters for hypotheses which make good predictions, but this doesn't constrain things to be non-agentic [LW · GW]. The system can learn to use predictions as actions [LW · GW] in some way.

[I learned the term teleosemantics [LW · GW] from you! :) ]

I think it would be fair to define a teleosemantic notion of "purely epistemic" as something like "there is no optimization (anywhere in the system -- 'inner' or 'outer') except optimization for epistemic accuracy".

The obvious application of my main point is that some form of "complete feedback" is a necessary (but insufficient) condition for this.

"Epistemic accuracy" here has to be defined in such a way as to capture the one-way "direction-of-fit" optimization of the map to fit the territory, but never the territory to fit the map. IE the optimization algorithm has to ignore the causal impact of its predictions.

However, I don't particularly endorse this as the correct design choice -- although a system with this property would be relatively safe in the sense of eliminating inner-alignment concerns and (in a sense) outer-alignment concerns, it is doing so by ignoring its impact on the world, which creates its own set of dangers. If such a system were widely deployed and became highly trusted for its predictions, it could stumble into bad self-fulfilling prophecies.

So, in my view, "epistemic" systems should be as transparent as possible with human users about possible multiple-fixed-point issues, try to keep humans in the loop and give the important decisions to humans; but ultimately, we need to view even "purely epistemic" systems as making some important (instrumental) decisions, and have them take some responsibility for making those decisions well instead of poorly.

The original LI paper was in that category, IIUC. The updates (to which traders had more vs less money) are derived from mathematical propositions being true vs false.

I was going to remind you that the paper didn't say how the fixed-point selection works, and we can do that part in an agentic way, [LW · GW] but then you go on to say basically the same thing (with the caveat that where you say "put the LI into a larger system that follows the rule: whatever the expectations are about the AI's own actions, make that actually happen" I would say the more general "put the LI into an environment which somehow reacts to its predictions"):

LI defines a notion of logically uncertain variable, which can be used to represent desires
I would say that they don’t really represent desires. They represent expectations about what’s going to happen, possibly including expectations about an AI’s own actions.
And then you can then put the LI into a larger system that follows the rule: whatever the expectations are about the AI’s own actions, make that actually happen.
The important thing that changes in this situation is that the convergence of the algorithm is underdetermined—you can have multiple fixed points. I can expect to stand up, and then I stand up, and my expectation was validated. No update. I can expect to stay seated, and then I stay seated, and my expectation was validated. No update.
(I don’t think I’m saying anything you don’t already know well.)
Anyway, if you do that, then I guess you could say that the LI’s expectations “can be used” to represent desires … but I maintain that that’s a somewhat confused and unproductive way to think about what’s going on. If I intervene to change the LI variable, it would be analogous to changing habits (what do I expect myself to do ≈ which action plans seem most salient and natural), not analogous to changing desires.
(I think the human brain has a system vaguely like LI, and that it resolves the underdetermination by a separate valence [LW · GW] system, which evaluates expectations as being good vs bad, and applies reinforcement learning to systematically seek out the good ones.)

I don't understand what you're trying to accomplish in these paragraphs. To me you sound sorta like Bob in the following:

Alice: Here's my computer model of an agent.
Bob: Uh oh, that sounds sort of like active inference. How did you represent values? Did you confuse them with beliefs?
Alice: I used floating-point numbers to represent the expected value of a state. Here, look at my code. It's a Q-learning algorithm.
Bob: You realize that "expected value" is a statistics thing, right? That makes it epistemic, not really value-laden in the sense that makes something agentic. It's a prediction of what a number will be. Indeed, we can justify expected values as min-quadratic-loss estimates. That makes them epistemic!
Alice: Well, I agree that expected values aren't automatically "values" in the agentic sense, but look, my code can solve mazes and stuff -- like a rat learning to get cheese.
Bob: Of course I agree that it can be used in an instrumental way, but that's a really misleading way to describe it overall, right? If you changed one Q-value estimated by the system, that would be analogous to changing habits, not desires, right?
Alice: um??? If we agree that it can be used in an instrumental way, then what are you saying is misleading?
Bob: I mean, sure, the human brain does something like this.
Alice: Ok??

It seems possible that you think we have some disagreement that we don't have?

Also, it is easy for end users to build agentlike things out of belieflike things by making queries about how to accomplish things. Thus, we need to train epistemic systems to be responsible about how such queries are answered (as is already apparent in existing chatbots).
I’m not sure that this is coming from a coherent threat model (or else I don’t follow).
If Dr. Evil trains his own AGI, then this whole thing is moot, because he wants the AGI to have accurate beliefs about bioweapons.
If Benevolent Bob trains the AGI and gives API access to Dr. Evil, then Bob can design the AGI to (1) have accurate beliefs about bioweapons, and (2) not answer Dr. Evil’s questions about bioweapons. That might ideally look like what we’re used to in the human world: the AGI says things because it wants to say those things, all things considered, and it doesn’t want Dr. Evil to build bioweapons, either directly or because it’s guessing what Bob would want.

I'm not clear on what you're trying to disagree with here. It sounds like we both agree that if Benevolent Bob builds a powerful "purely epistemic system" (by whatever definition), without limiting its knowledge, then Dr. Evil can misuse it; and we both agree that as a consequence of this, it makes sense to instead build some agency into the system, so that the system can decide not to give users dangerous information.

Possibly you disagree with the claim "it is easy to build agentlike things out of belieflike things"? What I have in mind is a powerful epistemic oracle. As a simple example, let's say it can give highly accurate guesses to mathematically-posed problems. Then Dr. Evil can implement AIXI by feeding in AIXI's mathematical definition, for example. This is the sort of thing I had in mind, but generalized to the nonmathematical case. (EG, "conditional on my owning a super-powerful death ray soon, what actions do I take now")