LessWrong 2.0 Reader

Deceptive agents can collude to hide dangerous features in SAEs
Simon Lermen (dalasnoin) · 2024-07-15T17:07:33.283Z · comments (0)

EA Infrastructure Fund's Plan to Focus on Principles-First EA
Linch · 2023-12-06T03:24:55.844Z · comments (0)

[question] Why do Minimal Bayes Nets often correspond to Causal Models of Reality?
Dalcy (Darcy) · 2024-08-03T12:39:44.085Z · answers+comments (1)

How to develop a photographic memory 2/3
PhilosophicalSoul (LiamLaw) · 2023-12-30T20:18:14.255Z · comments (7)

[link] align your latent spaces
bhauth · 2023-12-24T16:30:09.138Z · comments (8)

AISC Project: Modelling Trajectories of Language Models
NickyP (Nicky) · 2023-11-13T14:33:56.407Z · comments (0)

Agent membranes/boundaries and formalizing “safety”
Chipmonk · 2024-01-03T17:55:21.018Z · comments (46)

[link] Attention on AI X-Risk Likely Hasn't Distracted from Current Harms from AI
Erich_Grunewald · 2023-12-21T17:24:16.713Z · comments (2)

The Limitations of GPT-4
p.b. · 2023-11-24T15:30:30.933Z · comments (12)

Just because an LLM said it doesn't mean it's true: an illustrative example
dirk (abandon) · 2024-08-21T21:05:59.691Z · comments (12)

Links and brief musings for June
Kaj_Sotala · 2024-07-06T10:10:03.344Z · comments (0)

Optimizing Repeated Correlations
SatvikBeri · 2024-08-01T17:33:23.823Z · comments (1)

[question] How are you preparing for the possibility of an AI bust?
Nate Showell · 2024-06-23T19:13:45.247Z · answers+comments (16)

Talk: AI safety fieldbuilding at MATS
Ryan Kidd (ryankidd44) · 2024-06-23T23:06:37.623Z · comments (2)

Am I going insane or is the quality of education at top universities shockingly low?
ChrisRumanov (pseudonymous-ai) · 2023-11-20T03:53:30.056Z · comments (30)

LessWrong email subscriptions?
Raemon · 2024-08-27T21:59:56.855Z · comments (6)

Consequentialism is a compass, not a judge
Neil (neil-warren) · 2024-04-13T10:47:44.980Z · comments (6)

Quick takes on "AI is easy to control"
So8res · 2023-12-02T22:31:45.683Z · comments (49)

[link] Positive visions for AI
L Rudolf L (LRudL) · 2024-07-23T20:15:26.064Z · comments (4)

Taking Into Account Sentient Non-Humans in AI Ambitious Value Learning: Sentientist Coherent Extrapolated Volition
Adrià Moret (Adrià R. Moret) · 2023-12-02T14:07:29.992Z · comments (31)

Losing Metaphors: Zip and Paste
jefftk (jkaufman) · 2023-11-29T20:31:07.464Z · comments (6)

Facebook is Paying Me to Post
jefftk (jkaufman) · 2023-11-14T19:10:07.303Z · comments (5)

[link] How to Upload a Mind (In Three Not-So-Easy Steps)
aggliu · 2023-11-13T18:13:32.893Z · comments (0)

The causal backbone conjecture
tailcalled · 2024-08-17T18:50:14.577Z · comments (0)

Thinking in 2D
sarahconstantin · 2024-10-20T19:30:05.842Z · comments (0)

Inferential Game: The Foraging (Ex-)Bandit
abstractapplic · 2024-11-11T16:59:42.058Z · comments (4)

Option control
Joe Carlsmith (joekc) · 2024-11-04T17:54:03.073Z · comments (0)

[link] A brief history of the automated corporation
owencb · 2024-11-04T14:35:04.906Z · comments (1)

[question] Seeking AI Alignment Tutor/Advisor: $100–150/hr
MrThink (ViktorThink) · 2024-10-05T21:28:16.491Z · answers+comments (3)

Do Sparse Autoencoders (SAEs) transfer across base and finetuned language models?
Taras Kutsyk · 2024-09-29T19:37:30.465Z · comments (8)

SAEs you can See: Applying Sparse Autoencoders to Clustering
Robert_AIZI · 2024-10-28T14:48:16.744Z · comments (0)

[link] overengineered air filter shelving
bhauth · 2024-11-08T22:04:39.987Z · comments (2)

[link] Care Doesn't Scale
stavros · 2024-10-28T11:57:38.742Z · comments (1)

Sleeping on Stage
jefftk (jkaufman) · 2024-10-22T00:50:07.994Z · comments (3)

Standard SAEs Might Be Incoherent: A Choosing Problem & A “Concise” Solution
Kola Ayonrinde (kola-ayonrinde) · 2024-10-30T22:50:45.642Z · comments (0)

The Foraging (Ex-)Bandit [Ruleset & Reflections]
abstractapplic · 2024-11-14T20:16:21.535Z · comments (3)

SAE features for refusal and sycophancy steering vectors
neverix · 2024-10-12T14:54:48.022Z · comments (4)

Geometric Utilitarianism (And Why It Matters)
StrivingForLegibility · 2024-05-12T03:41:21.342Z · comments (2)

[link] Let's Design A School, Part 2.1 School as Education - Structure
Sable · 2024-05-02T22:04:30.435Z · comments (2)

Smartphone Etiquette: Suggestions for Social Interactions
Declan Molony (declan-molony) · 2024-06-04T06:01:03.336Z · comments (4)

How do LLMs give truthful answers? A discussion of LLM vs. human reasoning, ensembles & parrots
Owain_Evans · 2024-03-28T02:34:21.799Z · comments (0)

The Sequences on YouTube
Neil (neil-warren) · 2024-01-07T01:44:39.663Z · comments (9)

[link] Forecasting future gains due to post-training enhancements
elifland · 2024-03-08T02:11:57.228Z · comments (2)

[link] my favourite Scott Sumner blog posts
DMMF · 2024-06-11T14:40:43.093Z · comments (0)

D&D.Sci Hypersphere Analysis Part 3: Beat it with Linear Algebra
aphyer · 2024-01-16T22:44:52.424Z · comments (1)

Causality is Everywhere
silentbob · 2024-02-13T13:44:49.952Z · comments (12)

[link] Emotional issues often have an immediate payoff
Chipmonk · 2024-06-10T23:39:40.697Z · comments (2)

Three Types of Constraints in the Space of Agents
Nora_Ammann · 2024-01-15T17:27:27.560Z · comments (3)

[question] Thoughts on Francois Chollet's belief that LLMs are far away from AGI?
O O (o-o) · 2024-06-14T06:32:48.170Z · answers+comments (17)

AI #57: All the AI News That’s Fit to Print
Zvi · 2024-03-28T11:40:05.435Z · comments (14)

Recent comments

lukehmiles on Alexander Gietelink Oldenziel's Shortform

As a layman, I have not seen much unrealistic hype. I think the hype-level is just about right.

lukehmiles on Alexander Gietelink Oldenziel's Shortform

You should not bury such a good post in a shortform

lukehmiles on Which evals resources would be good?

Maybe it should be a game that everyone can play

lukehmiles on lukehmiles's Shortform

You didn't ask me to pitch you but I will say a short pitch here for any bystanders. I know how to find a handful of good people and I know how to let a good chef cook without isolating them either. And I can make a pretty good fried egg if we're starving.

cstinesublime on Ayn Rand’s model of “living money”; and an upside of burnout

As always, I may not be the intended audience, so please excuse my questions that might be patently obvious to the intended audience.

Am I right in understanding a very simplified version of this model is that if you use willpower too much without deriving any net benefits, eventually you'll suffer 'burnout' which really is just a mistrust of using willpower ever, which may have negative effects on other aspects of your life even where willpower is needed like, say, cleaning your house?

Willpower, as I understand it, is another word for 'patience' or 'discipline', variously described as the ability to choose to endure pain (physical or emotional). Whether willpower actually exists is a question I won't get into here; let's assume for the sake of this model that it does, and that it fits the description of the ability to choose to endure pain.

I find this sentence especially alien:

your psyche’s conscious verbal planner “earns” willpower (earns trust with the rest of your psyche) by choosing actions that nourish your fundamental, bottom-up processes in the long run.

What is the "psyche's conscious verbal planner"? I don't know what this is or what part of my mind, person, identity, totality as an organism or anything really that I can equate this label to. Also, without examples of which actions nourish (again, would cleaning the house or cooking healthy meals be examples?), which are fundamental and which aren't, it's even harder to pin down what this is and why you attribute willpower to it.

It appears to have the ability to force oneself to go on a date, which really makes the "verbal" descriptor confusing since a lot of the processes that are involved in going on a date don't feel like they are verbal, lexical, or take the form of the speaker's native language written or spoken. At least in my experience, a lot of the thoughts, feelings, and motivations behind going on a date are not innately verbal for me and if you asked me "why did you agree to see this person?" - even if I felt no fear of embarrassment explaining my reasons - I'd have a hard time putting that into words. Or the words I'd use would be so impossibly vague ("they seem cool") as to suggest that there was a nonverbal reasoning or motivation.

Would this 'conscious verbal planner' also be the part of my mind and body that searches an online store a week later to see if those shoes I want are on special? Or would you attribute that to a different entity?

Is there an unconscious verbal planner?

When I am thinking very carefully about what I'm saying, but not so minutely that I'm thinking about the correct grammatical use, would the grammar I use be my unconscious verbal planner, while the content of my speech be the conscious verbal planner?

A lot of examples of willpower, for me, are nonverbal and come from guilt. Guilt felt as a somatic or bodily thing. I can't verbalize why I feel guilty, although it verbally equates to the words "should", "must", and even "ought" when used as imperatives, not as modals.

ricraz on Why I’m not a Bayesian

Ty for the link but these seem like both clearly bad semantics (e.g. under either of these the second-best hypothesis under consideration might score arbitrarily badly).

gunnar_zarncke on Gunnar_Zarncke's Shortform

Instrumental power-seeking might be less dangerous if the self-model of the agent is large and includes individual humans, groups, or even all of humanity and if we can reliably shape it that way.

It is natural for humans to form a self-model that is bounded by the body, though it is also common to be only the brain or the mind, and there are other self-models. See also Intuitive Self-Models [? · GW].

It is not clear what the self-model of an LLM agent would be. It could be

the temporary state of the execution of the model (or models),
the persistently running model and its memory state,
the compute resources (CPU/GPU/RAM) allocated to run the model and its collection of support programs,
the physical compute resources in some compute center(s),
the compute center as an organizational structure that includes the staff to maintain and operate not only the machines but also the formal organization (after all, without that, the machines will eventually fail), or
ditto but including all the utilities and suppliers to continue to operate it.

There is not as clear a physical boundary as in the human case. But even in the human case, esp. babies depend on caregivers to a large degree.

There are indications that we can shape the self-model of LLMs: Self-Other Overlap: A Neglected Approach to AI Alignment [LW · GW]

noah-birnbaum on The Case For Giving To The Shrimp Welfare Project

I really don't like when people downvote so heavily without giving reasons - think this is nicely argued!

One issue I do have is that Bob Fischer, the conductor of the Rethink study, warned about exactly what you are sorta doing here in being like ah now we can use x amount of shrimp and saying we can trolley problem a human for that many. This is just one contention, but I think the point is important and people willing to take weird/controversial ideas seriously (especially here!) should take it more seriously!

lukehmiles on lukehmiles's Shortform

Yeah I just wanted to check that nobody is giving away money before I go do the exact opposite thing I've been doing

gunnar_zarncke on Alexander Gietelink Oldenziel's Shortform

This sounds related to my complaint about the YUDKOWSKY + WOLFRAM ON AI RISK debate:

I wish there had been some effort to quantify @stephen_wolfram's "pockets of irreducibility" (section 1.2 & 4.2) because if we can prove that there aren't many or they are hard to find & exploit by ASI, then the risk might be lower.

I got this tweet wrong. I meant if pockets of irreducibility are common and non-pockets are rare and hard to find, then the risk from superhuman AI might be lower. I think Stephen Wolfram's intuition has merit but needs more analysis to be convincing.