LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

Without Fundamental Advances, Rebellion and Coup d'État are the Inevitable Outcomes of Dictators & Monarchs Trying to Control Large, Capable Countries
Roko · 2024-01-31T10:14:02.042Z · comments (34)

The economy is mostly newbs (strat predictions)
lukehmiles (lcmgcd) · 2024-02-01T19:15:49.420Z · comments (6)

Weak vs Quantitative Extinction-level Goodhart's Law
VojtaKovarik · 2024-02-21T17:38:15.375Z · comments (1)

[link] Solving alignment isn't enough for a flourishing future
mic (michael-chen) · 2024-02-02T18:23:00.643Z · comments (0)

Uncertainty in all its flavours
Cleo Nardo (strawberry calm) · 2024-01-09T16:21:07.915Z · comments (6)

Cheap Whiteboards!
Johannes C. Mayer (johannes-c-mayer) · 2024-08-08T13:52:59.627Z · comments (2)

[question] Why do Minimal Bayes Nets often correspond to Causal Models of Reality?
Dalcy (Darcy) · 2024-08-03T12:39:44.085Z · answers+comments (1)

flowing like water; hard like stone
lsusr · 2024-02-20T03:20:46.531Z · comments (4)

How to develop a photographic memory 2/3
PhilosophicalSoul (LiamLaw) · 2023-12-30T20:18:14.255Z · comments (7)

[question] What prevents SB-1047 from triggering on deep fake porn/voice cloning fraud?
ChristianKl · 2024-09-26T09:17:39.088Z · answers+comments (21)

Domain-specific SAEs
jacob_drori (jacobcd52) · 2024-10-07T20:15:38.584Z · comments (0)

An AI crash is our best bet for restricting AI
Remmelt (remmelt-ellen) · 2024-10-11T02:12:03.491Z · comments (1)

[question] Any real toeholds for making practical decisions regarding AI safety?
lukehmiles (lcmgcd) · 2024-09-29T12:03:08.084Z · answers+comments (6)

[link] Evaluating Synthetic Activations composed of SAE Latents in GPT-2
Giorgi Giglemiani (Rakh) · 2024-09-25T20:37:48.227Z · comments (0)

European Progress Conference
Martin Sustrik (sustrik) · 2024-10-06T11:10:03.819Z · comments (11)

[link] Predicting Influenza Abundance in Wastewater Metagenomic Sequencing Data
jefftk (jkaufman) · 2024-09-23T17:25:58.380Z · comments (0)

Interpretability of SAE Features Representing Check in ChessGPT
Jonathan Kutasov (jonathan-kutasov) · 2024-10-05T20:43:36.679Z · comments (2)

Open Thread Fall 2024
habryka (habryka4) · 2024-10-05T22:28:50.398Z · comments (69)

[question] Seeking AI Alignment Tutor/Advisor: $100–150/hr
MrThink (ViktorThink) · 2024-10-05T21:28:16.491Z · answers+comments (3)

5 ways to improve CoT faithfulness
CBiddulph (caleb-biddulph) · 2024-10-05T20:17:12.637Z · comments (8)

Do Sparse Autoencoders (SAEs) transfer across base and finetuned language models?
Taras Kutsyk · 2024-09-29T19:37:30.465Z · comments (7)

SAE features for refusal and sycophancy steering vectors
neverix · 2024-10-12T14:54:48.022Z · comments (4)

[link] Positive visions for AI
L Rudolf L (LRudL) · 2024-07-23T20:15:26.064Z · comments (4)

Three Types of Constraints in the Space of Agents
Nora_Ammann · 2024-01-15T17:27:27.560Z · comments (3)

D&D.Sci Hypersphere Analysis Part 3: Beat it with Linear Algebra
aphyer · 2024-01-16T22:44:52.424Z · comments (1)

Vote in the LessWrong review! (LW 2022 Review voting phase)
habryka (habryka4) · 2024-01-17T07:22:17.921Z · comments (9)

Exploring OpenAI's Latent Directions: Tests, Observations, and Poking Around
Johnny Lin (hijohnnylin) · 2024-01-31T06:01:27.969Z · comments (4)

The Sequences on YouTube
Neil (neil-warren) · 2024-01-07T01:44:39.663Z · comments (9)

[link] Manifold Markets
PeterMcCluskey · 2024-02-02T17:48:36.630Z · comments (9)

Why I think it's net harmful to do technical safety research at AGI labs
Remmelt (remmelt-ellen) · 2024-02-07T04:17:15.246Z · comments (24)

Causality is Everywhere
silentbob · 2024-02-13T13:44:49.952Z · comments (12)

Meetup In a Box: Year In Review
Czynski (JacobKopczynski) · 2024-02-14T01:18:28.259Z · comments (0)

Agent membranes/boundaries and formalizing “safety”
Chipmonk · 2024-01-03T17:55:21.018Z · comments (46)

Evaluating Solar
jefftk (jkaufman) · 2024-02-17T21:50:04.783Z · comments (5)

Consequentialism is a compass, not a judge
Neil (neil-warren) · 2024-04-13T10:47:44.980Z · comments (6)

[link] Emotional issues often have an immediate payoff
Chipmonk · 2024-06-10T23:39:40.697Z · comments (2)

[link] my favourite Scott Sumner blog posts
DMMF · 2024-06-11T14:40:43.093Z · comments (0)

[question] Thoughts on Francois Chollet's belief that LLMs are far away from AGI?
O O (o-o) · 2024-06-14T06:32:48.170Z · answers+comments (17)

[question] How are you preparing for the possibility of an AI bust?
Nate Showell · 2024-06-23T19:13:45.247Z · answers+comments (16)

Talk: AI safety fieldbuilding at MATS
Ryan Kidd (ryankidd44) · 2024-06-23T23:06:37.623Z · comments (2)

Links and brief musings for June
Kaj_Sotala · 2024-07-06T10:10:03.344Z · comments (0)

[link] Attention on AI X-Risk Likely Hasn't Distracted from Current Harms from AI
Erich_Grunewald · 2023-12-21T17:24:16.713Z · comments (2)

Optimizing Repeated Correlations
SatvikBeri · 2024-08-01T17:33:23.823Z · comments (1)

LessWrong email subscriptions?
Raemon · 2024-08-27T21:59:56.855Z · comments (6)

The causal backbone conjecture
tailcalled · 2024-08-17T18:50:14.577Z · comments (0)

Just because an LLM said it doesn't mean it's true: an illustrative example
dirk (abandon) · 2024-08-21T21:05:59.691Z · comments (12)

Am I going insane or is the quality of education at top universities shockingly low?
ChrisRumanov (pseudonymous-ai) · 2023-11-20T03:53:30.056Z · comments (30)

What is malevolence? On the nature, measurement, and distribution of dark traits
David Althaus (wallowinmaya) · 2024-10-23T08:41:33.197Z · comments (3)

[link] Can a Bayesian Oracle Prevent Harm from an Agent? (Bengio et al. 2024)
mattmacdermott · 2024-09-01T07:46:26.647Z · comments (0)

What is the best argument that LLMs are shoggoths?
JoshuaFox · 2024-03-17T11:36:23.636Z · comments (22)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

steve2152 on Self-prediction acts as an emergent regularizer

One common question is whether the self-modeling task, which involves predicting a layer's own activations, would cause the network to merely learn the identity function. Intuitively, this might seem like an optimal outcome for minimizing the self-modeling loss.

I found this section confusing. If the identity function is the global optimum for self-modeling loss, isn’t it kinda surprising that training doesn’t converge to the identity function? Or does the identity function make it worse at the primary task? If so, why?

[I’m sure this is going to be wrong in some embarrassing way, but what the heck… What I’m imagining right now is as follows. There’s an N×1 activation vector in the second-to-last layer of the DNN, and then a M×N weight matrix constituting the linear transformation, and you multiply them to get a M×1 output layer of the DNN. The first M-N entries of that output layer are the “primary task” outputs, and the bottom N entries are the “self-modeling” outputs, which are compared to the earlier N×1 activation vector mentioned above. And when you’re talking about “identity matrix”, you actually mean that the bottom N×N block of the weight matrix is close to an identity matrix but evidently not quite. (Oops I’m leaving out the bias vector, oh well.) If I’m right so far, then it wouldn’t be the case that the identity matrix makes the thing worse at the primary task, because the top (M-N)×N block of the weight matrix can still be anything. Where am I going wrong?]

tsvibt on johnswentworth's Shortform

I didn't read this carefully--but it's largely irrelevant. Adult editing probably can't have very large effects because developmental windows have passed; but either way the core difficulty is in editor delivery. Germline engineering does not require better gene targets--the ones we already have are enough to go as far as we want. The core difficulty there is taking a stem cell and making it epigenomically competent to make a baby (i.e. make it like a natural gamete or zygote).

austin-chen on Why I quit effective altruism, and why Timothy Telleen-Lawton is staying (for now)

Insofar as you're thinking I said bad people, please don't let yourself make that mistake, I said bad values.

I appreciate you drawing the distinction! The bit about "bad people" was more directed at Tsvi, or possibly the voters who agreevoted with Tsvi.

There's a lot of massively impactful difference in culture and values

Mm, I think if the question is "what accounts for the differences between the EA and rationalist movements today, wrt number of adherents, reputation, amount of influence, achievements" I would assign credit in the ratio of ~1:3 to differences in (values held by individuals):systems. Where systems are roughly: how the organizations are set up, how funding and information flows through the ecosystem.

(As I write this, I realize that maybe even caring about adherents/reputation/influence/achievement in the first place is an impact-based, EA-frame, and the thing that Ben cares about is more like "what accounts for the differences in their philosophies or gestalt of what it feels like to be in the movement"; I feel like I'm lowkey failing an ITT here...)

tsvibt on Why I quit effective altruism, and why Timothy Telleen-Lawton is staying (for now)

Ben's responses largely cover what I would have wanted to say. But on a meta note: I wrote specifically

I think a hypothesis that does have to be kept in mind is that some people don't care.

I do also think the hypothesis is true (and it's reasonable for this thread to discuss that claim, of course).

But the reason I said it that way, is that it's a relatively hard hypothesis to evaluate. You'd probably have to have several long conversations with several different people, in which you successfully listen intensely to who they are / what they're thinking / how they're processing what you say. Probably only then could you even have a chance at reasonably concluding something like "they actually don't care about X", as distinct from "they know something that implies X isn't so important here" or "they just don't get that I'm talking about X" or "they do care about X but I wasn't hearing how" or "they're defensive in this moment, but will update later" or "they just hadn't heard why X is important (but would be open to learning that)", etc.

I agree that it's a potentially mindkilly hypothesis. And because it's hard to evaluate, the implicature of assertions about it is awkward--I wanted to acknowledge that it would be difficult to find a consensus belief state, and I wanted to avoid implying that the assertion is something we ought to be able to come to consensus about right now. And, more simply, it would take substantial work to explain the evidence for the hypothesis being true (in large part because I'd have to sort out my thoughts). For these reasons, my implied request is less like "let's evaluate this hypothesis right now", and more like "would you please file this hypothesis away in your head, and then if you're in a long conversation, on the relevant topic with someone in the relevant category, maybe try holding up the hypothesis next to your observations and seeing if it explains things or not".

In other words, it's a request for more data and a request for someone to think through the hypothesis more. It's far from perfectly neutral--if someone follows that request, they are spending their own computational resources and thereby extending some credit to me and/or to the hypothesis.

steve2152 on Self-prediction acts as an emergent regularizer

As a toy-model point of comparison, here’s one thing that could hypothetically happen during “self-modeling” of the activations of layer L: (1) the model always guesses that the activations of layer L are all 0; (2) gradient descent sculpts the model to have very small activations in layer L.

In this scenario, it’s not really “self-modeling” at all, but rather a roundabout way to implement “activation regularization” specifically targeted to layer L.

In “activation regularization”, the auxiliary loss term is just , whereas in your study it’s $| a -^a |^{2}$ (where $a$ is the layer L activation vector and $^a$ is the self-modeling guess vector). So activation regularization might be a better point of comparison than the weight regularization that you brought up in the appendix. E.g. activation regularization does have the property that it “adapts based on the structure and distribution of the input data”.

I’d be curious whether you get similar “network complexity” (SD & RLCT) results with plain old activation regularization. That might be helpful for disentangling the activation regularization from bona fide self-modeling.

(I haven’t really thought through the details. Is there batch norm? If so, how does that interact with what I wrote? Also, in my example at the top, I could have said “the model always guesses that the activations are some fixed vector V” instead of “…that the activations are all 0”. Does that make any difference? I dunno.)

Sorry if this is all stupid, or in the paper somewhere.

o-o on O O's Shortform

https://x.com/arcprize/status/1849225898391933148?s=46&t=lZJAHzXMXI1MgQuyBgEhgA

My read of the events. Anthropic is trying to raise money and rushed out a half baked model.

3.5 opus has not yet had the desired results. 3.5 sonnet, being easier to iterate on, was tuned to beat OpenAI’s model on some arbitrary benchmarks in an effort to wow investors.

With the failed run of Opus, they presumably tried to get o1 like reasoning results or some agentic breakthrough. The previous 3.5s was also particularly good because of a fluke of the training run rng (same as gpt4-0314), which makes it harder for iterations to beat it.

They are probably now rushing to scale inference time compute. I wonder if they tried doing something with steering vectors initially for 3.5 opus.

gwern on The Mask Comes Off: At What Price?

No, that's what I think too: they were turning down investors, even, excluding them from the upround. The conditionality was probably not necessary at all. But it does serve a valuable purpose for the inevitable lawsuit.

gwern on johnswentworth's Shortform

With SNPs, there's tens of thousands of different SNPs which would each need to be targeted differently. With high copy sequences, there's a relatively small set of different sequences.

No, rare variants are no silver bullet here. There's not a small set, there's a larger set - there would probably be combinatorially more rare variants because there are so many ways to screw up genomes beyond the limited set of ways defined by a single-nucleotide polymorphism, which is why it's hard to either select on or edit rare variants: they have larger (harmful) effects due to being rare, yes, and account for a large chunk of heritability, yes, but there are so many possible rare mutations that each one has only a few instances worldwide which makes them hard to estimate correctly via pure GWAS-style approaches. And they tend to be large or structural and so extremely difficult to edit safely compared to editing a single base-pair. (If it's hard to even sequence a CNV, how are you going to edit it?)

They definitely contribute a lot of the missing heritability (see GREML-KIN), but that doesn't mean you can feasibly do much about them. If there are tens of millions of possible rare variants, across the entire population, but they are present in only a handful of individuals a piece (as estimated by the GREML-KIN variance components where the family-level accounts for a lot of variance), it's difficult to estimate their effect to know if you want to select against or edit them in the first place. (Their larger effect sizes don't help you nearly as much as their rarity hurts you.)

So this is why if you read the CNV studies and you look at the hits they identify, and how many subjects are covered by the identified hits, you find that like, maybe 2% of the cohort will have one of those specific identified hits and lose 2 IQ points or gain 2 kg of fat etc. So you can see how that would work out in embryo selection: you'd be able to avoid that loss, which is meaningful! ...in a tiny fraction of all embryos. On average, you'd just sequence them all, find no known pathogenic variant, and shrug, and use the SNP PGS like usual, having gained nothing.

Also, of course, WGS is substantially more expensive than SNP genotyping and more difficult to do on embryos.

If the genetic architecture had worked out otherwise, if there had instead been a lot of rare mutations which increased intelligence, then life would be a lot more convenient. Instead, it's a lot of 'sand in the gears', and once you move past the easy specks of sand, they all become their own special little snowflakes.

This is why rare variants are not too promising, although they are the logical place to go after you start to exhaust common SNPs. You probably have to find an alternative approach like directly modeling or predicting the pathogenicity of a rare variant from trying to understand its biological effects, which is hard to do and hard to quantify or predict progress in. (You can straightforwardly model GWAS on common SNPs and how many samples you need and what variance your PGS will get, but predicting progress of pathogenicity predictors has no convenient approach.) Similarly, you can try very broad crude approaches like 'select embryos with the fewest de novo mutations'... but then you lose most of the possible variance and it'll add little.

eggsyntax on The Mask Comes Off: At What Price?

Possibly too cynical, but I find myself wondering whether the conditionality was in fact engineered by OAI in order to achieve this purpose. My impression is that there are a lot of VCs shouting 'Take my money!' in the general direction of OpenAI and/or Altman who wouldn't have demanded the restructuring.

lorec on A metaphor: what "green lights" for AGI would look like

Changed to "RLHF as actually implemented." I'm aware of its theoretical origin story with Paul Christiano; I'm going a little "the purpose of a system is what it does".