Posts

Spatial attention as a “tell” for empathetic simulation? 2024-04-26T15:10:58.040Z
A couple productivity tips for overthinkers 2024-04-20T16:05:50.332Z
“Artificial General Intelligence”: an extremely brief FAQ 2024-03-11T17:49:02.496Z
Some (problematic) aesthetics of what constitutes good work in academia 2024-03-11T17:47:28.835Z
Woods’ new preprint on object permanence 2024-03-07T21:29:57.738Z
Social status part 2/2: everything else 2024-03-05T16:29:19.072Z
Social status part 1/2: negotiations over object-level preferences 2024-03-05T16:29:07.143Z
Four visions of Transformative AI success 2024-01-17T20:45:46.976Z
Deceptive AI ≠ Deceptively-aligned AI 2024-01-07T16:55:13.761Z
[Valence series] Appendix A: Hedonic tone / (dis)pleasure / (dis)liking 2023-12-20T15:54:17.131Z
[Valence series] 5. “Valence Disorders” in Mental Health & Personality 2023-12-18T15:26:29.970Z
[Valence series] 4. Valence & Social Status 2023-12-15T14:24:41.040Z
[Valence series] 3. Valence & Beliefs 2023-12-11T20:21:30.570Z
[Valence series] 2. Valence & Normativity 2023-12-07T16:43:49.919Z
[Valence series] 1. Introduction 2023-12-04T15:40:21.274Z
Thoughts on “AI is easy to control” by Pope & Belrose 2023-12-01T17:30:52.720Z
I’m confused about innate smell neuroanatomy 2023-11-28T20:49:13.042Z
8 examples informing my pessimism on uploading without reverse engineering 2023-11-03T20:03:50.450Z
Late-talking kid part 3: gestalt language learning 2023-10-17T02:00:05.182Z
“X distracts from Y” as a thinly-disguised fight over group status / politics 2023-09-25T15:18:18.644Z
A Theory of Laughter—Follow-Up 2023-09-14T15:35:18.913Z
A Theory of Laughter 2023-08-23T15:05:59.694Z
Model of psychosis, take 2 2023-08-17T19:11:17.386Z
My checklist for publishing a blog post 2023-08-15T15:04:56.219Z
Lisa Feldman Barrett versus Paul Ekman on facial expressions & basic emotions 2023-07-19T14:26:05.675Z
Thoughts on “Process-Based Supervision” 2023-07-17T14:08:57.219Z
Munk AI debate: confusions and possible cruxes 2023-06-27T14:18:47.694Z
My side of an argument with Jacob Cannell about chip interconnect losses 2023-06-21T13:33:49.543Z
LeCun’s “A Path Towards Autonomous Machine Intelligence” has an unsolved technical alignment problem 2023-05-08T19:35:19.180Z
Connectomics seems great from an AI x-risk perspective 2023-04-30T14:38:39.738Z
AI doom from an LLM-plateau-ist perspective 2023-04-27T13:58:10.973Z
Is “FOXP2 speech & language disorder” really “FOXP2 forebrain fine-motor crappiness”? 2023-03-23T16:09:04.528Z
EAI Alignment Speaker Series #1: Challenges for Safe & Beneficial Brain-Like Artificial General Intelligence with Steve Byrnes 2023-03-23T14:32:53.800Z
Plan for mediocre alignment of brain-like [model-based RL] AGI 2023-03-13T14:11:32.747Z
Why I’m not into the Free Energy Principle 2023-03-02T19:27:52.309Z
Why I’m not working on {debate, RRM, ELK, natural abstractions} 2023-02-10T19:22:37.865Z
Heritability, Behaviorism, and Within-Lifetime RL 2023-02-02T16:34:33.182Z
Schizophrenia as a deficiency in long-range cortex-to-cortex communication 2023-02-01T19:32:24.447Z
“Endgame safety” for AGI 2023-01-24T14:15:32.783Z
Thoughts on hardware / compute requirements for AGI 2023-01-24T14:03:39.190Z
Note on algorithms with multiple trained components 2022-12-20T17:08:24.057Z
More notes from raising a late-talking kid 2022-12-20T02:13:01.018Z
My AGI safety research—2022 review, ’23 plans 2022-12-14T15:15:52.473Z
The No Free Lunch theorem for dummies 2022-12-05T21:46:25.950Z
My take on Jacob Cannell’s take on AGI safety 2022-11-28T14:01:15.584Z
Me (Steve Byrnes) on the “Brain Inspired” podcast 2022-10-30T19:15:07.884Z
What does it take to defend the world against out-of-control AGIs? 2022-10-25T14:47:41.970Z
Quick notes on “mirror neurons” 2022-10-04T17:39:53.144Z
Book review: “The Heart of the Brain: The Hypothalamus and Its Hormones” 2022-09-27T13:20:51.434Z
Thoughts on AGI consciousness / sentience 2022-09-08T16:40:34.354Z

Comments

Comment by Steven Byrnes (steve2152) on Spatial attention as a “tell” for empathetic simulation? · 2024-04-26T19:39:09.383Z · LW · GW

If I’m looking up at the clouds, or at a distant mountain range, then everything is far away (the ground could be cut off from my field-of-view)—but it doesn’t trigger the sensations of fear-of-heights, right? Also, I think blind people can be scared of heights?

Another possible fear-of-heights story just occurred to me—I added to the post in a footnote, along with why I don’t believe it.

Comment by Steven Byrnes (steve2152) on Transfer Learning in Humans · 2024-04-22T13:58:20.825Z · LW · GW

From when I've talked with people from industry, they don't seem at all interested in tracking per-employee performance (e.g. Google isn't running RCTs on their engineers to increase their coding performance, and estimates for how long projects will take are not tracked & scored). 

FWIW Joel Spolsky suggests that people managing software engineers should have detailed schedules, and says big companies have up-to-date schedules, and built a tool to leverage historical data for better schedules. At my old R&D firm, people would frequently make schedules and budgets for projects, and would be held to account if their estimates were bad, and I got a strong impression that seasoned employees tended to get better at making accurate schedules and budgets over time. (A seasoned employee suggested to me a rule-of-thumb for novices, that I should earnestly try to make an accurate schedule, then go through the draft replacing the word “days” with “weeks”, and “weeks” with “months”, etc.) (Of course it’s possible for firms to not be structured such that people get fast and frequent feedback on the accuracy of their schedules and penalties for doing a bad job, in which case they probably won’t get better over time.)

I guess what’s missing is (1) systemizing scheduling so that it’s not a bunch of heuristics in individual people’s heads (might not be possible), (2) intervening on employee workflows etc. (e.g. A/B testing) and seeing how that impacts productivity.

Practice testing

IIUC the final “learning” was assessed via a test. So you could rephrase this as, “if you do the exact thing X, you’re liable to get better at doing X”, where here X=“take a test on topic Y”. (OK, it generalized “from simple recall to short answer inference tests” but that’s really not that different.)

I'm also a little bit surprised that keywords and mnemonics don't work (since they are used very often by competitive mnemonists)

I invent mnemonics all the time, but normal people still need spaced-repetition or similar to memorize the mnemonic. The mnemonics are easier to remember (that’s the point) but “easier” ≠ effortless.

 

As another point, I think a theme that repeatedly comes up is that people are much better at learning things when there’s an emotional edge to them—for example:

  • It’s easier to remember things if you’ve previously brought them up in an argument with someone else.
  • It’s easier to remember things if you’ve previously gotten them wrong in public and felt embarrassed.
  • It’s easier to remember things if you’re really invested in and excited by a big project and figuring this thing out will unblock the project.

This general principle makes obvious sense from an evolutionary perspective (it’s worth remembering a lion attack, but it’s not worth remembering every moment of a long uneventful walk), and I think it’s also pretty well understood neuroscientifically (physiological arousal → more norepinephrine, dopamine, and/or acetylcholine → higher learning rates … something like that).
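As a toy illustration of that last mechanism (entirely my own made-up numbers and function names, just to make the “higher learning rates” point concrete), you can picture arousal as a learning-rate multiplier on a simple prediction-error update:

```python
# Toy illustration (hypothetical): physiological arousal acting as a learning-rate
# multiplier in a simple prediction-error update. High-salience events (lion attack)
# get written into memory much faster than low-salience ones (uneventful walk).

def update_value(value, outcome, arousal, base_lr=0.05):
    """One prediction-error update whose effective learning rate scales with arousal."""
    effective_lr = base_lr * (1.0 + arousal)   # arousal roughly in [0, 10]
    return value + effective_lr * (outcome - value)

v_boring, v_scary = 0.0, 0.0
for _ in range(3):
    v_boring = update_value(v_boring, outcome=1.0, arousal=0.1)   # uneventful walk
    v_scary = update_value(v_scary, outcome=1.0, arousal=9.0)     # lion attack

print(v_boring, v_scary)   # the high-arousal memory converges far faster
```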

 

As another point, I’m not sure there’s any difference between “far transfer” and “deep understanding”. Thus, the interventions that you said were helpful for far transfer seem to be identical to the interventions that would lead to deep understanding / familiarity / facility with thinking about some set of ideas. See my comment here.

Comment by Steven Byrnes (steve2152) on A couple productivity tips for overthinkers · 2024-04-21T12:57:01.156Z · LW · GW

Yeah some of my to-do items are of the form "skim X". Inside the "card" I might have a few words about how I originally came across X and what I'm hoping to get out of skimming it.

Comment by Steven Byrnes (steve2152) on A couple productivity tips for overthinkers · 2024-04-21T12:27:16.445Z · LW · GW

It just refers to the fact that there are columns that you drag items between. I don't even really know how a "proper" kanban works.

If a new task occurs to me in the middle of something else, I'll temporarily put it in a left (high-priority) column, just so I don't forget it, and then later when I'm at my computer and have a moment to look at it, I might decide to drag it to a right (low-priority) column instead of doing it.

Comment by Steven Byrnes (steve2152) on Express interest in an "FHI of the West" · 2024-04-20T16:26:26.406Z · LW · GW

Such an unambitious, narrowly-scoped topic area?? There may be infinitely many parallel universes in which we can acausally improve life … you’re giving up almost all of the value at stake before even starting :)

Comment by Steven Byrnes (steve2152) on Generalized Stat Mech: The Boltzmann Approach · 2024-04-12T23:50:15.709Z · LW · GW

I always thought of $S = -k_B \sum_i p_i \ln p_i$ as the exact / “real” definition of entropy, and $S = k_B \ln W$ (with $W$ the number of microstates) as the specialization of that “exact” formula to the case where each microstate is equally probable (a case which is rarely exactly true but often a good approximation). So I found it a bit funny that you only mention the second formula, not the first. I guess you were keeping it simple? Or do you not share that perspective?
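For concreteness, here’s the standard reduction I have in mind (textbook material, not anything specific to the post under discussion):

```latex
S = -k_B \sum_i p_i \ln p_i
\;\;\xrightarrow{\;p_i \,=\, 1/W \text{ for all } i\;}\;\;
S = -k_B \sum_{i=1}^{W} \frac{1}{W} \ln\frac{1}{W} = k_B \ln W .
```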

Comment by Steven Byrnes (steve2152) on Ackshually, many worlds is wrong · 2024-04-12T20:53:45.904Z · LW · GW

I just looked up “many minds” and it’s a little bit like what I wrote here, but described differently in ways that I think I don’t like. (It’s possible that Wikipedia is not doing it justice, or that I’m misunderstanding it.) I think minds are what brains do, and I think brains are macroscopic systems that follow the laws of quantum mechanics just like everything else in the universe.

What property distinguished a universe where "Harry found himself in a tails branch" and a universe where "Harry found himself in a heads branch"?

Those both happen in the same universe. Those Harrys both exist. Maybe you should put aside many-worlds and just think about Parfit’s teletransportation paradox. I think you’re assuming that “thread of subjective experience” is a coherent concept that satisfies all the intuitive properties that we feel like it should have, and I think that the teletransportation paradox is a good illustration that it’s not coherent at all, or at the very least, we should be extraordinarily cautious when making claims about the properties of this alleged thing you call a “thread of subjective experience” or “thread of consciousness”. (See also other Parfit thought experiments along the same lines.)

I don’t like the idea where we talk about what will happen to Harry, as if that has to have a unique answer. Instead I’d rather talk about Harry-moments, where there’s a Harry at a particular time doing particular things and full of memories of what happened in the past. Then there are future Harry-moments. We can go backwards in time from a Harry-moment to a unique (at any given time) past Harry-moment corresponding to it—after all, we can inspect the memories in future-Harry-moment’s head about what past-Harry was doing at that time (assuming there were no weird brain surgeries etc). But we can’t uniquely go in the forward direction: Who’s to say that multiple future-Harry-moments can’t hold true memories of the very same past-Harry-moment?

Here I am, right now, a Steve-moment. I have a lot of direct and indirect evidence of quantum interactions that have happened in the past or are happening right now, as imprinted on my memories, surroundings, and so on. And if you a priori picked some possible property of those interactions that (according to the Born rule) has 1-in-a-googol probability to occur in general, then I would be delighted to bet my life’s savings that this property is not true of my current observations and memories. Obviously that doesn’t mean that it’s literally impossible.

Comment by Steven Byrnes (steve2152) on Ackshually, many worlds is wrong · 2024-04-12T20:14:33.927Z · LW · GW

I wrote “flipping an unbiased coin” so that’s 50/50.

Comment by Steven Byrnes (steve2152) on Ackshually, many worlds is wrong · 2024-04-12T12:59:04.666Z · LW · GW

there's some preferred future "I" out of many who is defined not only by observations he receives, but also by being a preferred continuation of subjective experience defined by an unknown mechanism

I disagree with this part—if Harry does the quantum equivalent of flipping an unbiased coin, then there’s a branch of the universe’s wavefunction in which Harry sees heads and says “gee, isn’t it interesting that I see heads and not tails, I wonder how that works, hmm why did my thread of subjective experience carry me into the heads branch?”, and there’s also a branch of the universe’s wavefunction in which Harry sees tails and says “gee, isn’t it interesting that I see tails and not heads, I wonder how that works, hmm why did my thread of subjective experience carry me into the tails branch?”. I don’t think either of these Harrys is “preferred”.

I don’t think there’s any extra “complexity penalty” associated with the previous paragraph: the previous paragraph is (I claim) just a straightforward description of what would happen if the universe and everything in it (including Harry) always follows the Schrodinger equation—see Quantum Mechanics In Your Face for details.

I think we deeply disagree about the nature of consciousness, but that’s a whole can of worms that I really don’t want to get into in this comment thread.

doesn't strike me as "feeling more natural"

Maybe you’re just going for rhetorical flourish, but my specific suggestion with the words “feels more natural” in the context of my comment was: the axiom “I will find myself in a branch of amplitude approaching 0 with probability approaching 0” “feels more natural” than the axiom “I will find myself in a branch of amplitude c with probability $|c|^2$”. That particular sentence was not a comparison of many-worlds with non-many-worlds, but rather a comparison of two ways to formulate many-worlds. So I think your position is that you find neither of those to “feel natural”.

Comment by Steven Byrnes (steve2152) on Ackshually, many worlds is wrong · 2024-04-12T10:35:37.224Z · LW · GW

Quantum Mechanics In Your Face talk by Sidney Coleman, starting slide 17 near the end. The basic idea is to try to operationalize how someone might test the Born rule—they take a bunch of quantum measurements, one after another, and they subject their data to a bunch of randomness tests and so on, and then they eventually declare “Born rule seems true” or “Born rule seems false” after analyzing the data. And you can show that the branches in which this person declares “Born rule seems false” have collective amplitude approaching zero, in the limit as their test procedure gets better and better (i.e. as they take more and more measurements).
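Here’s a sketch of the calculation behind that claim (this is the generic frequency-operator argument; Coleman’s exact presentation may differ in the details):

```latex
% N measurements on independent copies of |\psi\rangle = \alpha|0\rangle + \beta|1\rangle.
% The total squared amplitude of all branches in which the observed frequency of
% outcome 0 deviates from |\alpha|^2 by more than \epsilon is
\sum_{k \,:\, |k/N - |\alpha|^2| > \epsilon} \binom{N}{k}\, |\alpha|^{2k}\, |\beta|^{2(N-k)}
\;\xrightarrow[\;N \to \infty\;]{}\; 0
% by the weak law of large numbers. So the branches in which the experimenter
% analyzes their data and declares “Born rule seems false” carry collective
% squared amplitude approaching zero.
```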

Comment by Steven Byrnes (steve2152) on Ackshually, many worlds is wrong · 2024-04-12T00:19:33.006Z · LW · GW

(Warning that I may well be misunderstanding this post.)

For any well-controlled isolated system, if it starts in a state $|\psi\rangle$, then at a later time it will be in state $U|\psi\rangle$, where U is a certain deterministic unitary operator. So far this is indisputable—you can do quantum state tomography, you can measure the interference effects, etc. Right?

OK, so then you say: “Well, a very big well-controlled isolated system could be a box with my friend Harry and his cat in it, and if the same principle holds, then there will be deterministic unitary evolution from $|\psi\rangle$ into $U|\psi\rangle$, and hey, I just did the math and it turns out that $U|\psi\rangle$ will have a 50/50 mix of ‘Harry sees his cat alive’ and ‘Harry sees his cat dead and is sad’.” This is beyond what’s possible to directly experimentally verify, but I think it should be a very strong presumption by extrapolating from the first paragraph. (As you say, “quantum computers prove larger and larger superpositions to be stable”.)

OK, and then we take one more step by saying “Hey what if I’m in the well-controlled isolated system?” (e.g. the “system” in question is the whole universe). From my perspective, it’s implausible and unjustified to do anything besides say that the same principle holds as above: if the universe (including me) starts in a state $|\psi\rangle$, then at a later time it will be in state $U|\psi\rangle$, where U is a deterministic unitary operator.

…And then there’s an indexicality issue, and you need another axiom to resolve it. For example: “as quantum amplitude of a piece of the wavefunction goes to zero, the probability that I will ‘find myself’ in that piece also goes to zero” is one such axiom, and equivalent (it turns out) to the Born rule. It’s another axiom for sure; I just like that particular formulation because it “feels more natural” or something.
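As a concrete toy version of the first paragraph, here’s a single qubit in place of a Harry-sized system (my own sketch; the specific state and unitary are arbitrary):

```python
import numpy as np

# Toy version of "a well-controlled isolated system evolves as |psi> -> U|psi>":
# a single qubit starting in |0>, evolved by a Hadamard-like deterministic unitary U.
psi0 = np.array([1.0, 0.0], dtype=complex)                    # initial state |psi>
U = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)   # deterministic unitary

psi1 = U @ psi0                                               # final state U|psi>

assert np.allclose(U.conj().T @ U, np.eye(2))                 # check that U is unitary
print(np.abs(psi1) ** 2)                                      # Born weights: [0.5, 0.5]
# The evolution itself is perfectly deterministic; the 50/50 only shows up when the
# extra (Born-rule) axiom is invoked to say how likely "I" am to find myself in
# each branch.
```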

I think the place anti-many-worlds-people get off the boat is this last step, because there’s actually two attitudes:

  • My attitude is: there’s a universe following orderly laws, and the universe was there long before there were any people around to observe it, and it will be there long after we’re gone, and the universe happened to spawn people and now we can try to study and understand it.
  • An opposing attitude is: the starting point is my first-person subjective mind, looking out into the universe and making predictions about what I’ll see. So my perspective is special—I need not be troubled by the fact that I claim that there are many-Harrys when Harry’s in the box and I’m outside it, but I also claim that there are not many-me’s when I’m in the box. That’s not inconsistent, because I’m the one generating predictions for myself, so the situation isn’t symmetric. If I see that the cat is dead, then the cat is dead, and if you outside the well-isolated box say “there’s a branch of the wavefunction where you saw that the cat’s alive”, then I’ll say “well, from my perspective, that alleged branch is not ‘real’; it does not ‘exist’”. In other words, when I observed the cat, I “collapsed my wavefunction” by erasing the part of the (alleged) wavefunction that is inconsistent with my indexical observations, and then re-normalizing the wavefunction.

I’m really unsympathetic to the second bullet-point attitude, but I don’t think I’ve ever successfully talked somebody out of it, so evidently it’s a pretty deep gap, or at any rate I for one am apparently unable to communicate past it.

maybe the pilot-wave model is directionally correct in the sense of informing us about the nature of knowledge?

FWIW last I heard, nobody has constructed a pilot-wave theory that agrees with quantum field theory (QFT) in general and the standard model of particle physics in particular. The tricky part is that in QFT there’s observable interference between states that have different numbers of particles in them, e.g. a virtual electron can appear then disappear in one branch but not appear at all in another, and those branches have easily-observable interference in collision cross-sections etc. That messes with the pilot-wave formalism, I think. 

Comment by Steven Byrnes (steve2152) on Is LLM Translation Without Rosetta Stone possible? · 2024-04-11T01:23:30.872Z · LW · GW

I think the standard technical term for what you’re talking about is “unsupervised machine translation”. Here’s a paper on that, for example, although it’s not using the LLM approach you propose. (I have no opinion about whether the LLM approach you propose would work or not.)

Comment by Steven Byrnes (steve2152) on How We Picture Bayesian Agents · 2024-04-08T20:42:04.671Z · LW · GW

In practice minds mostly seem to converge on quite similar latents

Yeah to some extent, although it’s stacking the deck when the minds speak the same language and grew up in the same culture. If you instead go to remote tribes, you find plenty of untranslatable words—or more accurately, words that translate to some complicated phrase that you’ve probably never thought about before. (I dug up an example for §4.3 here, in reference to Lisa Feldman Barrett’s extensive chronicling of exotic emotion words from around the world.)

(That’s not necessarily relevant to alignment because we could likewise put AGIs in a training environment with lots of English-language content, and then the AGIs would presumably get English-language concepts.)

“inconsistent beliefs”

You were talking about values and preferences in the previous paragraph, then suddenly switched to “beliefs”. Was that deliberate?

Comment by Steven Byrnes (steve2152) on Open Thread Spring 2024 · 2024-04-02T17:00:40.285Z · LW · GW

I’m in the market for a new productivity coach / accountability buddy, to chat with periodically (I’ve been doing one ≈20-minute meeting every 2 weeks) about work habits, and set goals, and so on. I’m open to either paying fair market rate, or to a reciprocal arrangement where we trade advice and promises etc. I slightly prefer someone not directly involved in AGI safety/alignment—since that’s my field and I don’t want us to get nerd-sniped into object-level discussions—but whatever, that’s not a hard requirement. You can reply here, or DM or email me. :) update: I’m all set now

Comment by Steven Byrnes (steve2152) on Coherence of Caches and Agents · 2024-04-02T13:15:19.804Z · LW · GW

Now, a system which doesn't satisfy the coherence conditions could still maximize some other kind of utility function - e.g. utility over whole trajectories, or some kind of discounted sum of utility at each time-step, rather than utility over end states. But that's not very interesting, in general; any old system can be interpreted as maximizing some utility function over whole trajectories (i.e. the utility function which assigns high score to whatever the system actually does, and low score to everything else).

It’s probably not intended, but I think this wording vaguely implies a false dichotomy between “a thing (approximately) coherently pursues a long-term goal” and “an uninteresting thing like a rock”. There are other options like “Bob wants to eventually get out of debt, but Bob also wants to always act with honor and integrity”. See my post Consequentialism & Corrigibility.

Relatedly, I don’t think memetics is the only reason humans don’t approximately-coherently pursue states of the world in the distant future. (You didn’t say it was, but sorta gave that vibe.) For one thing, something can be pleasant or unpleasant right now. For another thing, the value function is defined and updated in conjunction with a flawed and incomplete world-model, as in your Pointers Problem post.

Comment by Steven Byrnes (steve2152) on [Linkpost] Practically-A-Book Review: Rootclaim $100,000 Lab Leak Debate · 2024-03-30T21:08:08.938Z · LW · GW

I’m interested in Metacelsus’s answer.

My take is: I really haven’t been following the lab leak stuff. The point of my comment was to bring this hypothesis to the attention of people who have, and hopefully get some takes from them. As I understand it:

  • We know for sure that miners went into a cave, the same cave where btw one of the closest known wild relatives of COVID was later sampled
  • We know for sure that the miners got sick with COVID-like symptoms, some for 4+ months
  • We know for sure that samples (including posthumous samples) from those sick miners were sent to WIV, and that the researchers still had access to those samples into 2020

I think that’s more than enough to at least raise the Mojiang Miner Passage theory to consideration. Figuring out whether the theory is actually true or not would require a lot more beyond that, e.g. arguments about the exact genetic code of the furin cleavage site and all this other stuff which is way outside my area of expertise.  :)

Comment by Steven Byrnes (steve2152) on [Linkpost] Practically-A-Book Review: Rootclaim $100,000 Lab Leak Debate · 2024-03-29T18:25:55.561Z · LW · GW

[genetic sequence analysis] is stupid because none of the people involved had the technical understanding required to even interpret papers on the topic.

The two judges were:

  • Will van Treuren, a pharmaceutical entrepreneur with a PhD from Stanford and a background in bacteriology and immunology.
  • Eric Stansifer, an applied mathematician with a PhD from MIT and experience in mathematical virology.

Do you think the judges lack technical understanding to interpret papers on genetic sequence analysis, or do you not count the judges as “involved”, or both, or something else?

Comment by Steven Byrnes (steve2152) on [Linkpost] Practically-A-Book Review: Rootclaim $100,000 Lab Leak Debate · 2024-03-28T18:40:56.824Z · LW · GW

Way back in 2020 there was an article A Proposed Origin For SARS-COV-2 and the COVID-19 Pandemic, which I read after George Church tweeted it (!) (without comment or explanation). Their proposal (they call it "Mojiang Miner Passage" theory) in brief was that it WAS a lab leak but NOT gain-of-function. Rather, in April 2012, six workers in a Mojiang mine "fell ill from a mystery illness while removing bat faeces. Three of the six subsequently died." Their symptoms were a perfect match to COVID, and two were very sick for more than four months.

The proposal is that the virus spent those four months adapting to life in human lungs, including (presumably) evolving the furin cleavage site. And then (this is also well-documented) samples from these miners were sent to WIV. The proposed theory is that those samples sat in a freezer at WIV for a few years while WIV was constructing some new lab facilities, and then in 2019 researchers pulled out those samples for study and infected themselves.

I like that theory! I’ve liked it ever since 2020! It seems to explain many of the contradictions brought up by both sides of this debate—it’s compatible with Saar’s claim that the furin cleavage site is very different from what’s in nature and seems specifically adapted to humans, but it’s also compatible with Peter’s claim that the furin cleavage site looks weird and evolved. It’s compatible with Saar’s claim that WIV is suspiciously close to the source of the outbreak, but it’s also compatible with Peter’s claim that WIV might not have been set up to do serious GoF experiments. It’s compatible with the data comparing COVID to other previously-known viruses (supposedly). Etc.

Old as this theory is, the authors are still pushing it and they claim that it’s consistent with all the evidence that’s come out since then (see author’s blog). But I’m sure not remotely an expert, and would be interested if anyone has opinions about this. I’m still confused why it’s never been much discussed.

Comment by Steven Byrnes (steve2152) on My Interview With Cade Metz on His Reporting About Slate Star Codex · 2024-03-28T16:40:41.115Z · LW · GW

I think this is a perfectly valid argument for why NYT shouldn't publish it, it just doesn't seem very strong or robust… Like, if the NYT did go out and count the number of pebbles on your road, then yes there's an opportunity cost to this etc., which makes it a pretty unnecessary thing to do, but it's not like you'd have any good reason to whip out a big protest or anything.

The context from above is that we’re weighing costs vs benefits of publishing the name, and I was pulling out the sub-debate over what the benefits are (setting aside the disagreement about how large the costs are).

I agree that “the benefits are ≈0” is not a strong argument that the costs outweigh the benefits in and of itself, because maybe the costs are ≈0 as well. If a journalist wants to report the thickness of Scott Alexander’s shoelaces, maybe the editor will say it’s a waste of limited wordcount, but the journalist could say “hey it’s just a few words, and y’know, it adds a bit of color to the story”, and that’s a reasonable argument: the cost and benefit are each infinitesimal, and reasonable people can disagree about which one slightly outweighs the other.

But “the benefits are ≈0” is a deciding factor in a context where the costs are not infinitesimal. Like if Scott asserts that a local gang will beat him senseless if the journalist reports the thickness of his shoelaces, it’s no longer infinitesimal costs versus infinitesimal benefits, but rather real costs vs infinitesimal benefits.

If the objection is “maybe the shoelace thickness is actually Scott’s dark embarrassing secret that the public has an important interest in knowing”, then yeah that’s possible and the journalist should certainly look into that possibility. (In the case at hand, if Scott were secretly SBF’s brother, then everyone agrees that his last name would be newsworthy.) But if the objection is just “Scott might be exaggerating, maybe the gang won’t actually beat him up too badly if the shoelace thing is published”, then I think a reasonable ethical journalist would just leave out the tidbit about the shoelaces, as a courtesy, given that there was never any reason to put it in in the first place.

Comment by Steven Byrnes (steve2152) on My Interview With Cade Metz on His Reporting About Slate Star Codex · 2024-03-27T16:45:51.399Z · LW · GW

But can you imagine writing a newspaper article where you are reporting on the actions of an anonymous person? Its borderline nonsense. 

I can easily imagine writing a newspaper article about how Charlie Sheen influenced the film industry, that nowhere mentions the fact that his legal name is Carlos Irwin Estévez. Can’t you? Like, here’s one.

(If my article were more biographical in nature, with a focus on Charlie Sheen’s childhood and his relationship with his parents, rather than his influence on the film industry, then yeah I would presumably mention his birth name somewhere in my article in that case. No reason not to.)

Comment by Steven Byrnes (steve2152) on Have we really forsaken natural selection? · 2024-03-27T13:41:04.435Z · LW · GW

[partly copied from here]

  • The history of evolution to date is exactly what you’d get from a particular kind of RL algorithm optimizing genes for how many future copies are encoded in literal DNA molecules in basement reality.
  • The history of evolution to date is exactly what you’d get from a particular kind of RL algorithm optimizing genes for how many future copies are encoded in literal DNA molecules in either basement reality or accurate simulations.
  • The history of evolution to date is exactly what you’d get from a particular kind of RL algorithm optimizing genes for how many future copies are encoded in DNA molecules, or any other format that resembles DNA functionally, regardless of whether it resembles DNA chemically or mechanistically.
  • The history of evolution to date is exactly what you’d get from a particular kind of RL algorithm optimizing ‘things’ for ‘their future existence & proliferation’ in some broad sense (or something like that)
  • [infinitely many more things like that]

If future humans switch from DNA to XNA, or upload themselves into simulations, or imprint their values on AI successors, or whatever, then the future would be high-reward according to some of those RL algorithms and the future would be zero-reward according to others of those RL algorithms.

In other words, one “experiment” is simultaneously providing evidence about what the results look like for infinitely many different RL algorithms. Lucky us.

(Related to: “goal misgeneralization”.)

I don’t think it’s productive to just stare at the list of bullet points and try to find the one that corresponds to the “broadest, truest” essence of natural selection. What does that even mean? Why is it relevant to this discussion?

I do think it is potentially productive to argue that the evidence from some of these bullet-point “experiments” is more relevant to AI alignment than the evidence from others of these bullet-point “experiments”. But to make that argument, one needs to talk more specifically about what AI alignment will look like, and argue on that basis that some of the above bullet point RL algorithms are more disanalogous to AI alignment than others. This kind of argument wouldn’t be talking about which bullet point is “reasonable” or “the true essence of natural selection”, but rather about which bullet point is the tightest analogy to the situation where future programmers are developing powerful AI.

(And FWIW my answer to the latter is: none of the above—I think all of those bullet points are sufficiently disanalogous to AI alignment that we don’t really learn anything from them, except that they serve as an existence proof illustration of the extremely weak claim that inner misalignment in RL is not completely impossible. Further details here.)

Comment by Steven Byrnes (steve2152) on My Interview With Cade Metz on His Reporting About Slate Star Codex · 2024-03-27T13:06:11.602Z · LW · GW

I don’t think I was making that argument.

If lots of people have a false belief X, that’s prima facie evidence that “X is false” is newsworthy. There’s probably some reason that X rose to attention in the first place; and if nothing else, “X is false” at the very least should update our priors about what fraction of popular beliefs are true vs false.

Once we’ve established that “X is false” is newsworthy at all, we still need to weigh the cost vs benefits of disseminating that information.

I hope that everyone, including rationalists, is in agreement about all this. For example, prominent rationalists are familiar with the idea of infohazards, reputational risks, picking your battles, simulacra 2, and so on. I’ve seen a lot of strong disagreement on this forum about what newsworthy information should and shouldn’t be disseminated and in what formats and contexts. I sure have my own opinions!

…But all that is irrelevant to this discussion here. I was talking about whether Scott’s last name is newsworthy in the first place. For example, it’s not the case that lots of people around the world were under the false impression that Scott’s true last name was McSquiggles, and now NYT is going to correct the record. (It’s possible that lots of people around the world were under the false impression that Scott’s true last name is Alexander, but that misconception can be easily corrected by merely saying it’s a pseudonym.) If Scott’s true last name revealed that he was secretly British royalty, or secretly Albert Einstein’s grandson, etc., that would also at least potentially be newsworthy.

Not everything is newsworthy. The pebbles-on-the-sidewalk example I mentioned above is not newsworthy. I think Scott’s name is not newsworthy either. Incidentally, I also think there should be a higher bar for what counts as newsworthy in NYT, compared to what counts as newsworthy when I’m chatting with my spouse about what happened today, because of the higher opportunity cost.

Comment by Steven Byrnes (steve2152) on Modern Transformers are AGI, and Human-Level · 2024-03-26T23:23:57.727Z · LW · GW

My complaint about “transformative AI” is that (IIUC) its original and universal definition is not about what the algorithm can do but rather how it impacts the world, which is a different topic. For example, the very same algorithm might be TAI if it costs $1/hour but not TAI if it costs $1B/hour, or TAI if it runs at a certain speed but not TAI if it runs many OOM slower, or “not TAI because it’s illegal”. Also, two people can agree about what an algorithm can do but disagree about what its consequences would be on the world, e.g. here’s a blog post claiming that if we have cheap AIs that can do literally everything that a human can do, the result would be “a pluralistic and competitive economy that’s not too different from the one we have now”, which I view as patently absurd.

Anyway, “how an AI algorithm impacts the world” is obviously an important thing to talk about, but “what an AI algorithm can do” is also an important topic, and different, and that’s what I’m asking about, and “TAI” doesn’t seem to fit it as terminology.

Comment by Steven Byrnes (steve2152) on My Interview With Cade Metz on His Reporting About Slate Star Codex · 2024-03-26T21:45:37.762Z · LW · GW

There's a fact of the matter about whether the sidewalk on my street has an odd vs even number of pebbles on it, but I think everyone including rationalists will agree that there's no benefit of sharing that information. It's not relevant for anything else.

By contrast, taboo topics generally become taboo because they have important consequences for decisions and policy and life.

Comment by Steven Byrnes (steve2152) on My Interview With Cade Metz on His Reporting About Slate Star Codex · 2024-03-26T20:31:00.231Z · LW · GW

There were two issues: what is the cost of doxxing, and what is the benefit of doxxing. I think an equally important crux of disagreement is the latter, not the former. IMO the benefit was zero: it’s not newsworthy, it brings no relevant insight, publishing it does not advance the public interest, it’s totally irrelevant to the story. Here CM doesn’t directly argue that there was any benefit to doxxing; instead he kinda conveys a vibe / ideology that if something is true then it is self-evidently intrinsically good to publish it (but of course that self-evident intrinsic goodness can be outweighed by sufficiently large costs). Anyway, if the true benefit is zero (as I believe), then we don’t have to quibble over whether the cost was big or small.

Comment by Steven Byrnes (steve2152) on Modern Transformers are AGI, and Human-Level · 2024-03-26T20:07:07.042Z · LW · GW

I’m talking about the AI’s ability to learn / figure out a new system / idea / domain on the fly. It’s hard to point to a particular “task” that specifically tests this ability (in the way that people normally use the term “task”), because for any possible task, maybe the AI happens to already know how to do it.

You could filter the training data, but doing that in practice might be kinda tricky because “the AI already knows how to do X” is distinct from “the AI has already seen examples of X in the training data”. LLMs “already know how to do” lots of things that are not superficially in the training data, just as humans “already know how to do” lots of things that are superficially unlike anything they’ve seen before—e.g. I can ask a random human to imagine a purple colander falling out of an airplane and answer simple questions about it, and they’ll do it skillfully and instantaneously. That’s the inference algorithm, not the learning algorithm.

Well, getting an AI to invent a new scientific field would work as such a task, because it’s not in the training data by definition. But that’s such a high bar as to be unhelpful in practice. Maybe tasks that we think of as more suited to RL, like low-level robot control, or skillfully playing games that aren’t like anything in the training data?

Separately, I think there are lots of domains where “just generate synthetic data” is not a thing you can do. If an AI doesn’t fully ‘understand’ the physics concept of “superradiance” based on all existing human writing, how would it generate synthetic data to get better? If an AI is making errors in its analysis of the tax code, how would it generate synthetic data to get better? (If you or anyone has a good answer to those questions, maybe you shouldn’t publish them!! :-P )

Comment by Steven Byrnes (steve2152) on On attunement · 2024-03-26T18:45:55.265Z · LW · GW

For example, if you didn’t know that walking near a wasp nest is a bad idea, and then you do so, then I guess you could say “some part of the world comes forward … strangely new, and shining with meaning”, because from now on into the future, whenever you see a wasp nest, it will pop out with a new salient meaning “Gah those things suck”.

You wouldn’t use the word “attunement” for that obviously. “Attunement” is one of those words that can only refer to good things by definition, just as the word “contamination” can only refer to bad things by definition (detailed discussion here).

Comment by Steven Byrnes (steve2152) on Modern Transformers are AGI, and Human-Level · 2024-03-26T18:18:44.481Z · LW · GW

Well I’m one of the people who says that “AGI” is the scary thing that doesn’t exist yet (e.g. my FAQ or “why I want to move the goalposts on ‘AGI’”). I don’t think “AGI” is a perfect term for the scary thing that doesn’t exist yet, but my current take is that “AGI” is a less bad term compared to alternatives. (I was listing out some other options here.) In particular, I don’t think there’s any terminological option that is sufficiently widely-understood and unambiguous that I wouldn’t need to include a footnote or link explaining exactly what I mean. And if I’m going to do that anyway, doing that with “AGI” seems OK. But I’m open-minded to discussing other options if you (or anyone) have any.

Generative pre-training is AGI technology: it creates a model with mediocre competence at basically everything.

I disagree with that—as in “why I want to move the goalposts on ‘AGI’”, I think there’s an especially important category of capability that entails spending a whole lot of time working with a system / idea / domain, and getting to know it and understand it and manipulate it better and better over the course of time. Mathematicians do this with abstruse mathematical objects, but also trainee accountants do this with spreadsheets, and trainee car mechanics do this with car engines and pliers, and kids do this with toys, and gymnasts do this with their own bodies, etc. I propose that LLMs cannot do things in this category at human level, as of today—e.g. AutoGPT basically doesn’t work, last I heard. And this category of capability isn’t just a random cherrypicked task, but rather central to human capabilities, I claim. (See Section 3.1 here.)

Comment by Steven Byrnes (steve2152) on On attunement · 2024-03-25T15:20:35.035Z · LW · GW

One of the classic sketches of a utility-maximizing agent (“observation-utility maximizer”, cf Stable Pointers to Value) involves a “utility function box”. A proposal subsystem feeds a possible course-of-action and resulting expected future state of the world into the utility-function box, and the box says whether that plan is a good idea or bad idea. Repeat for many possible proposals, and then the agent executes the best proposal according to the utility-function box.

For the above agent, you’d say red=utility function box, blue=an “epistemic subsystem” within the source code designed to make increasingly accurate predictions of the future state of the world (which, again, are part of what gets fed into the box), black is because power is instrumentally useful towards red, blue is also because knowledge is instrumentally useful towards red (and therefore highly-capable agents do not generally self-modify to sabotage their “epistemic subsystems” but rather if anything self-modify to make them work ever better), white is absent unless red happens to point at it, and green is absent altogether. I think this is related to your observation that “anti-realist rationality struggles to capture” green (and white).
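To make that architecture concrete, here’s a minimal code sketch (all names are mine and purely illustrative; this is not anyone’s actual agent design):

```python
# Minimal sketch of an "observation-utility maximizer" with a utility-function box.
# Everything here is illustrative pseudocode-made-runnable, not a real agent.

def utility_box(predicted_world_state) -> float:
    """The 'red' part: scores a predicted future state of the world."""
    return predicted_world_state.get("paperclips", 0.0)   # placeholder utility

def predict_outcome(world_model, plan):
    """The 'blue' epistemic subsystem: predicts the state resulting from a plan."""
    return world_model(plan)

def choose_action(world_model, candidate_plans):
    # The proposal subsystem feeds each plan + predicted outcome into the box;
    # the agent executes whichever proposal the box scores highest.
    return max(candidate_plans,
               key=lambda plan: utility_box(predict_outcome(world_model, plan)))

# Example usage with a stub world-model:
toy_world_model = lambda plan: {"paperclips": float(len(plan))}
print(choose_action(toy_world_model, ["do nothing", "build a paperclip factory"]))
```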

I think the human brain is close to that picture, but with a couple key edits that are key to the algorithms actually working in practice, and those edits bring green into the picture.

The first edit is that the human “utility function box” starts blank/random and is edited over time by a different system (centrally involving the brainstem and “innate drives”). I think this is closely related to how a reward function continually sculpts a value function via TD learning in actor-critic RL. (The correspondence is “value function” ↔ “utility function box”, “reward function” ↔ “a different system”.) (For example, the trained AlphaZero value function “sees the beauty of such-and-such abstract patterns in Go pieces”, so to speak, even though that appreciation wasn’t in the AlphaZero source code; and by the same token the hippie “sees the beauty in the cycle of life”, especially after an acid trip, even if there’s nothing particularly related to that preference in the human genome.) (More details in §9.5 here.)
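For readers who want that correspondence spelled out, here’s a generic toy TD-learning sketch (my own illustration of actor-critic-style value learning, not a model of the brain):

```python
# Toy TD-learning sketch: a fixed "reward function" (cf. brainstem / innate drives)
# gradually sculpts an initially-blank learned value function (cf. the "utility
# function box"), as in actor-critic RL.

transitions = {"A": "B", "B": "C", "C": "A"}   # tiny deterministic environment
value = {s: 0.0 for s in transitions}          # value function starts out blank

def reward(state):                             # fixed, innate reward signal
    return 1.0 if state == "C" else 0.0

alpha, gamma = 0.1, 0.9
s = "A"
for _ in range(2000):
    s_next = transitions[s]
    td_error = reward(s_next) + gamma * value[s_next] - value[s]
    value[s] += alpha * td_error               # the TD update sculpts the values
    s = s_next

print(value)   # no longer blank: "B" (one step before the reward) is valued most
```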

The second edit is that the human equivalent of a “utility function box” can find anything appealing, not just a possible state of the world in the distant future. (Specifically, it can grow to “like” any latent variable in the current world-model, see The Pointers Problem or §9.2 here.) That’s why humans can find themselves motivated by intuitive rules, virtues, norms, etc., and not just “outcomes”—see my discussion at Consequentialism & Corrigibility.

Putting these two together, we find that (1) humans inevitably have experiences that change our goals and values, (2) it’s possible for humans to come to intrinsically value the existence of such experiences, and thus e.g. write essays about the profound importance of attunement.

Of course, (2) is not the only possibility; another possibility is winding up with an agent that does have preferences about the future state of the world, and where those preferences are sufficiently strong that the agent self-modifies to stop any further (1) (cf. instrumental convergence).

Humans can go both ways about (1). Sometimes we see (1) as bad (I don’t want to get addicted to heroin, I don’t want to get brainwashed, I don’t want to stop caring about my children), and sometimes we see (1) as good (I generally like the idea that my personality and preferences will continue evolving with time and life experience etc.). This essay talks a lot about the latter but doesn’t mention the former (nothing wrong with that, it’s just interesting).

I’m not trying to prove any particular point here, just riffing/chatting.

Comment by Steven Byrnes (steve2152) on Social status part 1/2: negotiations over object-level preferences · 2024-03-20T03:08:07.617Z · LW · GW

I think most people want to be able to tell themselves a story in which they act in a way that society sees as praiseworthy, or at least not too blameworthy.

I think there’s a blurry line between whether that general preference is about “self-image” versus “image of oneself from the perspective of an imagined third party”. I’m not even sure if there’s a line at all—maybe that’s just saying the same thing twice.

Anyway, lying is generally seen as bad and blameworthy in our culture (with some exceptions like “lying to the hated outgroup in order to help avert the climate crisis”, or “a white lie for the benefit of the hearer”, etc.). Spinning / being misleading is generally seen as OK, or at least less bad, in our culture—everyone does that all the time.

Given that cultural background, obviously most people will feel motivated to spin rather than lie.

But that just pushes the question to “why is there a stronger cultural norm against lying than against spinning”? Probably just what you wrote—it’s easier to get away with spinning because of the “common knowledge of guilt” thing. It’s harder to police, so more people do it. And then everyone kinda gets inured to it and starts seeing it as (relatively) culturally acceptable, I think.

Separately, I kinda think there was never any reason to expect the listener’s preferences to enter the equation. If I cared so much about the listener’s preferences, I wouldn’t be trying to deceive them in the first place, right? Even if I nominally cared about the listener’s preferences, well, accurately seeing a situation from someone else’s perspective is hard and rare even under the best of circumstances (i.e., when you like them and know them well and they’re in the same room as you). The result you mention (I presume this paper) is not the best of circumstances—it asks survey takers to answer questions about imaginary vignettes, which is kinda stacking the deck even more than usual against caring about the listener’s preferences. (And maybe all surveys are BS anyway.)

Comment by Steven Byrnes (steve2152) on Social status part 1/2: negotiations over object-level preferences · 2024-03-15T15:09:28.483Z · LW · GW

The very next paragraph after the dinosaur-train thing says:

Of course, in polite society, we would typically both be “50%-leading”, or at least “close-to-50% leading”, and thus we would deftly and implicitly negotiate a compromise. Maybe the conversation topic would bounce back and forth between trains and dinosaurs, or we would find something else entirely to talk about, or we would stop talking and go watch TV, or we would find an excuse to cordially end our interaction, etc.

I think it’s really obvious that friends seek compromises and win-win solutions where possible, and I think it’s also really obvious that not all participants in an interaction are going to wind up in the best of all possible worlds by their own lights. I think you’re unhappy that I’m spending the whole post talking about the latter, and basically don’t talk about finding win-wins apart from that one little paragraph, because you feel that this choice of emphasis conveys a vibe of “people are just out to get each other and fight for their own interest all the time” instead of a vibe of “kumbaya let’s all be friends”. If so, that’s not deliberate. I don’t feel that vibe and was not trying to convey it. I have friends and family just like you. Instead, I’m focusing on the latter because I think I have interesting things to say about it.

I think there’s also something else going on with your comment though…

As I mention in the third paragraph, there’s a kind of cultural taboo where we’re all supposed to keep up the pretense that mildly conflicting preferences between good friends simply don’t exist. Objectively speaking, I mean, what are the chances that my object-level preferences are exactly the same as yours down to the twelfth decimal place? Zero, right? But if we’re chatting, and you would very very slightly rather continue talking about sports, while I would very very slightly rather change the subject to the funny story that I heard at work, then we will mutually engage in a conspiracy of silence about this slight divergence, because mentioning the divergence out loud (and to some extent, even bringing it to conscious awareness in the privacy of your own head!) is not treated as pointing out an obvious innocuous truth, but rather a rude and pushy declaration that our perfectly normal slight divergence in immediate conversational interests is actually a big deal that poses a serious threat to our ability to get along and has to be somehow dealt with. It’s not! It’s perfectly fine and normal!

The post itself attempts to explain why this taboo / pretense / conspiracy-of-silence exists. And it would be kinda ironic if the post itself is getting misunderstood thanks to the very same conversational conventions that it is attempting to explain.  :)

Comment by Steven Byrnes (steve2152) on Social status part 1/2: negotiations over object-level preferences · 2024-03-15T10:58:41.365Z · LW · GW

Thanks for the suggestions; I rewrote the intro, and what you call "Section 1.3" is the new "Section 1.2".

Comment by Steven Byrnes (steve2152) on Highlights from Lex Fridman’s interview of Yann LeCun · 2024-03-13T23:51:41.542Z · LW · GW

Here are clarifications for a couple minor things I was confused about while reading: 

a GPU is way below the power of the human brain. You need something like 100,000 or a million to match it, so we are off by a huge factor here.

I was trying to figure out where LeCun’s 100,000+ claim is coming from, and I found this 2017 article (paywalled), whose subheading implies that he’s focusing on the 10^14 synapses in a human brain, and comparing that to the number of neuron-to-neuron connections that a GPU can handle.

(If so, I strongly disagree with that comparison, but I don’t want to get into that in this little comment.)

Francois Chollet says “The actual information input of the visual system is under 1MB/s”.

I don’t think Chollet is directly responding to LeCun’s point, because Chollet is saying that optical information is compressible to <1MB/s, but LeCun is comparing uncompressed human optical bits to uncompressed LLM text bits. And the text bits are presumably compressible too!

It’s possible that the compressibility (useful information content per bit) of optical information is very different from the compressibility of text information, in which case LeCun’s comparison is misleading, but neither LeCun nor Chollet is making claims one way or the other about that, AFAICT.

Comment by Steven Byrnes (steve2152) on Some (problematic) aesthetics of what constitutes good work in academia · 2024-03-13T14:09:33.518Z · LW · GW

Fwiw I don't think the main paper would have been much shorter if we'd aimed to write a blog post instead…

Oops. Thanks. I should have checked more carefully before writing that. I was wrong and have now put a correction into the post.

Comment by Steven Byrnes (steve2152) on “Artificial General Intelligence”: an extremely brief FAQ · 2024-03-13T13:55:10.150Z · LW · GW

I think you’re not the target audience for this post.

Pick a random person on the street who has used chatGPT, and ask them to describe a world in which we have AI that is “like chatGPT but better”. I think they’ll describe a world very much like today’s, but where chatGPT hallucinates less and writes better essays. I really don’t think they’ll describe the right-column world. If you’re imagining the right-column world, then great! Again, you’re not the target audience.

Comment by Steven Byrnes (steve2152) on Evolution did a surprising good job at aligning humans...to social status · 2024-03-13T13:29:59.605Z · LW · GW

I think you’re responding to something different than what I was saying.

Again, let’s say Bob wants to sit at the cool kid’s table at lunch, and Bob dreams of being a movie star at dinner. Bob feels motivated to do one thing, and then later on Bob feels motivated to do the other thing. Both are still clearly goal-directed behaviors: At lunchtime, Bob’s “planning machinery” is pointed towards “sitting at the cool kid’s table”, and at dinnertime, Bob’s “planning machinery” is pointed towards “being a movie star”. Neither of these things can be accomplished by unthinking habits and reactions, obviously.

I think there’s a deep-seated system in the brainstem (or hypothalamus). When Bob’s world-model (cortex) is imagining a future where he is sitting at the cool kid’s table, then this brainstem system flags that future as “desirable”. Then later on, when Bob’s world-model (cortex) is imagining a future where he is a movie star, then this brainstem system flags that future as “desirable”. But from the perspective of Bob’s world-model / cortex / conscious awareness (both verbalized and not), there does not have to be any concept that makes a connection between “sit at the cool kid’s table” and “be a movie star”. Right?

By analogy, if Caveman Oog feels motivated to eat meat sometimes, and to eat vegetables other times, then it might or might not be the case that Oog has a single concept akin to the English word “eating” that encompasses both eating-meat and eating-vegetables. Maybe in his culture, those are thought of as two totally different activities—the way we think of eating versus dancing. It’s not like there’s no overlap between eating and dancing—your heart is beating in both cases, it’s usually-but-not-always a group activity in both cases, it alleviates boredom in both cases—but there isn’t any concept in English unifying them. Likewise, if you asked Oog about eating-meat versus eating-vegetables, he would say “huh, never thought about that, but yeah sure, I guess they do have some things in common, like both involve putting stuff into one’s mouth and moving the jaw”. I’m not saying that this Oog thought experiment is likely, but it’s possible, right? And that illustrates the fact that coherently-and-systematically-planning-to-eat does not rely on having a concept of “eating”, whether verbalized or not.

Comment by Steven Byrnes (steve2152) on Some (problematic) aesthetics of what constitutes good work in academia · 2024-03-13T12:56:28.773Z · LW · GW

Point 1: I think “writing less concisely than would be ideal” is the natural default for writers, so we don’t need to look to incentives to explain it. Pick up any book of writing advice and it will say that, right? “You have to kill your darlings”, “If I had more time, I would have written a shorter letter”, etc.

Point 2: I don’t know if this applies to you-in-particular, but there’s a systematic dynamic where readers generally somewhat underestimate the ideal length of a piece of nonfiction writing. The problem is, the writer is writing for a heterogeneous audience of readers. Different readers are coming in with different confusions, different topics-of-interest, different depths-of-interest, etc. So you can imagine, for example, that every reader really only benefits from 70% of the prose … but it’s a different 70% for different readers. Then each individual reader will be complaining that it’s unnecessarily long, but actually it can’t be cut at all without totally losing a bunch of the audience.

(To be clear, I think both of these are true—Point 2 is not meant as a denial of Point 1; not all extra length is adding anything. I think the solution is both to try to write concisely and to make it easy for the reader to recognize and skip over the parts that they don’t need to read, for example with good headings and a summary / table of contents at the top. Making it fun to read can also somewhat substitute for making it quick to read.)
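Here’s a quick toy simulation of Point 2 (all the numbers are invented, just to show the shape of the dynamic): if each reader benefits from a random 70% of the sections, then every individual reader experiences the piece as roughly 30% too long, and yet cutting any single section deletes something that roughly 70% of readers needed.

```python
import random

random.seed(0)
n_sections, n_readers = 20, 200
# Each reader benefits from a random 70% of the sections (14 of 20).
useful_to = [set(random.sample(range(n_sections), k=14)) for _ in range(n_readers)]

# Fraction of the piece that the average reader benefits from: ~0.7,
# so every individual reader experiences it as ~30% too long.
print(sum(len(u) for u in useful_to) / (n_readers * n_sections))

# But cutting any single section loses content that ~70% of readers needed.
readers_hurt = [sum(s in u for u in useful_to) for s in range(n_sections)]
print(min(readers_hurt), max(readers_hurt))  # roughly 140 of 200 readers per section
```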

Comment by Steven Byrnes (steve2152) on Some (problematic) aesthetics of what constitutes good work in academia · 2024-03-12T22:27:01.459Z · LW · GW

Hmm, I notice a pretty strong negative correlation between how long it takes me to write a blog post and how much karma it gets. For example, very recently I spent like a month of full-time work to write two posts on social status (karma = 71 & 36), then I took a break to catch up on my to-do list, in the course of which I would sometimes spend a few hours dashing off a little post, and there have been three posts in that category, and their karma is 57, 60, 121 (this one). So, 20ish times less effort, somewhat more karma. This is totally in line with my normal expectations.

I think that’s because if I’m spending a month on a blog post then it’s probably going to be full of boring technical details such that it’s not fun to read, and if I spend a few hours on a blog post like this one, it’s gonna consist of stories and rants and wild speculation and so on, which is more fun to read.

In terms of word count, here you go, I did the experiment:

I could make a long list of “advice” for getting lots of lesswrong karma (advice that probably actually makes a post less valuable), but I don’t think “conspicuous signals of effort” would be on it. Instead it would be things like: give it a clickbaity title & intro; make it about an ongoing hot debate (e.g. the eternal battle between “yay Eliezer” vibes and “boo Eliezer” vibes, the debate over whether p(doom) is high or low, AGI timelines, etc.); make it reinforce popular rationalist tribal beliefs (yay YIMBY, boo FDA, etc.—I guess this post is doing that a bit); make it an easy read; don’t mention AI, because the AI tag gets penalized by default in the frontpage ranking; and so on. My impression is that length per se is not particularly rewarded in terms of LW karma, and that the kind of “rigor” that would be well-received in peer review (e.g. comprehensive lit reviews) is a negative in terms of lesswrong karma.

Of course this is stupid, because karma is meaningless internet points, and the obvious right answer is to basically not care about lesswrong karma in the first place. Instead I recommend metrics like “My former self would have learned a lot from reading this” or “This constitutes meaningful progress on such-and-such long-term project that I’m pursuing and care about”. For example, I have a number of super-low-karma posts that I feel great pride and affinity towards. I am not doing month-long research projects because it’s a good way to get karma, which it’s not, but rather because it’s a good way to make progress on my long-term research agenda. :)

Comment by Steven Byrnes (steve2152) on “Artificial General Intelligence”: an extremely brief FAQ · 2024-03-12T16:17:38.002Z · LW · GW

Thanks. I made some edits to the questions. I’m open to more suggestions.

This is already version 3 of that image (see v1, v2), but I’m very open to suggestions on that too.

Comment by Steven Byrnes (steve2152) on Deconstructing Bostrom's Classic Argument for AI Doom · 2024-03-11T18:50:06.831Z · LW · GW

Agree—I was also arguing for “trivial” in this EA Forum thread a couple years ago.

Comment by Steven Byrnes (steve2152) on Evolution did a surprising good job at aligning humans...to social status · 2024-03-11T12:49:04.508Z · LW · GW

I want to briefly note my disagreement: I think the genome specifically builds what might be called an innate status drive into the brain (stronger in some people than others), in addition to within-lifetime learning. See my discussions here and here, plus this comment thread, and hopefully better discussion in future posts.

Comment by Steven Byrnes (steve2152) on Evolution did a surprising good job at aligning humans...to social status · 2024-03-11T12:38:41.497Z · LW · GW

I was asking because I published 14,000 words on the phenomenon of human status-seeking last week. :-P I agree that there have been many oversimplified accounts of how status works. I hope mine is not one of them. I agree that “status is not a single variable” and that “deference tree” accounts are misleading. (I think the popular lesswrong / Robin Hanson view is that status is two variables rather than one, but I think that’s still oversimplified.)

I don’t think the way that “lesswrong community members” actually relate to each other is “ruthless sociopathic businesspeople … command hierarchy … deference tree” kind of stuff. I mean, there’s more-than-zero of that, but not much, and I think less of it in lesswrong than in most groups that I’ve experienced—I’m thinking of places I’ve worked, college clubs, friend groups, etc. Hmm, oh here’s an exception, “the group of frequent Wikipedia physics article editors from 2005-2018” was noticeably better than lesswrong on that axis, I think. I imagine that different people have different experiences of the “lesswrong community” though. Maybe I have subconsciously learned to engage with some parts of the community more than others.

Comment by Steven Byrnes (steve2152) on Evolution did a surprising good job at aligning humans...to social status · 2024-03-11T11:22:50.105Z · LW · GW

I don't mean that humans become aligned with their explicit verbal concept of status. I mean that (many) humans are aligned with the intuitive concept that they somehow learn over the course of development.

How do you know that there is any intuitive concept there? For example, if Bob wants to sit at the cool kids’ table at lunch and Bob dreams of being a movie star at dinner, who’s to say that there is a single concept in Bob’s brain, verbalized or not, active during both those events and tying them together? Why can’t it simply be the case that Bob feels motivated to do one thing, and then later on Bob feels motivated to do the other thing?

Comment by Steven Byrnes (steve2152) on Evolution did a surprising good job at aligning humans...to social status · 2024-03-11T01:38:11.416Z · LW · GW

Is that intended to mean “lesswrong people are obsessed with their own and each other’s status”, or “lesswrong people are obsessed with the phenomenon of human status-seeking”? (or something else?)

Comment by Steven Byrnes (steve2152) on Evolution did a surprising good job at aligning humans...to social status · 2024-03-11T00:07:06.727Z · LW · GW

I disagree with “natural selection got the concept of ‘social status’ into us”, or with the claim that status-seeking behavior is tied to “having an intuitive ‘status’ concept”.

For example, if Bob wants to be a movie star, then from the outside you and I can say that Bob is status-seeking, but it probably doesn’t feel like that to Bob; in fact Bob might not know what the word “status” means, and Bob might be totally oblivious to the existence of any connection between his desire to be a movie star and Alice’s desire to be a classical musician and Carol’s desire to eat at the cool kids’ table in middle school.

I think “status seeking” is a mish-mosh of a bunch of different things but I think an important one is very roughly “it’s intrinsically motivating to believe that other people like me”. (More discussion in §2.2.2 & §2.6.1 here and hopefully more in future posts.) I think it’s possible for the genome to build “it’s intrinsically motivating to believe that other people like me” into the brain whereas it would not be analogously possible for the genome to build “it’s intrinsically motivating to have a high inclusive genetic fitness” into the brain. There are many reasons that the latter is not realistic, not least of which is that inclusive genetic fitness is only observable in hindsight, after you’re dead.

Comment by Steven Byrnes (steve2152) on Evolution did a surprising good job at aligning humans...to social status · 2024-03-10T23:52:36.726Z · LW · GW

…and eating, and breastfeeding…

Comment by Steven Byrnes (steve2152) on When and why did 'training' become 'pretraining'? · 2024-03-08T15:25:21.403Z · LW · GW

Other people know more than I do, but my impression was that the lineage of LLMs traces back to things like ULMFiT (2018), where the goal was not to generate text but rather to do non-generative NLP tasks like sentiment classification, spam detection, and so on. Then you (1) do self-supervised “pretraining”, (2) edit/replace the output layer(s) to convert it from “a model that can output token predictions” to “a model that can output text-classifier labels / scores”, and (3) fine-tune this new model (especially the newly-added parts) on human-supervised (text, label) pairs. Or something like that.

The word “pretraining” makes more sense than “training” in that context because “training” would incorrectly imply “training the model to do text classification”, i.e. the eventual goal. …And then I guess the term “pretraining” stuck around after it stopped making so much sense.
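In case it helps, here’s a minimal PyTorch-style sketch of that three-step pipeline (my own toy illustration, not ULMFiT’s actual architecture or code; the layer sizes, class names, and the `labeled_loader` are all made up):

```python
import torch
import torch.nn as nn

class LanguageBackbone(nn.Module):
    """Stand-in for the pretrained encoder (the real thing would be an
    LSTM/transformer stack trained on a large corpus)."""
    def __init__(self, vocab_size=10_000, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.LSTM(d_model, d_model, batch_first=True)

    def forward(self, token_ids):
        hidden, _ = self.encoder(self.embed(token_ids))
        return hidden  # (batch, seq_len, d_model)

# (1) Self-supervised "pretraining": backbone plus a head that outputs token predictions
backbone = LanguageBackbone()
lm_head = nn.Linear(256, 10_000)
# ...train backbone + lm_head on next-token prediction over a big unlabeled corpus...

# (2) Swap the output layer: now the model outputs text-classifier labels / scores
classifier_head = nn.Linear(256, 2)  # e.g. spam vs. not-spam

def classify(token_ids):
    pooled = backbone(token_ids)[:, -1, :]  # use the final hidden state as a summary
    return classifier_head(pooled)          # (batch, 2) label scores

# (3) Fine-tune (especially the new head) on human-supervised (text, label) pairs
optimizer = torch.optim.Adam(
    list(backbone.parameters()) + list(classifier_head.parameters()), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()
# for token_ids, labels in labeled_loader:
#     loss = loss_fn(classify(token_ids), labels)
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
```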

Comment by Steven Byrnes (steve2152) on Many arguments for AI x-risk are wrong · 2024-03-07T13:31:15.415Z · LW · GW

In RLHF, if you want the AI to do X, then you look at the two options and give a thumbs-up to the one where it’s doing more X rather than less X. Very straightforward!

By contrast, if the MTurkers want AlphaZero-MTurk to do X, then they have their work cut out for them. Their basic strategy would have to be: wait for AlphaZero-MTurk to do X, and then immediately throw the game (= start deliberately making really bad moves). But there are a bunch of reasons why that might not work well, or at all: (1) if AlphaZero-MTurk is already in a position where it can definitely win, then the MTurkers lose their ability to throw the game (i.e., if they start making deliberately bad moves, AlphaZero-MTurk’s win probability just changes from ≈100% to ≈100%); (2) there’s a reward-shaping challenge (i.e., if AlphaZero-MTurk does something close to X but not quite X, should you throw the game or not? I guess you could start playing slightly worse, in proportion to how close the AI is to doing X, but it’s probably really hard to exercise such fine-grained control over your move quality); (3) if X is a time-extended thing as opposed to a single move (e.g. “X = playing in a conservative style” or whatever), then what are you supposed to do? (4) Maybe other things too.
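For concreteness, here’s a toy sketch of how that thumbs-up gets used in standard RLHF reward modeling: the human picks whichever of the two outputs shows more X, and a reward model is trained with a pairwise (Bradley-Terry-style) loss so that the preferred output scores higher. The dimensions, variable names, and `comparison_batches` are all made up; this is just the shape of the thing.

```python
import torch
import torch.nn as nn

# Toy reward model over (made-up) fixed-size embeddings of the two candidate outputs.
reward_model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-5)

def preference_loss(emb_preferred, emb_rejected):
    """Push the thumbs-up output's reward above the other output's reward."""
    r_pref = reward_model(emb_preferred)
    r_rej = reward_model(emb_rejected)
    return -torch.log(torch.sigmoid(r_pref - r_rej)).mean()

# for emb_a, emb_b, human_prefers_a in comparison_batches:
#     pref, rej = (emb_a, emb_b) if human_prefers_a else (emb_b, emb_a)
#     loss = preference_loss(pref, rej)
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
```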

Comment by Steven Byrnes (steve2152) on Rationality Research Report: Towards 10x OODA Looping? · 2024-03-07T02:57:24.587Z · LW · GW

something that's really weirded me out about the literature on IQ, transfer learning, etc, is that... it seems like it's just really hard to transfer learn. We've basically failed to increase g, and the "transfer learning demonstrations" I've heard of seemed pretty weaksauce.

You might be referring to the skeptical take on transfer learning, summarized as follows in Surfaces and Essences by Hofstadter & Sander:

Experimental studies have indeed demonstrated that subjects who are shown a source situation and who are then given a target situation are usually unable to see any connection between the two unless they share surface-level traits. Furthermore, in such experiments, when two situations have a superficial resemblance, then the second one invariably brings the first one to mind, no matter whether it is appropriate or not (that is, irrespective of whether there are deeper reasons to connect the two cases). For instance, if subjects first tackle an arithmetic problem concerning items bought in a store, then any other problem concerning purchases will instantly remind them of the initial problem. But if the theme of the first problem is experimentally manipulated — say it becomes a visit to a doctor’s office instead of a store — then the participants will almost surely see no link between the two stories, even if the solution method for the first problem applies perfectly to the second problem.

But then the authors argue that this skeptical take is misleading:

Unfortunately, the source–target [experimental] paradigm [in the studies above] has a serious defect that undermines the generality of the conclusions that experiments based upon it produce. This defect stems from the fact that the knowledge acquired about the source situation during the twenty minutes or so of a typical experiment is perforce very limited — often consisting merely in the application of a completely unfamiliar formula to a word problem. By contrast, when in real life we are faced with a new situation and have to decide what to do, the source situations we retrieve spontaneously and effortlessly from our memories are, in general, extremely familiar. We all depend implicitly on knowledge deeply rooted in our experiences over a lifetime, and this knowledge, which has been confirmed and reconfirmed over and over again, has also been generalized over time, allowing it to be carried over fluidly to all sorts of new situations. It is very rare that, in real life, we rely on an analogy to a situation with which we are barely familiar at all. To put it more colorfully, when it comes to understanding novel situations, we reach out to our family and our friends rather than to the first random passerby. But in the source–target paradigm, experimental subjects are required to reach out to a random passerby—namely, the one that was imposed on them as a source situation by the experimenter.

And so, what do the results obtained in the framework of this paradigm really demonstrate? What they show is that when people learn something superficially, they wind up making superficial analogies to it.

To rephrase: The problem is that, in the experimental protocol, the subjects only ever wind up with a crappy surface-level understanding of the source situation, not a deep mental model of the source situation reflective of true familiarity / expertise. When people do have real comfort and familiarity with the source situation, then they find deep structural analogies all over the place.

For example (these are my examples), if you talk to an economist about some weird situation, they will easily notice that there’s a supply-and-demand way to look at it, and ditto gains-from-trade and so on. And physicists will analogize random things to superpositions and Fourier space and so on. Of course, the main thing that everyone is an “expert” in is “intuitive everyday life stuff”, and hence our thinking and speech are full of constant non-surface-level analogies to traveling, seasons, ownership, arguments, etc.

I’m not sure if this is relevant to what you were saying, just thought I’d share.

Comment by Steven Byrnes (steve2152) on Many arguments for AI x-risk are wrong · 2024-03-06T16:49:59.433Z · LW · GW

(Disclaimer: Nothing in this comment is meant to disagree with “I just think it's not plausible that we just keep scaling up [LLM] networks, run pretraining + light RLHF, and then produce a schemer.” I’m agnostic about that, maybe leaning towards agreement, although that’s related to skepticism about the capabilities that would result.)

It is simply not true that "[RL approaches] typically involve creating a system that seeks to maximize a reward signal."

I agree that Bostrom was confused about RL. But I also think there are some vaguely-similar claims to the above that are sound, in particular:

  • RL approaches may involve inference-time planning / search / lookahead, and if they do, then that inference-time planning process can generally be described as “seeking to maximize a learned value function / reward model / whatever” (which need not be identical to the reward signal in the RL setup).
  • And if we compare Bostrom’s incorrect “seeking to maximize the actual reward signal” to the better “seeking at inference time to maximize a learned value function / reward model / whatever to the best of its current understanding”, then…
  • RL approaches historically have typically involved the programmer wanting to get a maximally high reward signal, and creating a training setup such that the resulting trained model does stuff that gets as high a reward signal as possible. And this continues to be a very important lens for understanding why RL algorithms work the way they work. Like, if I were teaching an RL class and needed to explain the formulas for TD learning or PPO or whatever, I think I would struggle to explain them without saying something like “let’s pretend that you, the programmer, are interested in producing trained models that score maximally highly according to the reward function. How would you update the model parameters in such-and-such situation…?” Right? (See the toy sketch after this list.)
  • Related to the previous bullet, I think many RL approaches have a notion of “global optimum” and “training to convergence” (e.g. given infinite time in a finite episodic environment). And if a model is “trained to convergence”, then it will behaviorally “seek to maximize a reward signal”. I think that’s important to have in mind, although it might or might not be relevant in practice.
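Here’s the kind of toy sketch I have in mind (tabular Q-learning rather than PPO, and entirely my own illustration): the update rule is most naturally explained by imagining that you, the programmer, want the learned values, and hence the chosen actions, to track the highest achievable reward.

```python
import random

ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.1
Q = {}  # (state, action) -> estimated return

def td_update(state, action, reward, next_state, actions):
    # TD target: the reward we just got, plus the best return we currently
    # estimate is achievable from the next state. In other words, we bootstrap
    # the estimate toward "as much reward as possible".
    best_next = max(Q.get((next_state, a), 0.0) for a in actions)
    target = reward + GAMMA * best_next
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + ALPHA * (target - old)

def act(state, actions):
    # Epsilon-greedy: mostly pick the action the learned values say is best.
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```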

I bet people would care a lot less about “reward hacking” if RL’s reinforcement signal hadn’t ever been called “reward.”

In the context of model-based planning, there’s a concern that the AI will come upon a plan which from the AI’s perspective is a “brilliant out-of-the-box solution to a tricky problem”, but from the programmer’s perspective is “reward-hacking, or Goodharting the value function (a.k.a. exploiting an anomalous edge-case in the value function), or whatever”. Treacherous turns would probably be in this category.
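To make that worry slightly more concrete, here’s a stylized sketch (my own toy example, obviously not a real value function): the planner searches for the plan with the highest score under the learned value function, and the argmax lands on an edge case where that learned function is anomalously wrong.

```python
def learned_value(plan: str) -> float:
    """The AI's learned value function / reward model, with one anomalous edge case."""
    return {
        "ordinary solution": 1.0,
        "clever out-of-the-box solution": 2.0,
        "weird edge-case exploit": 100.0,  # never seen in training; scored absurdly high
    }[plan]

def programmer_evaluation(plan: str) -> float:
    """What the programmer actually thinks of each plan."""
    return {
        "ordinary solution": 1.0,
        "clever out-of-the-box solution": 2.0,
        "weird edge-case exploit": -50.0,
    }[plan]

candidate_plans = ["ordinary solution", "clever out-of-the-box solution", "weird edge-case exploit"]
chosen = max(candidate_plans, key=learned_value)   # planning = argmax over the learned value
print(chosen, programmer_evaluation(chosen))       # the edge case wins; the programmer is horrified
```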

There’s a terminology problem where if I just say “the AI finds an out-of-the-box solution”, it conveys the positive connotation but not the negative one, and if I just say “reward-hacking” or “Goodharting the value function” it conveys the negative part without the positive.

The positive part is important. We want our AIs to find clever out-of-the-box solutions! If AIs are not finding clever out-of-the-box solutions, people will presumably keep improving AI algorithms until they do.

Ultimately, we want to be able to make AIs that think outside of some of the boxes but definitely stay inside other boxes. But that’s tricky, because the whole idea of “think outside the box” is that nobody is ever aware of which boxes they are thinking inside of.

Anyway, this is all a bit abstract and weird, but I guess I’m arguing that the words “reward hacking” are generally pointing towards a very important AGI-safety-relevant phenomenon, whatever we want to call it.