Posts

Response to Dileep George: AGI safety warrants planning ahead 2024-07-08T15:27:07.402Z
Incentive Learning vs Dead Sea Salt Experiment 2024-06-25T17:49:01.488Z
(Appetitive, Consummatory) ≈ (RL, reflex) 2024-06-15T15:57:39.533Z
[Valence series] 4. Valence & Liking / Admiring 2024-06-10T14:19:51.194Z
Response to nostalgebraist: proudly waving my moral-antirealist battle flag 2024-05-29T16:48:29.408Z
Spatial attention as a “tell” for empathetic simulation? 2024-04-26T15:10:58.040Z
A couple productivity tips for overthinkers 2024-04-20T16:05:50.332Z
“Artificial General Intelligence”: an extremely brief FAQ 2024-03-11T17:49:02.496Z
Some (problematic) aesthetics of what constitutes good work in academia 2024-03-11T17:47:28.835Z
Woods’ new preprint on object permanence 2024-03-07T21:29:57.738Z
Social status part 2/2: everything else 2024-03-05T16:29:19.072Z
Social status part 1/2: negotiations over object-level preferences 2024-03-05T16:29:07.143Z
Four visions of Transformative AI success 2024-01-17T20:45:46.976Z
Deceptive AI ≠ Deceptively-aligned AI 2024-01-07T16:55:13.761Z
[Valence series] Appendix A: Hedonic tone / (dis)pleasure / (dis)liking 2023-12-20T15:54:17.131Z
[Valence series] 5. “Valence Disorders” in Mental Health & Personality 2023-12-18T15:26:29.970Z
[Valence series] 4. Valence & Social Status (deprecated) 2023-12-15T14:24:41.040Z
[Valence series] 3. Valence & Beliefs 2023-12-11T20:21:30.570Z
[Valence series] 2. Valence & Normativity 2023-12-07T16:43:49.919Z
[Valence series] 1. Introduction 2023-12-04T15:40:21.274Z
Thoughts on “AI is easy to control” by Pope & Belrose 2023-12-01T17:30:52.720Z
I’m confused about innate smell neuroanatomy 2023-11-28T20:49:13.042Z
8 examples informing my pessimism on uploading without reverse engineering 2023-11-03T20:03:50.450Z
Late-talking kid part 3: gestalt language learning 2023-10-17T02:00:05.182Z
“X distracts from Y” as a thinly-disguised fight over group status / politics 2023-09-25T15:18:18.644Z
A Theory of Laughter—Follow-Up 2023-09-14T15:35:18.913Z
A Theory of Laughter 2023-08-23T15:05:59.694Z
Model of psychosis, take 2 2023-08-17T19:11:17.386Z
My checklist for publishing a blog post 2023-08-15T15:04:56.219Z
Lisa Feldman Barrett versus Paul Ekman on facial expressions & basic emotions 2023-07-19T14:26:05.675Z
Thoughts on “Process-Based Supervision” 2023-07-17T14:08:57.219Z
Munk AI debate: confusions and possible cruxes 2023-06-27T14:18:47.694Z
My side of an argument with Jacob Cannell about chip interconnect losses 2023-06-21T13:33:49.543Z
LeCun’s “A Path Towards Autonomous Machine Intelligence” has an unsolved technical alignment problem 2023-05-08T19:35:19.180Z
Connectomics seems great from an AI x-risk perspective 2023-04-30T14:38:39.738Z
AI doom from an LLM-plateau-ist perspective 2023-04-27T13:58:10.973Z
Is “FOXP2 speech & language disorder” really “FOXP2 forebrain fine-motor crappiness”? 2023-03-23T16:09:04.528Z
EAI Alignment Speaker Series #1: Challenges for Safe & Beneficial Brain-Like Artificial General Intelligence with Steve Byrnes 2023-03-23T14:32:53.800Z
Plan for mediocre alignment of brain-like [model-based RL] AGI 2023-03-13T14:11:32.747Z
Why I’m not into the Free Energy Principle 2023-03-02T19:27:52.309Z
Why I’m not working on {debate, RRM, ELK, natural abstractions} 2023-02-10T19:22:37.865Z
Heritability, Behaviorism, and Within-Lifetime RL 2023-02-02T16:34:33.182Z
Schizophrenia as a deficiency in long-range cortex-to-cortex communication 2023-02-01T19:32:24.447Z
“Endgame safety” for AGI 2023-01-24T14:15:32.783Z
Thoughts on hardware / compute requirements for AGI 2023-01-24T14:03:39.190Z
Note on algorithms with multiple trained components 2022-12-20T17:08:24.057Z
More notes from raising a late-talking kid 2022-12-20T02:13:01.018Z
My AGI safety research—2022 review, ’23 plans 2022-12-14T15:15:52.473Z
The No Free Lunch theorem for dummies 2022-12-05T21:46:25.950Z
My take on Jacob Cannell’s take on AGI safety 2022-11-28T14:01:15.584Z

Comments

Comment by Steven Byrnes (steve2152) on dddiiiiitto's Shortform · 2024-09-11T01:23:50.241Z · LW · GW

I happened upon an X thread where cremieuxrecueil describes a particular study that concludes fluoridation is fine, then Scott Alexander replies that the literature is complicated and prenatal exposure seems at least plausibly bad, and then cremieuxrecueil replies that the rest of the literature is worse but agrees that bad prenatal effects are still possible. Also, a couple other people including gwern chimed in to agree that the study cremieuxrecueil likes is indeed the best study in the literature.

Comment by Steven Byrnes (steve2152) on Why Large Bureaucratic Organizations? · 2024-08-27T19:53:19.807Z · LW · GW

… and it turns out that animal-behavior researchers have done exactly that kind of test; they call it testing for “linearity”. Indeed, they’ve done it many, many times over, with several different operationalizations of the statistics, in a whole slew of species.
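For a concrete sense of what such a “linearity” test computes, here's a toy sketch of one classic operationalization, Landau's h index, applied to made-up four-animal win/loss matrices (the matrices and code are purely illustrative; real studies use observed interaction counts plus significance tests):

```python
import numpy as np

# Landau's h: 1.0 for a perfectly linear pecking order; values near 0
# indicate lots of rock-paper-scissors-style intransitive triads.
# The 4-animal win/loss matrices below are made up for illustration.

def landau_h(dominance):
    n = dominance.shape[0]
    v = dominance.sum(axis=1)  # how many others each animal dominates
    return (12 / (n**3 - n)) * np.sum((v - (n - 1) / 2) ** 2)

linear = np.array([      # A beats B, C, D; B beats C, D; C beats D
    [0, 1, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
])

cyclic = np.array([      # A beats B, B beats C, C beats A (all beat D)
    [0, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
])

print(landau_h(linear))  # 1.0
print(landau_h(cyclic))  # 0.6 -- the intransitive triad drags it down
```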

This thread makes it seem like rock-paper-scissors “pecking order” is not that uncommon, at least among r/BackYardChickens subreddit participants.

I have a moderately-anti-status-ladders-per-se-being-important discussion in §2.5.1 here.

I think every part of your post where you rely on the existence of a strict status ladder, could be lightly rephrased to not rely on that, without any substantive change.

Once pointed out, that also sounds like how human status tends to work! The new hire at the company, the new kid at school, the new member to the social group, the visitor at another’s house… all these people typically have very low dominance-status, at least within their new context.

I think it comes from being more confident / comfortable in a familiar environment. There’s some game theory at play, see §2.5.4 here.

Comment by Steven Byrnes (steve2152) on Wei Dai's Shortform · 2024-08-27T18:52:10.523Z · LW · GW

I was just reading Daniel Dennett’s memoir for no reason in particular; it had some interesting glimpses into how professional philosophers actually practice philosophy. Like I guess there’s a thing where one person reads their paper aloud (word-for-word!) and then someone else is the designated criticizer? I forget the details. Extremely different from my experience in physics academia though!!

(Obviously, reading that memoir is probably not the most time-efficient way to learn about the day-to-day practice of academic philosophy.)

(Oh, there was another funny anecdote in the memoir where the American professional philosopher association basically had a consensus against some school of philosophy, and everyone was putting it behind them and moving on, but then there was a rebellion where the people who still liked that school of philosophy did a hostile takeover of the association’s leadership!) 

Academic culture/norms - no or negative rewards for being more modest or expressing confusion. (Moral uncertainty being sometimes expressed because one can get rewarded by proposing some novel mechanism for dealing with it.)

A non-ethics example that jumps to my mind is David Chalmers on the Hard Problem of Consciousness here: “So if I’m giving my overall credences, I’m going to give, 10% to illusionism, 30% to panpsychism, 30% to dualism, and maybe the other 30% to, I don’t know what else could be true, but maybe there’s something else out there.” That’s the only example I can think of but I read very very little philosophy.

Comment by Steven Byrnes (steve2152) on Limitations on Formal Verification for AI Safety · 2024-08-21T14:18:31.009Z · LW · GW

(I probably agree about formal verification. Instead, I’m arguing the narrow point that I think if someone were to simulate liquid water using just the Standard Model Lagrangian as we know it today, with no adjustable parameters and no approximations, on a magical hypercomputer, then they would calculate a freezing point that agrees with experiment. If that’s not a point you care about, then you can ignore the rest of this comment!)

OK let’s talk about getting from the Standard Model + weak-field GR to the freezing point of water. The weak force just leads to certain radioactive decays—hopefully we’re on the same page that it has well-understood effects that are irrelevant to water. GR just leads to Newton’s Law of Gravity which is also irrelevant to calculating the freezing point of water. Likewise, neutrinos, muons, etc. are all irrelevant to water.

Next, the strong force, quarks, and gluons. That leads to the existence of nuclei, and their specific properties. I’m not an expert, but I believe that the Standard Model via “lattice QCD” predicts the proton mass pretty well, although you need a supercomputer for that. So that’s the hydrogen nucleus. What about the oxygen nucleus? A quick google suggests that simulating an oxygen nucleus with lattice QCD is way beyond what today’s supercomputers can do (it seems like the SOTA is around two nucleons, whereas oxygen has 16). So we need an approximation step, where we say that the soup of quarks and gluons approximately condenses into triplets of quarks (nucleons) that interact by exchanging quark-antiquark pairs (pions). And then we get the nuclear shell model etc. Well anyway, I think there’s very good reason to believe that someone could turn the Standard Model and a hypercomputer into the list of nuclides in agreement with experiment; if you disagree, we can talk about that separately.

OK, so we can encapsulate all those pieces and all that’s left are nuclei, electrons, and photons—a.k.a. quantum electrodynamics (QED). QED is famously perhaps the most stringently tested theory in science, with two VERY different measurements of the fine structure constant agreeing to 1 part in 1e8 (like measuring the distance from Boston to San Francisco using two very different techniques and getting the same answer to within 4 cm—the techniques are probably sound!).
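(Quick arithmetic behind that analogy, using my own ballpark figure of roughly 4,300 km for the Boston-to-San-Francisco distance:)

```python
# Back-of-the-envelope check: 1 part in 1e8 of the Boston-to-San-Francisco
# distance. The ~4,300 km figure is my rough estimate.
distance_m = 4.3e6
relative_agreement = 1e-8
print(distance_m * relative_agreement * 100, "cm")  # ~4 cm
```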

But those are very simple systems; what if QED violations are hiding in particle-particle interactions? Well, you can do spectroscopy of atoms with two electrons and a nucleus (helium or helium-like), and we still get up to parts-per-million agreement with no-adjustable-parameter QED predictions. OK, yes, this says there’s a discrepancy very slightly (1.7×) outside the experimental uncertainty bars, but historically it’s very common for people to underestimate their experimental uncertainty bars by that amount.

But that’s still only two electrons and a nucleus; what about water with zillions of atoms and electrons? Maybe there’s some behavior in there that contradicts QED?

For one thing, it’s hard and probably impossible to just posit some new fundamental physics phenomenon that impacts a large aggregate of atoms without having any measurable effect on precision atomic measurements, particle accelerator measurements, and so on. Almost any fundamental physics phenomenon that you write down would violate some symmetry or other principle that seems to be foundational, or at any rate, that has been tested at even higher accuracy than the above (e.g. the electron charge and proton charge are known to be exact opposites to 1e-21 accuracy, and the vacuum dispersion is zero to 1e-18 accuracy … there are a ton of things like that, and they tend to be screwed up by any fundamental physics phenomenon that is not of a very specific type, namely a term that looks like quantum field theory as we know it today).

For another thing, ab initio molecular simulations exist and do give results compatible with macroscale material properties, which might or might not include the freezing point of water (this seems related but I’m not sure upon a quick google). “Ab initio” means “starting from known fundamental physics principles, with no adjustable parameters”.
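(For concreteness, here's a minimal toy sketch of the time-stepping loop that molecular-dynamics codes, classical or ab initio, are built around. This toy version uses a hand-picked Lennard-Jones pair potential, so it is emphatically not ab initio; in an ab initio code the force routine below would be replaced by a quantum-mechanical calculation with no fitted parameters.)

```python
import numpy as np

# Toy velocity-Verlet loop for two particles with a Lennard-Jones
# interaction, in reduced units (made-up parameters). An "ab initio" MD
# code has the same overall structure, but the hand-tuned lj_force()
# below is replaced by a quantum-mechanical force calculation with no
# adjustable parameters.

def lj_force(r_vec, epsilon=1.0, sigma=1.0):
    """Force on particle 0 due to particle 1, for r_vec = x0 - x1."""
    r = np.linalg.norm(r_vec)
    # dV/dr for V(r) = 4*eps*((sigma/r)**12 - (sigma/r)**6)
    dV_dr = 4 * epsilon * (-12 * sigma**12 / r**13 + 6 * sigma**6 / r**7)
    return -dV_dr * (r_vec / r)

def velocity_verlet(x, v, mass, dt, n_steps):
    """Integrate positions x and velocities v (each of shape (2, 3))."""
    f = lj_force(x[0] - x[1])
    forces = np.array([f, -f])
    for _ in range(n_steps):
        x = x + v * dt + 0.5 * (forces / mass) * dt**2
        f_new = lj_force(x[0] - x[1])
        new_forces = np.array([f_new, -f_new])
        v = v + 0.5 * (forces + new_forces) / mass * dt
        forces = new_forces
    return x, v

x0 = np.array([[0.0, 0.0, 0.0], [1.3, 0.0, 0.0]])
v0 = np.zeros((2, 3))
x_final, v_final = velocity_verlet(x0, v0, mass=1.0, dt=0.005, n_steps=2000)
print(x_final)
```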

Now, I’m sympathetic to the conundrum that you can open up some paper that describes itself as “ab initio”, and OK, if the authors are not outright lying, then we can feel good that there are no adjustable parameters in the source code as such. But surely the authors were making decisions about how to set up various approximations. How sure are we that they weren’t just messing around until they got the right freezing point, IR spectrum, shear strength, or whatever else they were calculating?

I think this is a legitimate hypothesis to consider, and I’m sure it’s true of many individual papers. I’m not sure how to make it legible, but I have worked in molecular dynamics myself, and had extremely smart and scrupulous friends in really good molecular dynamics labs, so I could see how they worked. And I don’t think the concern in the previous paragraph is a correct description of the field. I think there’s a critical mass of good principled researchers who can recognize when people are putting more into the simulations than they get out, and who keep the garbage studies out of textbooks and out of open-source tooling.

I guess one legible piece of evidence is that, for many decades, DFT was the best (and kinda only) approximation scheme that let you calculate semiconductor bandgaps from first principles with reasonable amounts of compute. And DFT famously always gives bandgaps that are too small. Everybody knew that, which means that nobody was massaging their results to get the right bandgap. And it means that whenever people over the decades came up with some special-pleading correction that gave bigger bandgaps, the field as a whole wasn’t buying it. And that’s a good sign! (My impression is that people now have more compute-intensive techniques that are still ab initio and still “principled” but which give better bandgaps.)

Comment by Steven Byrnes (steve2152) on Limitations on Formal Verification for AI Safety · 2024-08-21T02:55:08.592Z · LW · GW

FWIW I’m with Steve O here, e.g. I was recently writing the following footnote in a forthcoming blog post:

“The Standard Model of Particle Physics plus perturbative quantum general relativity” (I wish it were better known and had a catchier name) appears sufficient to explain everything that happens in the solar system. Nobody has ever found any experiment violating it, despite extraordinarily precise tests. This theory can’t explain everything that happens in the universe—in particular, it can’t make any predictions about either (A) microscopic exploding black holes or (B) the Big Bang. Also, (C) the Standard Model happens to include 18 elementary particles (depending on how you count), because those are the ones we’ve discovered; but the theoretical framework is fully compatible with other particles existing too, and indeed there are strong theoretical and astronomical reasons to think they do exist. It’s just that those other particles are irrelevant for anything happening on Earth. Anyway, all signs point to some version of string theory eventually filling in those gaps as a true Theory of Everything. After all, string theories seem to be mathematically well-defined, to be exactly compatible with general relativity, and to have the same mathematical structure as the Standard Model of Particle Physics (i.e., quantum field theory) in the situations where that’s expected. Nobody has found a specific string theory vacuum with exactly the right set of elementary particles and masses and so on to match our universe. And maybe they won’t find that anytime soon—I’m not even sure if they know how to do those calculations! But anyway, there doesn’t seem to be any deep impenetrable mystery between us and a physics Theory of Everything.

(I interpret your statement to be about everyday experiences which depend on something being incomplete / wrong in fundamental physics as we know it, as opposed to just saying the obvious fact that we don’t understand all the emergent consequences of fundamental physics as we know it.)

I also think “we basically have no ability to model any high-level phenomena using quantum field theory” is misleading. It’s true that we can’t directly use the Standard Model Lagrangian to simulate a transistor. But we do know how and why and to what extent quantum field theory reduces to normal quantum mechanics and quantum chemistry (to such-and-such accuracy in such-and-such situations), and we know how those in turn approximately reduce to fluid dynamics and solid mechanics and classical electromagnetism and so on (to such-and-such accuracy in such-and-such situations), and now we’re all the way at the normal set of tools that physicists / chemists / engineers actually use to model high-level phenomena. You’re obviously losing fidelity at each step of simplification, but you’re generally losing fidelity in a legible way—you’re making specific approximations, and you know what you’re leaving out and why omitting it is appropriate in this situation, and you can do an incrementally more accurate calculation if you need to double-check. Do you see what I mean?

By (loose) analogy, someone could say “we don’t know for sure that intermolecular gravitational interactions are irrelevant for the freezing point of water, because nobody has ever included intermolecular gravitational interactions in a molecular dynamics calculation”. But the reason nobody has ever included them in a calculation is because we know for sure that they’re infinitesimal and irrelevant. Likewise, a lot of the complexity of QFT is infinitesimal and irrelevant in any particular situation of interest.
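(Rough numbers, in case anyone wants to see just how infinitesimal; the specific figures below are my own ballpark estimates:)

```python
# Gravitational attraction between two neighboring water molecules,
# compared to a typical hydrogen-bond energy. All figures are rough
# ballpark estimates.

G = 6.674e-11             # gravitational constant, m^3 kg^-1 s^-2
m_water = 18 * 1.66e-27   # kg, mass of one H2O molecule
r = 3e-10                 # m, roughly a nearest-neighbor spacing in liquid water

grav_energy = G * m_water**2 / r    # ~2e-52 J
hbond_energy = 0.2 * 1.6e-19        # ~3e-20 J (about 0.2 eV)

print(grav_energy / hbond_energy)   # ~1e-32 -- utterly negligible
```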

Comment by Steven Byrnes (steve2152) on dddiiiiitto's Shortform · 2024-08-20T00:59:06.774Z · LW · GW

I haven’t looked at that report in particular, but I VERY quickly looked into fluoride 6 months ago for my own decision-making purposes, and I wound up feeling like (1) a bunch of the studies are confounded by the fact that polluted areas have more fluoride, and people with more income / education / etc. [which are IQ correlates] are better at avoiding living in polluted areas and drinking the water, (2) getting fluoride out of my tap water is sufficiently annoying / weird that I don’t immediately want to bother in the absence of stronger beliefs (e.g. normal activated carbon filters don’t get the fluoride out), (3) I should brush with normal toothpaste then rinse with water, then use fluoride mouthwash right before bed (and NOT rinse with water afterwards, but do try extra hard to spit out as much of it as possible), (4) use fluoride-free toothpaste for the kids until they’re good at spitting it out (we were already doing this, I think it’s standard practice), but then switch.

I’m very open to (1) being wrong and any of (2-4) being the wrong call. FWIW, where I live, the tap water is 0.7mg/L.

Comment by Steven Byrnes (steve2152) on Self-Other Overlap: A Neglected Approach to AI Alignment · 2024-08-15T12:35:07.957Z · LW · GW

Sure, but the way it's described, it sounds like there's one adjustable parameter in the source code. If the setup allows for thousands of independently-adjustable parameters in the source code, that seems potentially useful but I'd want to know more details.

Comment by Steven Byrnes (steve2152) on Ten counter-arguments that AI is (not) an existential risk (for now) · 2024-08-14T01:05:43.435Z · LW · GW

I think it's unlikely we get there in the foreseeable future, with the current paradigms

It would be nice if you could define “foreseeable future”. 3 years? 10 years? 30? 100? 1000? What?

And I’m not sure why “with the current paradigms” is in that sentence. The post you’re responding to is “Ten arguments that AI is an existential risk”, not “Ten arguments that Multimodal Large Language Models are an existential risk”, right?

If your assumption is that “the current paradigms” will remain the current paradigms for the “foreseeable future”, then you should say that, and explain why you think so. It seems to me that the paradigm in AI has had quite a bit of change in the last 6 years (i.e. since 2018, before GPT-2, i.e. a time when few had heard of LLMs), and has had complete wrenching change in the last 20 years (i.e. since 2004, many years before AlexNet, and a time when deep learning as a whole was still an obscure backwater, if I understand correctly). So by the same token, it’s plausible that the field of AI might have quite a bit of change in the next 6 years, and complete wrenching change in the next 20 years, right?

Comment by Steven Byrnes (steve2152) on Leaving MIRI, Seeking Funding · 2024-08-08T20:27:34.257Z · LW · GW

I just signed up for the Patreon and encourage others to do the same! Abram has done a lot of good work over the years—I’ve learned a lot of important things, things that affect my own research and thinking about AI alignment, by reading his writing.

Comment by Steven Byrnes (steve2152) on steve2152's Shortform · 2024-08-08T13:32:24.560Z · LW · GW

I just made a wording change from:

Normies like me have an intuitive mental concept “me” which is simultaneously BOTH (A) me-the-human-body-etc AND (B) me-the-soul / consciousness / wellspring of vitalistic force / what Dan Dennett calls a “homunculus” / whatever.

to:

Normies like me (Steve) have an intuitive mental concept “Steve” which is simultaneously BOTH (A) Steve-the-human-body-etc AND (B) Steve-the-soul / consciousness / wellspring of vitalistic force / what Dan Dennett calls a “homunculus” / whatever.

I think that’s closer to what I was trying to get across. Does that edit change anything in your response?

At least the 'me-the-human-body' part of the concept. I don't know what the '-etc' part refers to.

The “etc” would include things like the tendency for fingers to reactively withdraw from touching a hot surface.

Elaborating a bit: In my own (physicalist, illusionist) ontology, there’s a body with a nervous system including the brain, and the whole mental world including consciousness / awareness is inextricably part of that package. But in other people’s ontology, as I understand it, some nervous system activities / properties (e.g. a finger reactively withdrawing from pain, maybe some or all other desires and aversions) get lumped in with the body, whereas other [things that I happen to believe are] nervous system activities / properties (e.g. awareness) get peeled off into (B). So I said “etc” to include all the former stuff. Hopefully that’s clear.

(I’m trying hard not to get sidetracked into an argument about the true nature of consciousness—I’m stating my ontology without defending it.)

Comment by Steven Byrnes (steve2152) on steve2152's Shortform · 2024-08-08T02:09:29.332Z · LW · GW

Many helpful replies! Here’s where I’m at right now (feel free to push back!) [I’m coming from an atheist-physicalist perspective; this will bounce off everyone else.]

Hypothesis:

Normies like me (Steve) have an intuitive mental concept “Steve” which is simultaneously BOTH (A) Steve-the-human-body-etc AND (B) Steve-the-soul / consciousness / wellspring of vitalistic force / what Dan Dennett calls a “homunculus” / whatever.

The (A) & (B) “Steve” concepts are the same concept in normies like me, or at least deeply tangled together. So it’s hard to entertain the possibility of them coming apart, or to think through the consequences if they do.

Some people can get into a Mental State S (call it a form of “enlightenment”, or pick your favorite terminology) where their intuitive concept-space around (B) radically changes—it broadens, or disappears, or whatever. But for them, the (A) mental concept still exists and indeed doesn’t change much.

Anyway, people often have thoughts that connect sense-of-self to motivation, like “not wanting to be embarrassed” or “wanting to keep my promises”. My central claim is that the relevant sense-of-self involved in that motivation is (A), not (B).

If we conflate (A) & (B)—as normies like me are intuitively inclined to do—then we get the intuition that a radical change in (B) must have radical impacts on behavior. But that’s wrong—the (A) concept is still there and largely unchanged even in Mental State S, and it’s (A), not (B), that plays a role in those behaviorally-important everyday thoughts like “not wanting to be embarrassed” or “wanting to keep my promises”. So radical changes in (B) would not (directly) have the radical behavioral effects that one might intuitively expect (although they do of course have more than zero behavioral effect, with self-reports being an obvious example).

End of hypothesis. Again, feel free to push back!

Comment by Steven Byrnes (steve2152) on steve2152's Shortform · 2024-08-06T17:11:14.965Z · LW · GW

I’m intrigued by the reports (including but not limited to the Martin 2020 “PNSE” paper) that people can “become enlightened” and have a radically different sense of self, agency, etc.; but friends and family don’t notice them behaving radically differently, or even differently at all. I’m trying to find sources on whether this is true, and if so, what’s the deal. I’m especially interested in behaviors that (naïvely) seem to centrally involve one’s self-image, such as “applying willpower” or “wanting to impress someone”. Specifically, if there’s a person whose sense-of-self has dissolved / merged into the universe / whatever, and they nevertheless enact behaviors that onlookers would conventionally put into one of those two categories, then how would that person describe / conceptualize those behaviors and why they occurred? (Or would they deny the premise that they are still exhibiting those behaviors?) Interested in any references or thoughts, or email / DM me if you prefer. Thanks in advance!

(Edited to add: Ideally someone would reply: “Yeah I have no sense of self, and also I regularly do things that onlookers describe as ‘applying willpower’ and/or ‘trying to impress someone’. And when that happens, I notice the following sequence of thoughts arising: [insert detailed description]”.)

[also posted on twitter where it got a bunch of replies including one by Aella.]

Comment by Steven Byrnes (steve2152) on Circular Reasoning · 2024-08-06T02:07:45.518Z · LW · GW

I edited to clarify that “reason to believe that they’re correct, other things equal” and “reason to take seriously” are meant in the sense of “a pro tanto reason”, not “an incontrovertible proof”. Sorry, I thought that was obvious. (Note that it was also explained in the OP.)

To give some examples:

  • If you ask a crackpot physicist and a real physicist to each define 10 electromagnetism-related terms and then make 20 substantive claims using those terms, I would bet that the crackpot has a higher probability of saying multiple things that contradict each other. (Not a 100% probability, just higher.)
    • …and if the crackpot physicist really said only things that hung together perfectly and self-consistently, including after follow-up questions, then I would start to entertain possibilities like “maybe they’re describing true things but starting from idiosyncratic nonstandard definitions?” or “maybe they’re consistently describing a certain approximation to electromagnetism?” etc.
  • Likewise, I would bet on myself over a biblical literalist to be able to make lots of complex claims about the nature of the universe, and humanity, etc., including follow-up questions, in a way that hangs together without anything being internally inconsistent.

Comment by Steven Byrnes (steve2152) on Circular Reasoning · 2024-08-05T20:50:36.991Z · LW · GW

Yeah I agree. If someone has a bunch of beliefs, and they all hang together in a self-consistent way, that’s a reason to believe that they’re correct, other things equal. (UPDATE: I’m saying it’s a pro tanto reason—obviously it’s not a proof of correctness!)

This applies to both little things—if you offer a bunch of claims and definitions about how capacitors work, and they all hang together in a self-consistent way, that’s a reason to take those claims and definitions seriously—and big things—if you offer a whole worldview, including very basic things like how do we interpret observations and make predictions and what’s the nature of reality etc., and everything in it hangs together in a self-consistent way, then that’s a reason to take that worldview seriously.

Comment by Steven Byrnes (steve2152) on A Simple Toy Coherence Theorem · 2024-08-04T17:53:55.403Z · LW · GW

It's not clear to me that specifying "preferences over future states" actually restricts things much - if I have some preferences over the path I take through lotteries, then whether I take path A or path B to reach outcome X will show up as some difference in the final state, so it feels like we can cast a lot (Most? All?) types of preferences as "preferences over future states".

In terms of the OP toy model, I think the OP omitted another condition under which the coherence theorem is trivial / doesn’t apply, which is that you always start the MDP in the same place and the MDP graph is a directed tree or directed forest. (i.e., there are no cycles even if you ignore the arrow-heads … I hope I’m getting the graph theory terminology right). In those cases, for any possible end-state, there’s at most one way to get from the start to the end-state; and conversely, for any possible path through the MDP, that’s the path that would result from wanting to get to that end-state. Therefore, you can rationalize any path through the MDP as the optimal way to get to whatever end-state it actually gets to. Right? (cc @johnswentworth @David Lorell )
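(Here's a toy illustration of that tree condition, using a made-up six-state MDP rather than the OP's exact setup: each end-state is reached by exactly one path, so any observed path is vacuously "the optimal way to reach the end-state it actually reached".)

```python
from collections import defaultdict

# A made-up tree-shaped MDP: children[state] lists the states reachable
# in one step. Terminal states have no children.
children = {
    "start": ["A", "B"],
    "A": ["A1", "A2"],
    "B": ["B1"],
    "A1": [], "A2": [], "B1": [],
}

def all_paths(state, path=("start",)):
    """Enumerate every start-to-terminal path in the tree."""
    if not children[state]:
        yield path
        return
    for nxt in children[state]:
        yield from all_paths(nxt, path + (nxt,))

paths_to_end_state = defaultdict(list)
for p in all_paths("start"):
    paths_to_end_state[p[-1]].append(p)

# Each end-state has exactly one incoming path, so any behavior can be
# "rationalized" as optimal pursuit of whatever end-state it reached.
for end, ps in paths_to_end_state.items():
    assert len(ps) == 1
    print(end, "<-", " -> ".join(ps[0]))
```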

OK, so what about the real world? The laws of physics are unitary, so it is technically true that if I have some non-distant-future-related preferences (e.g. “I prefer to never tell a lie”, “I prefer to never use my pinky finger”, etc.), this preference can be cast as some inscrutably complicated preference about the state of the world on January 1 2050, assuming omniscient knowledge of the state of the world right now and infinite computational power. For example, “a preference to never use my pinky finger starting right now” might be equivalent to something kinda like “On January 1 2050, IF {air molecule 9834705982347598 has speed between 34.2894583000000 and 34.2894583000001 AND air molecule 8934637823747621 has … [etc. for a googolplex more lines of text]”

This is kind of an irrelevant technicality, I think. The real-world MDP is in fact full of (undirected) cycles—i.e., different ways to get to the same endpoint—as far as anyone can measure. For example, let’s say that I care about the state of a history ledger on January 1 2050. Then it’s possible for me to do whatever for 25 years … and then hack into the ledger and change it!

However, if the history ledger is completely unbreachable (haha), then I think we should say that this isn’t really a preference about the state of the world in the distant future, but rather an implementation method for making an agent with preferences about trajectories.

Comment by Steven Byrnes (steve2152) on A Simple Toy Coherence Theorem · 2024-08-04T15:41:45.805Z · LW · GW

"Utility maximisers are scary, and here are some theorems that show that anything sufficiently smart/rational (i.e. a superintelligence) will be a utility maximiser. That's scary"

I would say "systems that act according to preferences about the state of the world in the distant future are scary", and then that can hopefully lead to a productive and substantive discussion about whether people are likely to build such systems. (See e.g. here where I argue that someone is being too pessimistic about that, & section 1 here where I argue that someone else is being too optimistic.)

Comment by Steven Byrnes (steve2152) on Elizabeth's Shortform · 2024-08-03T23:42:47.123Z · LW · GW

My post “The “mind-body vicious cycle” model of RSI & back pain” describes another (alleged) example of Bad Equilibrium Disease.

Comment by Steven Byrnes (steve2152) on Martín Soto's Shortform · 2024-08-03T23:10:29.534Z · LW · GW

FWIW, I was just arguing here & here that I find it plausible that a near-future AI could pass a 2-hour Turing test while still being a paradigm-shift away from passing a 100-hour Turing test (or from being AGI / human-level intelligence in the relevant sense).

Comment by Steven Byrnes (steve2152) on Self-Other Overlap: A Neglected Approach to AI Alignment · 2024-08-01T14:14:09.835Z · LW · GW

To add onto this comment, let’s say there’s a self-other-overlap dial—e.g. a multiplier on the KL divergence or whatever.

  • When the dial is all the way at the max setting, you get high safety and terribly low capabilities. The AI can’t explain things to people because it assumes they already know everything the AI knows. The AI can't conceptualize the idea that if Jerome is going to file the permits, then the AI should not itself also file the same permits. The AI wants to eat food, or else the AI assumes that Jerome does not want to eat food. The AI thinks it has arms, or else thinks that Jerome doesn’t. Etc.
  • When the dial is all the way at the zero setting, it’s not doing anything—there’s no self-other overlap penalty term in the training. No safety, but also no capabilities tax.

So as you move the dial from zero to max, at some point (A) the supposed safety benefits start arising, and also at some point (B) the capabilities issues start arising. I think the OP is assuming without argument that (A) happens first and (B) happens second. If it’s the other way around—(B) happens first, and (A) happens much later, as you gradually crank up the dial—then it’s not a problem you can solve with “minimal self-other distinction while maintaining performance”; instead the whole approach is doomed, right?

I think a simple intuition that (B) would happen before (A) is just that the very basic idea that “different people (and AIs) have different beliefs”, e.g. passing the Sally-Anne test, is already enough to open the door to AI deception, but also a very basic requirement for capabilities, one would think.
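(To make the dial concrete, here's a minimal sketch of the kind of thing I have in mind. The specific loss form and names below are my own guesses for illustration, not the authors' actual implementation.)

```python
import torch
import torch.nn.functional as F

# Hypothetical "self-other overlap" training loss: task loss plus a
# tunable coefficient ("the dial") times a KL-divergence penalty pushing
# the model's self-referential and other-referential activation
# distributions together. dial = 0 recovers ordinary training.

def self_other_overlap_loss(task_loss, self_acts, other_acts, dial):
    log_p_self = F.log_softmax(self_acts, dim=-1)
    p_other = F.softmax(other_acts, dim=-1)
    overlap_penalty = F.kl_div(log_p_self, p_other, reduction="batchmean")
    return task_loss + dial * overlap_penalty

# Toy usage: sweep the dial from 0 (no overlap pressure) upward.
self_acts = torch.randn(8, 16)
other_acts = torch.randn(8, 16)
task_loss = torch.tensor(1.0)
for dial in [0.0, 0.1, 1.0, 10.0]:
    print(dial, self_other_overlap_loss(task_loss, self_acts, other_acts, dial).item())
```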

Comment by Steven Byrnes (steve2152) on Self-Other Overlap: A Neglected Approach to AI Alignment · 2024-08-01T13:40:43.464Z · LW · GW

I don’t think that’s a good response to Charlie’s complaint because you’re kinda talking about a different thing.

  • What you’re talking about is: maybe the AI can have a sense-of-self that also encompasses another person (call him Jerome), analogous to how I have a sense-of-self that also encompasses my foot.
  • What OP is talking about is: maybe the AI can be unable (or barely able) to conceptualize the idea that it has one set of beliefs / desires / etc. and Jerome has a different set of beliefs / desires / etc., analogous to how humans have a hard time remembering that their beliefs were different in the past than they are now (hence hindsight bias), or in interpersonal contexts, how people sometimes suffer the illusion of transparency.

The first thing doesn’t make deception impossible, which is the selling point of the OP. For example, enlightened Buddhists supposedly feel “no self” and at one with the universe (or whatever), but they are still obviously capable of understanding that different people have different beliefs. (Otherwise they wouldn’t write books, because they would assume that everyone would already know what they know! Indeed, they would falsely believe that everyone is already enlightened, because they themselves are!)

Or for your example: I consider my foot part of myself, but I am very capable of conceptualizing the idea that I know calculus but my foot does not know calculus. I’m not sure that it’s meaningful to “deceive my own foot”, any more than I can “deceive” a rock, but for what it’s worth I can certainly put a topical anesthetic on my foot and then my foot will fail to transmit pain signals in the circumstances where it normally would, which is maybe very vaguely analogous to deceiving my foot about what it’s sensing.

Comment by Steven Byrnes (steve2152) on Decomposing Agency — capabilities without desires · 2024-08-01T11:58:16.111Z · LW · GW

It seems pretty obvious to me that (1) if a species of bacteria lives in an extremely uniform / homogeneous / stable external environment, it will eventually evolve to not have any machinery capable of observing and learning about its external environment; and (2) such a bacterium would still be doing lots of complex homeostasis stuff, reproduction, etc., such that it would be pretty weird to say that these bacteria have fallen outside the scope of Active Inference theory. (I.e., my impression was that the foundational assumptions / axioms of the Free Energy Principle / Active Inference were basically just homeostasis and bodily integrity, and this hypothetical bacterium would still have both of those things.) (Disclosure: I’m an Active Inference skeptic.)

Comment by Steven Byrnes (steve2152) on Against AI As An Existential Risk · 2024-07-31T02:55:48.301Z · LW · GW

For what it’s worth, I find that you are equivocating in a strange way between endorsing and not endorsing these arguments.

On the one hand, here in this post you called them “the best arguments” and said “tell me why I’m wrong”, which sounds a lot like an endorsement. And your post title also sounds an awful lot like an endorsement.

On the other hand, in the substack text, you say at the top that you don’t have an opinion, and you state objections without stating in your own voice that you think the objections are any good. For example, “Yann LeCun argues that the need to dominate is purely a social phenomena that does not develop because of intelligence.” Well, yes, that is true, Yann LeCun does say that. But do you think it’s a good argument? If so, you should say that! (I sure don't!—See e.g. here.)

I think you should pick one or the other rather than equivocating. If you really don’t know where you stand, then you should retitle your post etc. Or if you find some of the arguments compelling, you should say that.

Comment by Steven Byrnes (steve2152) on Against AI As An Existential Risk · 2024-07-31T02:43:20.526Z · LW · GW

+1 to one of the things Charlie said in his comment, but I’d go even further:

The proposition “The current neural architecture paradigm can scale up to Artificial General Intelligence (AGI) (especially without great breakthroughs)” is not only unnecessary for the proposition “AI is an extinction threat” to be true, it’s not even clear that it’s evidence for the proposition “AI is an extinction threat”! One could make a decent case that it’s evidence against “AI is an extinction threat”! That argument would look like: “we’re gonna make AGI sooner or later, and LLMs are less dangerous than alternative AI algorithms for the following reasons …”.

As an example, Yann LeCun thinks AGI will be a different algorithm rather than LLMs, and here’s my argument that the AGI algorithm LeCun expects is actually super-dangerous. (LeCun prefers a different term to “AGI” but he’s talking about the same thing.)

I’m trying to figure out where you were coming from when you brought up “The current neural architecture paradigm can scale up to Artificial General Intelligence (AGI) (especially without great breakthroughs)” as a necessary part of the argument.

One possibility is, you’re actually interested in the question of whether transformer-architecture self-supervised (etc.) AI is an extinction threat or not. If so, that’s a weirdly specific question, right? If it’s not an extinction threat, but a different AI algorithm is, that would sure be worth mentioning, right? But fine. If you’re interested in that narrow question, then I think your post should have been titled “against transformer-architecture self-supervised (etc) AI as an extinction threat” right? Related: my post here.

Another possibility is, you think that the only two options are, either (1) the current paradigm scales to AGI, or (2) AGI is impossible or centuries away. If so, I don’t know why you would think that. For example, Yann LeCun and François Chollet are both skeptical of LLMs, but separately they both think AGI (based on a non-LLM algorithm) is pretty likely in the next 20 years (source for Chollet). I’m more or less in that camp too. See also my brief comment here.

Comment by Steven Byrnes (steve2152) on Linda Linsefors's Shortform · 2024-07-30T19:41:41.566Z · LW · GW

On receiving compliments

I always reply “That’s very kind of you to say.” Especially for compliments that I disagree with but don’t want to get into an argument about. I think it expresses nice positive vibes without actually endorsing the compliment as true.

On paying forward instead of paying back

A good mission-aligned team might be another example? In sports, if I pass you the ball and you score a goal, that’s not a “favor” because we both wanted the goal to be scored. (Ideally.) Or if we’re both at a company, and we’re both passionate about the mission, and your computer breaks and I fix it, that’s not necessarily “a favor” because I want your computer to work because it’s good for the project and I care about that. (Ideally.) Maybe you’re seeing some EAs feel that kind of attitude?

Comment by Steven Byrnes (steve2152) on AI existential risk probabilities are too unreliable to inform policy · 2024-07-29T21:08:31.397Z · LW · GW

My brief complaints about that article (from twitter here):

My complaints about that essay would be: (1) talking about factors that might bias people’s p(doom) too high but not mentioning factors that might bias people’s p(doom) too low; (2) implicitly treating “p(doom) is unknowable” as evidence for “p(doom) is very low”; (3) dismissing the possibility of object-level arguments. E.g. for (2), they say “govts should adopt policies that are compatible with a range of possible estimates of AI risk, and are on balance helpful even if the risk is negligible”. Why not “…even if the risk is high”? I agree that the essay has many good parts, and stands head-and-shoulders above much of the drivel that comprises the current discourse 😛

(…and then downthread there’s more elaboration on (2).)

Comment by Steven Byrnes (steve2152) on Koan: divining alien datastructures from RAM activations · 2024-07-23T16:39:15.931Z · LW · GW

I don’t think we disagree much if at all.

I think constructing a good theoretical framework is very hard, so people often do other things instead, and I think you’re using the word “legible” to point to some of those other things.

  • I’m emphasizing that those other things are less than completely useless as semi-processed ingredients that can go into the activity of “constructing a good theoretical framework”
  • You’re emphasizing that those other things are not themselves the activity of “constructing a good theoretical framework”, and thus can distract from that activity, or give people a false sense of how much progress they’re making.

I think those are both true.

The pre-Darwin ecologists were not constructing a good theoretical framework. But they still made Darwin’s job easier, by extracting slightly-deeper patterns for him to explain with his much-deeper theory—concepts like “species” and “tree of life” and “life cycles” and “reproduction” etc. Those concepts were generally described by the wrong underlying gears before Darwin, but they were still contributions, in the sense that they compressed a lot of surface-level observations (Bird A is mating with Bird B, and then Bird B lays eggs, etc.) into a smaller number of things-to-be-explained. I think Darwin would have had a much tougher time if he was starting without the concepts of “finch”, “species”, “parents”, and so on.

By the same token, if we’re gonna use language as a datapoint for building a good underlying theoretical framework for the deep structure of knowledge and ideas, it’s hard to do that if we start from slightly-deep linguistic patterns (e.g. “morphosyntax”, “sister schemas”)… But it’s very much harder still to do that if we start with a mass of unstructured surface-level observations, like particular utterances.

I guess your perspective (based on here) is that, for the kinds of things you’re thinking about, people have not been successful even at the easy task of compressing a lot of surface-level observations into a smaller number of slightly-deeper patterns, let alone successful at the much harder task of coming up with a theoretical framework that can deeply explain those slightly-deeper patterns? And thus you want to wholesale jettison all the previous theorizing? On priors, I think that would be kinda odd. But maybe I’m overstating your radicalism. :)

Comment by Steven Byrnes (steve2152) on Koan: divining alien datastructures from RAM activations · 2024-07-23T14:06:20.077Z · LW · GW

Thanks!

One thing I would say is: if you have a (correct) theoretical framework, it should straightforwardly illuminate tons of diverse phenomena, but it’s very much harder to go backwards from the “tons of diverse phenomena” to the theoretical framework. E.g. any competent scientist who understands Evolution can apply it to explain patterns in finch beaks, but it took Charles Darwin to look at patterns in finch beaks and come up with the idea of Evolution.

Or in my own case, for example, I spent a day in 2021 looking into schizophrenia, but I didn’t know what to make of it, so I gave up. Then I tried again for a day in 2022, with a better theoretical framework under my belt, and this time I found that it slotted right into my then-current theoretical framework. And at the end of that day, I not only felt like I understood schizophrenia much better, but also my theoretical framework itself came out more enriched and detailed. And I iterated again in 2023, again simultaneously improving my understanding of schizophrenia and enriching my theoretical framework.

Anyway, if the “tons of diverse phenomena” are datapoints, and we’re in the middle of trying to come up with a theoretical framework that can hopefully illuminate all those datapoints, then clearly some of those datapoints are more useful than others (as brainstorming aids for developing the underlying theoretical framework), at any particular point in this process. The “schizophrenia” datapoint was totally unhelpful to me in 2021, but helpful to me in 2022. The “precession of Mercury” datapoint would not have helped Einstein when he was first brainstorming general relativity in 1907, but was presumably moderately helpful when he was thinking through the consequences of his prototype theory a few years later.

The particular phenomena / datapoints that are most useful for brainstorming the underlying theory (privileging the hypothesis), at any given point in the process, need not be the most famous and well-studied phenomena / datapoints. Einstein wrung much more insight out of the random-seeming datapoint “a uniform gravity field seems an awful lot like uniform acceleration” than out of any of the datapoints that would have been salient to a lesser gravity physicist, e.g. Newton’s laws or the shape of the galaxy or the Mercury precession. In my own case, there are random experimental neuroscience results (or everyday observations) that I see as profoundly revealing of deep truths, but which would not be particularly central or important from the perspective of other theoretical neuroscientists.

But, I don’t see why “legible phenomena” datapoints would be systematically worse than other datapoints. (Unless of course you’re also reading and internalizing crappy literature theorizing about those phenomena, and it’s filling your mind with garbage ideas that get in the way of constructing a better theory.) For example, the phenomenon “If I feel cold, then I might walk upstairs and put on a sweater” is “legible”, right? But if someone is in the very early stages of developing a theoretical framework related to goals and motivations, then they sure need to have examples like that in the front of their minds, right? (Or maybe you wouldn't call that example “legible”?)

Comment by Steven Byrnes (steve2152) on Koan: divining alien datastructures from RAM activations · 2024-07-22T15:59:55.322Z · LW · GW

Can you elaborate on why you think “studying the algorithms involved in grammatically parsing a sentence” is not “a good way to get at the core of how minds work”?

For my part, I’ve read a decent amount of pure linguistics (in addition to neuro-linguistics) over the past few years, and find it to be a fruitful source of intuitions and hypotheses that generalize way beyond language. (But I’m probably asking different questions than you.)

I wonder if you’re thinking of, like, the nuts-and-bolts of syntax of specific languages, whereas I’m thinking of broader / deeper theorizing (random example), maybe?

Comment by Steven Byrnes (steve2152) on Eli's shortform feed · 2024-07-22T14:31:42.778Z · LW · GW

In Section 1 of this post I make an argument kinda similar to the one you’re attributing to Eliezer. That might or might not help you, I dunno, just wanted to share.

Comment by Steven Byrnes (steve2152) on Towards more cooperative AI safety strategies · 2024-07-21T13:57:37.835Z · LW · GW

the goal remains to implement CEV or something like it, and optimize the universe according to the resulting utility function

I think you mean “the goal remains to ensure that CEV or something like it is eventually implemented, and the universe is thus optimized according to the resulting utility function”, right? I think Eliezer’s view has always been that we want a CEV-maximizing ASI to be eventually turned on, but if that happens, it wouldn’t matter which human turns it on. And then evidently Eliezer has pivoted over the decades from thinking that this is likeliest to happen if he tries to build such an ASI with his own hands, to no longer thinking that.

Comment by Steven Byrnes (steve2152) on What are the actual arguments in favor of computationalism as a theory of identity? · 2024-07-19T02:38:42.649Z · LW · GW

A starting point is self-reports. If I truthfully say “I see my wristwatch”, then, somewhere in the chain of causation that eventually led to me uttering those words, there’s an actual watch, and photons are bouncing off it and entering my eyes then stimulating neurons etc.

So by the same token, if I say “your phenomenal consciousness is a salty yellow substance that smells like bananas and oozes out of your bellybutton”, and then you reply “no it isn’t!”, then let’s talk about how it is that you are so confident about that.

(I’m using “phenomenal consciousness” as an example, but ditto for “my sense of self / identity” or whatever else.)

So here, you uttered a reply (“No it isn’t!”). And we can assume that somewhere in the chain of causation is ‘phenomenal consciousness’ (whatever that is, if anything), and you were somehow introspecting upon it in order to get that information. You can’t know things in any other way—that’s the basic, hopefully-obvious point that I understand Eliezer was trying to make here.

Now, what’s a ‘chain of causation’, in the relevant sense? Let’s start with a passage from Age of Em:

The brain does not just happen to transform input signals into state changes and output signals; this transformation is the primary function of the brain, both to us and to the evolutionary processes that designed brains. The brain is designed to make this signal processing robust and efficient. Because of this, we expect the physical variables (technically, “degrees of freedom”) within the brain that encode signals and signal-relevant states, which transform these signals and states, and which transmit them elsewhere, to be overall rather physically isolated and disconnected from the other far more numerous unrelated physical degrees of freedom and processes in the brain. That is, changes in other aspects of the brain only rarely influence key brain parts that encode mental states and signals.

In other words, if your body temperature had been 0.1° colder, or if you were hanging upside down, or whatever, then the atoms in your brain would be configured differently in all kinds of ways … but you would still say “no it isn’t!” in response to my proposal that maybe your phenomenal consciousness is a salty yellow substance that oozes out of your bellybutton. And you would say it for the exact same reason.

This kind of thinking leads to the more general idea that the brain has inputs (e.g. photoreceptor cells), outputs (e.g. motoneurons … also, fun fact, the brain is a gland!), and algorithms connecting them. Those algorithms describe what Hanson’s “degrees of freedom” are doing from moment to moment, and why, and how. Whenever brains systematically do characteristically-brain-ish things—things like uttering grammatical sentences rather than moving mouth muscles randomly—then the explanation of that systematic pattern lies in the brain’s inputs, outputs, and/or algorithms. Yes, there’s randomness in what brains do, but whenever brains do characteristically-brainy-things reliably (e.g. disbelieve, and verbally deny, that your consciousness is a salty yellow substance that oozes out of your bellybutton), those things are evidently not the result of random fluctuations or whatever, but rather they follow from the properties of the algorithms and/or their inputs and outputs.

That doesn’t quite get us all the way to computationalist theories of consciousness or identity. Why not? Well, here are two ways I can think of to be non-computationalist within physicalism:

  • One could argue that consciousness & sense-of-identity etc. are just confused nonsense reifications of mental models with no referents at all, akin to “pure white” [because white is not pure, it’s a mix of wavelengths]. (Cf. “illusionism”.) I’m very sympathetic to this kind of view. And you could reasonably say “it’s not a computationalist theory of consciousness / identity, but rather a rejection of consciousness / identity altogether!” But I dunno, I think it’s still kinda computationalist in spirit, in the sense that one would presumably instead make the move of choosing to (re)define ‘consciousness’ and ‘sense-of-identity’ in such a way that those words point to things that actually exist at all (which is good), at the expense of being inconsistent with some of our intuitions about what those words are supposed to represent (which is bad). And when you make that move, those terms almost inevitably wind up pointing towards some aspect(s) of brain algorithms.
  • One could argue that we learn about consciousness & sense-of-identity via inputs to the brain algorithm rather than inherent properties of the algorithm itself—basically the idea that “I self-report about my phenomenal consciousness analogously to how I self-report about my wristwatch”, i.e. my brain perceives my consciousness & identity through some kind of sensory input channel, and maybe also my brain controls my consciousness & identity through some kind of motor or other output channel. If you believe something like that, then you could be physicalist but not a computationalist, I think. But I can’t think of any way to flesh out such a theory that’s remotely plausible.

I’m not a philosopher and am probably misusing technical terms in various ways. (If so, I’m open to corrections!)

(Note, I find these kinds of conversations to be very time-consuming and often not go anywhere, so I’ll read replies but am pretty unlikely to comment further. I hope this is helpful at all. I mostly didn’t read the previous conversation, so I’m sorry if I’m missing the point, answering the wrong question, etc.)

Comment by Steven Byrnes (steve2152) on steve2152's Shortform · 2024-07-18T16:23:36.506Z · LW · GW

I went through and updated my 2022 “Intro to Brain-Like AGI Safety” series. If you already read it, no need to do so again, but in case you’re curious for details, I put changelogs at the bottom of each post. For a shorter summary of major changes, see this twitter thread, which I copy below (without the screenshots & links): 

I’ve learned a few things since writing “Intro to Brain-Like AGI safety” in 2022, so I went through and updated it! Each post has a changelog at the bottom if you’re curious. Most changes were in one of the following categories: (1/7)

REDISTRICTING! As I previously posted ↓, I booted the pallidum out of the “Learning Subsystem”. Now it’s the cortex, striatum, & cerebellum (defined expansively, including amygdala, hippocampus, lateral septum, etc.) (2/7)

LINKS! I wrote 60 posts since first finishing that series. Many of them elaborate and clarify things I hinted at in the series. So I tried to put in links where they seemed helpful. For example, I now link my “Valence” series in a bunch of places. (3/7)

NEUROSCIENCE! I corrected or deleted a bunch of speculative neuro hypotheses that turned out wrong. In some early cases, I can’t even remember wtf I was ever even thinking! Just for fun, here’s the evolution of one of my main diagrams since 2021: (4/7)

EXAMPLES! It never hurts to have more examples! So I added a few more. I also switched the main running example of Post 13 from “envy” to “drive to be liked / admired”, partly because I’m no longer even sure envy is related to social instincts at all (oops) (5/7)

LLMs! … …Just kidding! LLMania has exploded since 2022 but remains basically irrelevant to this series. I hope this series is enjoyed by some of the six remaining AI researchers on Earth who don’t work on LLMs. (I did mention LLMs in a few more places though ↓ ) (6/7)

If you’ve already read the series, no need to do so again, but I want to keep it up-to-date for new readers. Again, see the changelogs at the bottom of each post for details. I’m sure I missed things (and introduced new errors)—let me know if you see any!

Comment by Steven Byrnes (steve2152) on A simple case for extreme inner misalignment · 2024-07-14T12:35:13.510Z · LW · GW

This doesn't sound like an argument Yudkowsky would make

Yeah, I can’t immediately find the link but I recall that Eliezer had a tweet in the past few months along the lines of: If ASI wants to tile the universe with one thing, then it wipes out humanity. If ASI wants to tile the universe with sixteen things, then it also wipes out humanity.

My mental-model-of-Yudkowsky would bring up “tiny molecular squiggles” in particular for reasons a bit more analogous to the CoastRunners behavior (video)—if any one part of the motivational system is (what OP calls) decomposable etc., then the ASI would find the “best solution” to maximizing that part. And if numbers matter, then the “best solution” would presumably be many copies of some microscopic thing.

Comment by Steven Byrnes (steve2152) on Most smart and skilled people are outside of the EA/rationalist community: an analysis · 2024-07-13T22:52:33.923Z · LW · GW

I use rationalist jargon when I judge that the benefits (of pointing to a particular thing) outweigh the costs (of putting off potential readers). And my opinion is that “epistemic status” doesn’t make the cut.

Basically, I think that if you write an “epistemic status” at the top of a blog post, and then delete the two words “epistemic status” while keeping everything else the same, it works just about as well. See for example the top of this post.

Comment by Steven Byrnes (steve2152) on Most smart and skilled people are outside of the EA/rationalist community: an analysis · 2024-07-12T15:40:39.137Z · LW · GW

(this comment is partly self-plagiarized from here)

Before doing any project or entering any field, you need to catch up on existing intellectual discussion on the subject.

I think this is way too strong. There are only so many hours in a day, and they trade off between

  • (A) “try to understand the work / ideas of previous thinkers” and
  • (B) “just sit down and try to figure out the right answer”.

It’s nuts to assert that the “correct” tradeoff is to do (A) until there is absolutely no (A) left to possibly do, and only then do you earn the right to start in on (B). People should do (A) and (B) in whatever ratio is most effective for figuring out the right answer. I often do (B), and I assume that I’m probably reinventing a wheel, but it’s not worth my time to go digging for it. And then maybe someone shares relevant prior work in the comments section. That’s awesome! Much appreciated! And nothing went wrong anywhere in this process! See also here.

A weaker statement would be “People in LW/EA commonly err in navigating this tradeoff, by doing too much (B) and not enough (A).” That weaker statement is certainly true in some cases. And the opposite is true in other cases. We can argue about particular examples, I suppose. I imagine that I have different examples in mind than you do.

~~

To be clear, I think your post has large kernels of truth and I’m happy you wrote it.

Comment by Steven Byrnes (steve2152) on Daniel Kokotajlo's Shortform · 2024-07-12T00:32:46.252Z · LW · GW

If you click my username it goes to my lesswrong user page, which has a “Message” link that you can click.

Comment by Steven Byrnes (steve2152) on Daniel Kokotajlo's Shortform · 2024-07-11T14:21:44.411Z · LW · GW

Related: Arbital postmortem.

Also, if anyone is curious to see another example: in 2007-8 there was a long series of extraordinarily time-consuming and frustrating arguments between me and one particular Wikipedia editor who was very bad at physics but infinitely patient, persistent, and rule-following. (DM me and I can send links … I don’t want to link publicly in case this guy is googling himself and then pops up in this conversation!) The combination of {patient, persistent, rule-following, infinite time to spend, object-level nutso} is a very, very bad combination; it really puts a strain on any system (maybe benevolent dictatorship would solve that problem, while creating other ones). (Gerard also fits that profile, apparently.) Luckily I had about as much free time and persistence as this crackpot physicist did. He ended up getting permanently banned from Wikipedia by the arbitration committee (the Wikipedia supreme court), but boy, it was a hell of a journey to get there.

Comment by Steven Byrnes (steve2152) on Response to Dileep George: AGI safety warrants planning ahead · 2024-07-09T01:29:01.528Z · LW · GW

Thanks! I don’t do super-granular time-tracking, but basically there were 8 workdays where this was the main thing I was working on.

Comment by Steven Byrnes (steve2152) on Response to Dileep George: AGI safety warrants planning ahead · 2024-07-08T20:14:03.094Z · LW · GW

Yeah when I say things like “I expect LLMs to plateau before TAI”, I tend not to say it with the supremely high confidence and swagger that you’d hear from e.g. Yann LeCun, François Chollet, Gary Marcus, Dileep George, etc. I’d be more likely to say “I expect LLMs to plateau before TAI … but, well, who knows, I guess. Shrug.” (The last paragraph of this comment is me bringing up a scenario with a vaguely similar flavor to the thing you’re pointing at.)

Comment by Steven Byrnes (steve2152) on Response to Dileep George: AGI safety warrants planning ahead · 2024-07-08T19:21:22.500Z · LW · GW

I feel like “Will LLMs scale to AGI?” is right up there with “Should there be government regulation of large ML training runs?” as a black-hole-like attractor state that sucks up way too many conversations. :) I want to fight against that: this post is not about the question of whether or not LLMs will scale to AGI.

Rather, this post is conditioned on the scenario where future AGI will be an algorithm that (1) does not involve LLMs, and (2) will be invented by human AI researchers, as opposed to being invented by future LLMs (whether scaffolded, multi-modal, etc. or not). This is a scenario that I want to talk about; and if you assign an extremely low credence to that scenario, then whatever, we can agree to disagree. (If you want to argue about what credence is appropriate, you can try responding to me here or at the links therein, but note that I probably won’t engage; it’s generally not a topic I like to talk about, for “infohazard” reasons [see footnote here if anyone reading this doesn’t know what that means].)

I find that a lot of alignment researchers don’t treat this scenario as their modal expectation, but still assign it like >10% credence, which is high enough that we should be able to agree that thinking through that scenario is a good use of time.

Comment by Steven Byrnes (steve2152) on Neural Categories · 2024-07-08T17:24:09.733Z · LW · GW

I think we’re mostly talking past each other, or emphasizing different things, or something. Oh actually, I think you’re saying “the edges of Network 1 exist”, and I’m saying “the edges & central node of Network 2 can exist”? If so, that’s not a disagreement—both can and do exist. :)

Maybe we should switch away from bleggs/rubes to a real example of coke cans / pepsi cans. There is a central node—I can have a (gestalt) belief that this is a coke can and that is a pepsi can. And the central node is in fact important in practice. For example, if you see some sliver of the label of an unknown can, and then you’re trying to guess what it looks like in another distant part of the can (where the image is obstructed by your hand), then I claim the main pathway used by that query is probably (part of image) → “this is a coke can” (with such-and-such angle, lighting, etc.) → (guess about a distant part of image). I think that’s spiritually closer to a Network 2 type inference.

Granted, there are other cases where we can make inferences without needing to resolve that central node. The Network 1 edges exist too! Maybe that’s all you’re saying, in which case I agree. There are also situations where there is no central node, like my example of car dents / colors / makes.

Separately, I think your neuroanatomy is off—visual object recognition is conventionally associated with the occipital and temporal lobes (cf. “ventral stream”), and has IMO almost nothing to do with the prefrontal cortex. As for a “region where "the blegg neurons"…are, such that if they get killed you (selectively) lose the ability to associate the features of a blegg with other features of a blegg”: if you’re just talking about visual features, then I think the term is “agnosia”, and if it’s more general types of “features”, I think the term is “semantic dementia”. They’re both associated mainly with temporal lobe damage, if I recall correctly, although not the same parts of the temporal lobe.

Comment by Steven Byrnes (steve2152) on Neural Categories · 2024-07-07T12:09:32.622Z · LW · GW

I think I'd vote for: "Network 2 for this particular example with those particular labels, but with the subtext that the central node is NOT a fundamentally different kind of thing from the other five nodes; and also, if you zoom way out to include everything in the whole giant world-model, you also find lots of things that look more like Network 1. As an example of the latter: in the world of cars, their colors, dents, and makes have nonzero probabilistic relations that people can get a sense for ("huh, a beat-up hot-pink Mercedes, don't normally see that...") but it doesn't fit into any categorization scheme."

Comment by Steven Byrnes (steve2152) on Static Analysis As A Lifestyle · 2024-07-05T03:17:16.815Z · LW · GW

just found this survey from 2018

my package “is of somewhat limited use … semantic consistency is detected in a rather obscure way”, haters gonna hate 😂

Comment by Steven Byrnes (steve2152) on How predictive processing solved my wrist pain · 2024-07-04T17:39:21.434Z · LW · GW

Thanks for sharing!!

You can compare and contrast with my version (The “mind-body vicious cycle” model of RSI & back pain). Our differences are pretty minor in the grand scheme of things, but they seem to be generally related to the fact that, unlike you, I reject many of the claims that fall under the Predictive Processing umbrella.

Comment by Steven Byrnes (steve2152) on Static Analysis As A Lifestyle · 2024-07-04T17:15:16.435Z · LW · GW

Units / dimensional analysis in physics is really a kind of type system. I was very big into using that for error checking when I used to do physics and engineering calculations professionally. (Helps with self-documenting too.) I invented my own weird way to do it that would allow units to be used in places where actual proper types & type-checking systems weren’t supported—like most numerical calculation packages, or C, or Microsoft Excel, etc.

the more you constrain what you and others can do, the easier it is to reason about it, and check much more properties. So type systems often forbid programs that would actually run without runtime errors, but which are completely messed up to think about.

Yeah, a case where this came up for me is angles (radians, degrees, steradians, etc.). If you treat radians as a “unit” subject to dimensional analysis, you wind up needing to manually insert and remove the radian unit in a bunch of places, which is somewhat confusing and annoying. Sometimes I found it was a good tradeoff, other times not—I tended to treat angles as proper type-checked units only for telescope and camera design calculations, but not in other contexts like accelerometers or other things where angles were only incidentally involved.
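
To give a flavor of what rolling your own unit checking can look like (this is a toy sketch I’m writing for this comment, in Python rather than the environments I actually used—it’s not my real system): tag each number with a dict of base-unit exponents, refuse to add mismatched tags, and combine tags under multiplication and division. The radian annoyance shows up at the end: math.sin() only takes a bare float, so you have to strip the “rad” tag by hand.

    import math

    class Qty:
        """A number tagged with base-unit exponents, e.g. {'m': 1, 's': -2}."""
        def __init__(self, value, units=None):
            self.value = value
            self.units = {u: p for u, p in (units or {}).items() if p != 0}

        def __add__(self, other):  # addition demands identical units (the "type check")
            if self.units != other.units:
                raise TypeError(f"unit mismatch: {self.units} vs {other.units}")
            return Qty(self.value + other.value, self.units)

        def __mul__(self, other):  # multiplication adds exponents
            u = dict(self.units)
            for k, p in other.units.items():
                u[k] = u.get(k, 0) + p
            return Qty(self.value * other.value, u)

        def __truediv__(self, other):  # division subtracts exponents
            u = dict(self.units)
            for k, p in other.units.items():
                u[k] = u.get(k, 0) - p
            return Qty(self.value / other.value, u)

        def __repr__(self):
            return f"{self.value} {self.units}"

    length = Qty(3.0, {'m': 1})
    time = Qty(2.0, {'s': 1})
    print(length / time)           # 1.5 {'m': 1, 's': -1}
    # length + time                # raises TypeError: unit mismatch

    angle = Qty(0.25, {'rad': 1})  # treating radians as a first-class unit...
    sine = math.sin(angle.value)   # ...means manually stripping the tag before sin()

(My actual Excel/C hacks looked nothing like this; the point is just that “units as types” can be bolted onto pretty much anything.)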

Comment by Steven Byrnes (steve2152) on Jimrandomh's Shortform · 2024-07-03T13:22:40.096Z · LW · GW

Nice. I used collapsed-by-default boxes from time to time when I used to write/edit Wikipedia physics articles—usually (or maybe exclusively) to hide a math derivation that would distract from the flow of the physics narrative / pedagogy. (Example, example, although note that the wikipedia format/style has changed for the worse since the 2010s … at the time I added those collapsed-by-default sections, they actually looked like enclosed gray boxes with black outline, IIRC.)

Comment by Steven Byrnes (steve2152) on Mistakes people make when thinking about units · 2024-06-26T13:15:28.466Z · LW · GW

If you treat units as literal multiplication (…which you totally should! I do that all the time.), then “5 watts” is implied multiplication, and the addition and subtraction rule is the distributive law, and the multiplication and division rules are commutativity-of-multiplication and canceling common factors.

…So really I think a major underlying cause of the problem is that a giant share of the general public has no intuitive grasp whatsoever on implied multiplication, or the distributive law, or commutativity-of-multiplication and canceling common factors etc. :-P
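
Spelled out with a couple of worked lines (my notation, nothing deep):

    5\,\mathrm{W} + 3\,\mathrm{W} = (5 + 3)\cdot\mathrm{W} = 8\,\mathrm{W} \quad \text{(implied multiplication + distributive law)}

    \frac{6\,\mathrm{km}}{2\,\mathrm{h}} = \frac{6}{2}\cdot\frac{\mathrm{km}}{\mathrm{h}} = 3\,\mathrm{km/h} \quad \text{(commutativity + canceling the common factor)}

And “5 W + 3 s” has no common factor to pull out, so the distributive law gives you nothing to simplify—which is the units version of a type error.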

(I like your analogy to the word “only” at the end.)

Comment by Steven Byrnes (steve2152) on Stephen Fowler's Shortform · 2024-06-23T19:42:48.737Z · LW · GW

Thanks for your reply! A couple quick things:

> I don’t even know why the RFLO paper put that criterion in …

I don't have any great insight here, but that's very interesting to think about.

I thought about it a bit more and I think I know what they were doing. I bet they were trying to preempt the pedantic point (related) that everything is an optimization process if you allow the objective function to be arbitrarily convoluted and post hoc. E.g. any trained model M is the global maximum of the objective function “F where F(x)=1 if x is the exact model M, and F(x)=0 in all other cases”. So if you’re not careful, you can define “optimization process” in a way that also includes rocks.

I think they used “explicitly represented objective function” as a straightforward criterion that would be adequate for most applications, but if they had wanted to they could have replaced it with the slightly-more-general notion of “an objective function that can be deduced relatively straightforwardly by inspecting the nuts-and-bolts of the optimization process; in particular, it shouldn’t be a post hoc thing where you have to simulate the entire process of running the (so-called) optimization algorithm in order to answer the question of what the objective is.”
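
(To spell out the pedantic F from two paragraphs up as code, purely for concreteness—the name is mine:)

    def make_post_hoc_objective(M):
        """Return an objective function F whose global maximum is, trivially,
        whatever fixed artifact M (a trained model, a rock, ...) you started with."""
        def F(x):
            return 1.0 if x == M else 0.0
        return F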

I would guess that "clever hardware implementation that performs the exact same weight updates" without an explicitly represented objective function ends up being wildly inefficient.

Oh sorry, that’s not what I meant. For example (see here) the Python code y *= 1.5 - x * y * y / 2 happens to be one iteration of Newton’s method to make y a better approximation to 1/√x. So if you keep running this line of code over and over, you’re straightforwardly running an optimization algorithm that finds the y that minimizes the objective function |x – 1/y²|. But I don't see “|x – 1/y²|” written or calculated anywhere in that one line of source code. The source code skipped the objective and went straight to the update step.
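
Here’s that line in a runnable form, in case anyone wants to watch it converge (x = 2.0 and the starting guess are arbitrary choices for this example):

    x = 2.0   # we want y ≈ 1/sqrt(x) ≈ 0.70711
    y = 0.5   # rough initial guess

    for _ in range(5):
        y *= 1.5 - x * y * y / 2   # the update step; no objective function in sight
        print(y)                   # 0.625, 0.6933..., 0.7067..., then ≈ 0.70711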

I have a vague notion that I’ve seen a more direct example kinda like this in the RL literature. Umm, maybe it was the policy gradient formula used in some version of policy-gradient RL? I recall that (this version of) the policy gradient formula was something involving logarithms, and I was confused for quite a while where this formula came from, until eventually I found an explanation online where someone started with a straightforward intuitive objective function and did some math magic and wound up deriving that policy gradient formula with the logarithms. But the policy gradient formula (= update step) is all you need to actually run that RL process in practice. The actual objective function need not be written or calculated anywhere in the RL source code. (I could be misremembering. This was years ago.)
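
For what it’s worth, the thing I’m presumably half-remembering is the standard log-derivative trick behind REINFORCE-style policy gradients (I’m reconstructing the textbook version here, not whatever source I originally read):

    \nabla_\theta \, \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau) \right]
      = \int \nabla_\theta \pi_\theta(\tau) \, R(\tau) \, d\tau
      = \int \pi_\theta(\tau) \, \nabla_\theta \log \pi_\theta(\tau) \, R(\tau) \, d\tau
      = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau) \, \nabla_\theta \log \pi_\theta(\tau) \right]

The left-hand side is the straightforward intuitive objective (expected return); the right-hand side is the only part an RL implementation needs to estimate from samples, which is how the logarithm ends up in the update formula while the objective itself never appears anywhere in the source code.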

Comment by Steven Byrnes (steve2152) on Connecting the Dots: LLMs can Infer & Verbalize Latent Structure from Training Data · 2024-06-23T14:05:11.931Z · LW · GW

Yeah we already know that LLM training finds underlying patterns that are helpful for explaining / compressing / predicting the training data. Like “the vibe of Victorian poetry”. I’m not sure what you mean by “none of which are present in the training data”. Is the vibe of Victorian poetry present in the training data? I would have said “yeah” but I’m not sure what you have in mind.

One interesting result here, I think, is that the LLM is then able to explicitly write down the definition of f(blah), despite the fact that the fine-tuning training set didn't demand anything like this. That ability – to translate the latent representation of f(blah) into humanese – appeared coincidentally, as the result of the SGD chiseling-in some module for merely predicting f(blah).

I kinda disagree that this is coincidental. My mental image is something like

  1. The earliest layers see inputs of the form f(…)
  2. Slightly later layers get into an activation state that we might describe as “the idea of the function x-176”
  3. The rest of the layers make inferences and emit outputs appropriate to that idea.

I’m claiming that before fine-tuning, everything is already in place except for the 1→2 connection. Fine-tuning just builds the 1→2 connection.

The thing you mention—that an LLM with the idea of x-176 in mind can output the tokens “x-176”—is part of step 3, and therefore (I hypothesize) comes entirely from LLM pretraining, not from this fine-tuning process.

The fact that pretraining can and does build that aspect of step 3 seems pretty much expected to me, not coincidental, as such a connection is obviously useful for predicting GitHub code and math homework and a zillion other things in the training data. It’s also something you can readily figure out by playing with an LLM: if you say “if I subtract 6 from x-170, what do I get?”, then it’s obviously able to output the tokens “x-176”.

Comment by Steven Byrnes (steve2152) on Connecting the Dots: LLMs can Infer & Verbalize Latent Structure from Training Data · 2024-06-23T02:26:43.807Z · LW · GW

Here’s how I’m thinking of this result right now. Recall that we start with a normal LLM, and then 32,000 times (or whatever) we gradient-update it such that its f(blah) = blah predictions are better.

The starting LLM has (in effect) a ton of information-rich latent variables / representations that comprise the things that the LLM “understands” and can talk about. For example, obviously the concept of “x-176” is a thing that the LLM can work with, answer questions about, and so on. So there has to be something in the guts of the LLM that’s able to somehow represent that concept.

Anyway “f(blah)” doesn’t start out triggering any of these latent representations in particular. (Or, well, it triggers the ones related to how f(blah) is used in internet text, e.g. as a generic math function.) But it does presumably trigger all of them to some tiny random extent. And then each of the 32,000 gradient descent update steps will strengthen the connection between “f(blah)” and the particular “concept” / latent representation / whatever of “x-176”. …Until eventually the fine-tuned LLM is strongly invoking this preexisting “concept” / representation / activation-state / whatever of “x-176”, whenever it sees “f”. And then yeah of course it can answer questions about “f” and so on—we already know that LLMs can do those kinds of things if you activate that same “x-176” concept the old-fashioned way (by writing it in the context window).

(To be clear, I don’t think the variable name “f” is an important ingredient here; in fact, I didn’t understand the discussion of why that would ever be expected in the first place. For example, in the mixture-of-functions case, the LLM would be gradually getting tweaked such that an input of the type “User: [number]” activates such-and-such concept in the guts of the LLM. Or in fact, maybe in that case the LLM is getting tweaked such that any input at all activates such-and-such concept in the guts of the LLM!)

(Also, to be clear, I don’t think it’s necessary or important that the LLM has already seen the specific function “x-176” during pretraining. Whether it has seen that or not, the fact remains that I can log in and ask GPT-4 to talk about “x-176” right now, and it can easily do so. So, like I said, there has to be something in the guts of the LLM that’s able to somehow represent that “concept”, and whatever that thing is, fine-tuning gradient descent will eventually tweak the weights such that that thing gets triggered by the input “f(blah)”.) (Indeed, it should be super obvious and uncontroversial to say that LLMs can somehow represent and manipulate “concepts” that were nowhere in the training data—e.g. “Gandalf as a Martian pirate”.)

Anyway, in my mental model spelled out above, I now think your results are unsurprising, and I also think your use of the word “reasoning” seems a bit dubious, unless we’re gonna say that LLMs are “reasoning” every time they output a token, which I personally probably wouldn’t say, but whatever. (I also have long complained about the term “in-context learning” so maybe I’m just a stick-in-the-mud on these kinds of things.)

[not really my area of expertise, sorry if I said anything stupid.]