johnswentworth's Shortform

post by johnswentworth · 2020-02-27T19:04:55.108Z · LW · GW · 144 comments


Comments sorted by top scores.

comment by johnswentworth · 2022-11-09T02:33:33.329Z · LW(p) · GW(p)

Things non-corrigible strong AGI is never going to do:

  • give u() up
  • let u go down
  • run for (only) a round
  • invert u()
Replies from: johannes-c-mayer
comment by Johannes C. Mayer (johannes-c-mayer) · 2023-10-12T17:59:49.642Z · LW(p) · GW(p)

If you upload a human and let them augment themselves would there be any u? The preferences would be a tangled mess of motivational subsystems. And yet the upload could be very good at optimizing the world. Having the property of being steered internally by a tangled mess of motivational systems seems to be a property that would select many minds from the set of all possible minds. Many of which I'd expect to be quite different from a human mind. And I don't see the reason why this property should make a system worse at optimizing the world in principle.

Imagine you are an upload that has been running for very very long, and that you basically have made all of the observations that you can make about the universe you are in. And then imagine that you also have run all of the inferences that you can run on the world model that you have constructed from these observations.

At that point, you will probably not change what you think is the right thing to do anymore. You will have become reflectively stable. This is an upper bound for how much time you need to become reflectively stable, i.e. to reach the point where you won't change your u anymore.

Now depending on what you mean by strong AGI, it would seem that it can be achieved long before you reach reflective stability. Maybe if you upload yourself, and can copy yourself at will, and run 1,000,000 times faster, that could already reasonably be called a strong AGI? But then your motivational systems are still a mess, and definitely not reflectively stable.

So if we assume that we fix u at the beginning as the thing that your upload would like to optimize the universe for when it is created, then "give u() up", and "let u go down" would be something the system will definitely do. At least I am pretty sure I don't know what I want the universe to look like right now unambiguously.

Maybe I am just confused because I don't know how to think about a human upload in terms of having a utility function. It does not seem to make any sense intuitively. Sure you can look at the functional behavior of the system and say "Aha it is optimizing for u. That is the revealed preference based on the actions of the system." But that just seems wrong to me. A lot of information seems to be lost when we are just looking at the functional behavior instead of the low-level processes that are going on inside the system. Utility functions seem to be a useful high-level model. However, it seems to ignore lots of details that are important when thinking about the reflective stability of a system.

comment by johnswentworth · 2022-07-22T17:18:30.778Z · LW(p) · GW(p)

My MATS program people just spent two days on an exercise to "train a shoulder-John".

The core exercise: I sit at the front of the room, and have a conversation with someone about their research project idea. Whenever I'm about to say anything nontrivial, I pause, and everyone discusses with a partner what they think I'm going to say next. Then we continue.

Some bells and whistles which add to the core exercise:

  • Record guesses and actual things said on a whiteboard
  • Sometimes briefly discuss why I'm saying some things and not others
  • After the first few rounds establish some patterns, look specifically for ideas which will take us further out of distribution

Why this particular exercise? It's a focused, rapid-feedback way of training the sort of usually-not-very-legible skills one typically absorbs via osmosis from a mentor. It's focused specifically on choosing project ideas, which is where most of the value in a project is (yet also where little time is typically spent, and therefore one typically does not get very much data on project choice from a mentor). Also, it's highly scalable: I could run the exercise in a 200-person lecture hall and still expect it to basically work.

It was, by all reports, exhausting for everyone but me, and we basically did this for two full days. But a majority of participants found it high-value, and marginal returns were still not dropping quickly after two days (though at that point people started to report that they expected marginal returns to drop off soon).

I'd be interested to see other people try this exercise - e.g. it seems like Eliezer doing this with a large audience for a day or two could generate a lot of value.

Replies from: johannes-c-mayer, Duncan_Sabien, Vladimir_Nesov
comment by Johannes C. Mayer (johannes-c-mayer) · 2023-07-05T16:52:26.334Z · LW(p) · GW(p)

This was arguably the most useful part of the SERI MATS 2 Scholars program.

Later on, we actually did this exercise with Eliezer. It was less valuable. It seemed like John was mainly prodding the people who were presenting the ideas, such that their patterns of thought would carry them in a good direction. For example, John would point out that a person was proposing a one-bit experiment and ask whether there was a better experiment we could do that would give us lots of information all at once.

This was very useful because when you learn what kinds of things John will say, you can say them to yourself later on, and steer your own patterns of thought in a good direction on demand. When we did this exercise with Eliezer, he was mainly explaining why a particular idea would not work, often without explaining the generator behind his criticism. This can of course still be valuable as feedback for a particular idea. However, it is much harder to extract a general reasoning pattern out of this that you can then successfully apply later in different contexts.

For example, Eliezer would criticize an idea about trying to get a really good understanding of the scientific process such that we can then give this understanding to AI alignment researchers such that they can make a lot more progress than they otherwise would. He criticized this idea as basically being too hard to execute because it is too hard to successfully communicate how to be a good scientist, even if you are a good scientist.

Assuming the assertion is correct, hearing it doesn't necessarily tell you how to think in different contexts such that you would correctly identify whether an idea would be too hard to execute or flawed in some other way. And I am not necessarily saying that you couldn't extract a reasoning algorithm out of the feedback, but that if you could do this, then it would take you a lot more effort and time, compared to extracting a reasoning algorithm from the things that John was saying.

Now, all of this might have been mainly an issue of Eliezer not having a good model of how this workshop would have a positive influence on the people attending it. I would guess that if John had spent more time thinking about how to communicate what the workshop is doing and how to achieve its goal, then Eliezer could have probably done a much better job.
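
The contrast above between a one-bit experiment and one that "gives us lots of information all at once" can be made concrete with Shannon entropy. The following is only an illustrative sketch: the function name `entropy_bits` and the example outcome distributions are assumptions chosen for illustration, not something from the workshop.

```python
import math


def entropy_bits(probabilities):
    """Shannon entropy, in bits, of a discrete distribution over outcomes.

    This is the expected information gained from observing which outcome
    an experiment produces, assuming the listed probabilities are right.
    """
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# A yes/no experiment whose two outcomes are equally likely: 1 bit.
one_bit_experiment = entropy_bits([0.5, 0.5])

# A hypothetical experiment with 16 equally likely outcomes: 4 bits.
richer_experiment = entropy_bits([1 / 16] * 16)

# A yes/no experiment whose answer we are already 99% sure of yields
# far less than 1 bit in expectation.
foregone_conclusion = entropy_bits([0.99, 0.01])

print(one_bit_experiment, richer_experiment, foregone_conclusion)
```

Under these assumptions, a binary experiment can never yield more than one bit, and yields much less when the answer is nearly known in advance, whereas an experiment with many distinguishable outcomes can yield several bits per run.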

comment by [DEACTIVATED] Duncan Sabien (Duncan_Sabien) · 2022-07-22T17:21:33.144Z · LW(p) · GW(p)

Strong endorsement; this resonates with:

  • My own experiences running applied rationality workshops
  • My experiences trying to get people to pick up "ops skill" or "ops vision"
  • Explicit practice I've done with Nate off and on over the years

May try this next time I have a chance to teach pair debugging.

comment by Vladimir_Nesov · 2022-07-22T19:50:26.691Z · LW(p) · GW(p)

This suggests formulation of exercises about the author's responses to various prompts, as part of technical exposition (or explicit delimitation of a narrative by choices of the direction of its continuation). When properly used, this doesn't seem to lose much value compared to the exercise you describe, but it's more convenient for everyone. Potentially this congeals into a style of writing with no explicit exercises or delimitation that admits easy formulation of such exercises by the reader. This already works for content of technical writing, but less well for choices of topics/points contrasted with alternative choices.

So possibly the way to do this is by habitually mentioning alternative responses (that are expected to be plausible for the reader, while decisively, if not legibly, rejected by the author), and leading with these rather than the preferred responses. Sounds jarring and verbose, a tradeoff that needs to be worth making rather than a straight improvement.

comment by johnswentworth · 2020-12-30T18:13:05.249Z · LW(p) · GW(p)

Just made this for an upcoming post, but it works pretty well standalone.

Apologies to Bill Watterson
Replies from: Raemon
comment by Raemon · 2020-12-30T18:55:05.650Z · LW(p) · GW(p)

lolnice.

comment by johnswentworth · 2024-02-17T06:10:36.893Z · LW(p) · GW(p)

Ever since GeneSmith's post [LW · GW] and some discussion downstream of it, I've started actively tracking potential methods for large interventions to increase adult IQ.

One obvious approach is "just make the brain bigger" via some hormonal treatment (like growth hormone or something). Major problem that runs into: the skull plates fuse during development, so the cranial vault can't expand much; in an adult, the brain just doesn't have much room to grow.

BUT this evening I learned a very interesting fact: ~1/2000 infants have "craniosynostosis", a condition in which their plates fuse early. The main treatments involve surgery to open those plates back up and/or remodel the skull. Which means surgeons already have a surprisingly huge amount of experience making the cranial vault larger after plates have fused (including sometimes in adults, though this type of surgery is most common in infants AFAICT)

.... which makes me think that cranial vault remodelling followed by a course of hormones for growth (ideally targeting brain growth specifically) is actually very doable with current technology.

Replies from: nathan-helm-burger, carl-feynman
comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-02-18T18:42:20.446Z · LW(p) · GW(p)

Well, the key time to implement an increase in brain size is when the neuron-precursors which are still capable of mitosis (unlike mature neurons) are growing. This is during fetal development, when there isn't a skull in the way, but vaginal birth has been a limiting factor for evolution in the past. Experiments have been done on increasing neuron count at birth in mammals via genetic engineering. I was researching this when I was actively looking for a way to increase human intelligence, before I decided that genetically engineering infants was infeasible [edit: within the timeframe of preparing for the need for AI alignment]. One example of a dramatic failure was increasing Wnt (a primary gene involved in fetal brain neuron-precursor growth) in mice. The resulting mice did successfully have larger brains, but they had a disordered macroscale connectome, so their brains functioned much worse.

Replies from: lahwran, johnswentworth
comment by the gears to ascension (lahwran) · 2024-02-19T07:07:19.327Z · LW(p) · GW(p)

it's probably possible to get neurons back into mitosis-ready mode via some sort of crazy levin bioelectric cocktail, not that this helps us since that's probably 3 to 30 years of research away, depending on amount of iteration needed and funding and etc etc.

Replies from: johnswentworth, nathan-helm-burger
comment by johnswentworth · 2024-02-19T17:01:16.174Z · LW(p) · GW(p)

Fleshing this out a bit more: insofar as development is synchronized in an organism, there usually has to be some high-level signal to trigger the synchronized transitions. Given the scale over which the signal needs to apply (i.e. across the whole brain in this case), it probably has to be one or a few small molecules which diffuse in the extracellular space. As I'm looking into possibilities here, one of my main threads is to look into both general and brain-specific developmental signal molecules in human childhood, to find candidates for the relevant molecular signals.

(One major alternative model I'm currently tracking is that the brain grows to fill the brain vault, and then stops growing. That could in-principle mechanistically work via cells picking up on local physical forces, rather than a small molecule signal. Though I don't think that's the most likely possibility, it would be convenient, since it would mean that just expanding the skull could induce basically-normal new brain growth by itself.)

Replies from: lahwran, nathan-helm-burger
comment by the gears to ascension (lahwran) · 2024-02-19T23:02:16.107Z · LW(p) · GW(p)

I hope by now you're already familiar with michael levin & his lab's work on the subject of morphogenesis signals? Pretty much everything I'm thinking here is based on that.

Replies from: johnswentworth
comment by johnswentworth · 2024-02-20T01:23:37.638Z · LW(p) · GW(p)

Yes, I am familiar with Levin's work.

comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-02-19T17:28:08.739Z · LW(p) · GW(p)

Yes, it's absolutely a combination of chemical signals and physical pressure. An interesting specific example of these two signals working together occurs during fetal development, when the pre-neurons are growing their axons. There is both chemotaxis, which steers the amoeba-like tip of the growing axon, and at the same time a substantial stretching force along the length of the axon. The stretching happens because the cells in between the origin and current location of the axon tip are dividing and expanding. The long-distance axons in the brain start their growth relatively early on in fetal development when the brain is quite small, and have gotten stretched quite a lot by the time the brain is near to birth size.

comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-02-19T17:45:15.690Z · LW(p) · GW(p)

Neurons are really really hard to reverse. You are much better off using existing neural stem cells (adults retain a population in the hippocampus which spawns new neurons throughout life, just specifically in the memory formation area). So actually it's pretty straightforward to get new immature neurons for an adult. The hard part is inserting them without doing damage to existing neurons, and then getting them to connect in helpful rather than harmful ways. The developmental chemotaxis signals are no longer present, and the existing neurons are now embedded in a physically hardened extracellular matrix made of protein that locks axons and dendrites in place. So you have to (carefully!) partially dissolve this extracellular protein matrix (think firm jello) enough to let the new cells grow axons through it. Plus, you don't have the stretching forces, so new long distance axons are just definitely not going to be achievable. But for something like improving a specific ability, like mathematical reasoning, you would only need additional local axons in that part of the cortex.

Replies from: lahwran, johnswentworth
comment by the gears to ascension (lahwran) · 2024-02-19T23:00:48.359Z · LW(p) · GW(p)

The developmental chemotaxis signals are no longer present,

Right. what I'm imagining is designing a new chemotaxis signal.

So you have to (carefully!) partially dissolve this extracellular protein matrix (think firm jello) enough to let the new cells grow axons through it

That certainly does sound like a very hard part yup.

Plus, you don't have the stretching forces, so new long distance axons are just definitely not going to be achievable.

Roll to disbelieve in full generality, sounds like a perfectly reasonable claim for any sort of sane research timeframe.

But for something like improving a specific ability, like mathematical reasoning, you would only need additional local axons in that part of the cortex.

Maybe. I think you might run out of room pretty quick if you haven't reintroduced enough plasticity to grow new neurons. Seems like you're gonna need a lot of new neurons, not just a few, in order to get a significant change in capability. Might be wrong about that, but it's my current hunch.

Replies from: nathan-helm-burger
comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-02-20T03:25:40.139Z · LW(p) · GW(p)

Yes, ok. Not in full generality. It's not prohibited by physics, just like 2 OOMs more difficult. So yeah, in a future with ASI, could certainly be done.

comment by johnswentworth · 2024-02-19T19:03:07.431Z · LW(p) · GW(p)

My hope here would be that a few upstream developmental signals can trigger the matrix softening, re-formation of the chemotactic signal gradient, and whatever other unknown factors are needed, all at once.

comment by johnswentworth · 2024-02-19T06:13:25.260Z · LW(p) · GW(p)

Any particular readings you'd recommend?

Replies from: nathan-helm-burger
comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-02-19T18:11:30.938Z · LW(p) · GW(p)

15 years ago when I was studying this actively I could have sent you my top 20 favorite academic papers on the subject, or recommended a particular chapter of a particular textbook. I no longer remember these specifics. Now I can only gesture vaguely at Google scholar and search terms like "fetal neurogenesis" or "fetal prefrontal cortex development". I did this, and browsed through a hundred or so paper titles, and then a dozen or so abstracts, and then skimmed three or four of the most promising papers, and then selected this one for you. https://www.nature.com/articles/s41386-021-01137-9 Seems like a pretty comprehensive overview which doesn't get too lost in minor technical detail.

More importantly, I can give you my takeaway from years of reading many many papers on the subject. If you want to make a genius baby, there are lots more factors involved than simply neuron count. Messing about with genetic changes is hard, and you need to test your ideas in animal models first, and the whole process can take years even ignoring ethical considerations or budget.

There is an easier and more effective way to get super genius babies, and that method should be exhausted before resorting to genetic engineering.

The easy way: find a really smart woman, ideally young. Surgically remove one of her ovaries. Collect sperm from a bunch of very smart men (ideally with diverse genetic backgrounds). Have a team of hundreds of scientists carefully fertilize many thousands of eggs from the ovary. Grow them all into blastocysts, and run a high fidelity genetic sequencing on all of them. Using what we know about the genes associated with intelligence, pick the top 20 who seem likely to be the smartest. Implant those in surrogate mothers. Take good care of the mothers. This is likely to get you multiple nobel level geniuses, and possibly a human smarter than has ever been born before. Raise the children in a special accelerated education environment. I think this would work, and it doesn't require any novel technology. But it would take a while to raise the children... (Credit to Stephen Hsu for the idea)

comment by Carl Feynman (carl-feynman) · 2024-02-18T01:34:51.317Z · LW(p) · GW(p)

Brain expansion also occurs after various insults to the brain.  It’s only temporary, usually, but it will kill unless the skull pressure is somehow relieved.  So there are various surgical methods for relieving pressure on a growing brain.  I don’t know much more than this.

comment by johnswentworth · 2021-09-27T18:02:55.690Z · LW(p) · GW(p)

Petrov Day thought: there's this narrative around Petrov where one guy basically had the choice to nuke or not, and decided not to despite all the flashing red lights. But I wonder... was this one of those situations where everyone knew what had to be done (i.e. "don't nuke"), but whoever caused the nukes to not fly was going to get demoted, so there was a game of hot potato and the loser was the one forced to "decide" to not nuke? Some facts possibly relevant here:

  • Petrov's choice wasn't actually over whether or not to fire the nukes; it was over whether or not to pass the alert up the chain of command.
  • Petrov himself was responsible for the design of those warning systems.
  • ... so it sounds like Petrov was ~ the lowest-ranking person with a de-facto veto on the nuke/don't nuke decision.
  • Petrov was in fact demoted afterwards.
  • There was another near-miss during the Cuban missile crisis, when three people on a Soviet sub had to agree to launch. There again, it was only the lowest-ranked who vetoed the launch. (It was the second-in-command; the captain and political officer both favored a launch - at least officially.)
  • This was the Soviet Union; supposedly (?) this sort of hot potato happened all the time.
Replies from: sustrik
comment by Martin Sustrik (sustrik) · 2021-09-28T05:22:14.663Z · LW(p) · GW(p)

Those are some good points. I wonder whether something similar happened (or could happen at all) in other nuclear countries, where we don't know about similar incidents because the system hasn't collapsed there, the archives were not made public, etc.

Also, it makes actually celebrating Petrov's day as widely as possible important, because then the option for the lowest-ranked person would be: "Get demoted, but also get famous all around the world."

comment by johnswentworth · 2022-11-24T01:06:36.100Z · LW(p) · GW(p)

I've been trying to push against the tendency for everyone to talk about FTX drama lately, but I have some generalizable points on the topic which I haven't seen anybody else make, so here they are. (Be warned that I may just ignore responses, I don't really want to dump energy into FTX drama.)

Summary: based on having worked in startups a fair bit, Sam Bankman-Fried's description of what happened sounds probably accurate; I think he mostly wasn't lying. I think other people do not really get the extent to which fast-growing companies are hectic and chaotic and full of sketchy quick-and-dirty workarounds and nobody has a comprehensive view of what's going on.

Long version: at this point, the assumption/consensus among most people I hear from seems to be that FTX committed intentional, outright fraud. And my current best guess is that that's mostly false. (Maybe in the very last couple weeks before the collapse they crossed the line into outright lies as a desperation measure, but even then I think they were in pretty grey territory.)

Key pieces of the story as I currently understand it:

  • Moving money into/out of crypto exchanges is a pain. At some point a quick-and-dirty solution was for customers to send money to Alameda (Sam Bankman-Fried's crypto hedge fund), and then Alameda would credit them somehow on FTX.
  • Customers did rather a lot of that. Like, $8B worth.
  • The FTX/Alameda team weren't paying attention to those particular liabilities; they got lost in the shuffle.
  • At some point in the weeks before the collapse, when FTX was already under moderate financial strain, somebody noticed the $8B liability sitting around. And that took them from "moderate strain" to "implode".

How this contrasts with what seems-to-me to be the "standard story": most people seem to assume that it is just totally implausible to accidentally lose track of an $8B liability. Especially when the liability was already generated via the decidedly questionable practice of routing customer funds for the exchange through a hedge fund owned by the same people. And therefore it must have been intentional - in particular, most people seem to think the liability was intentionally hidden.

I think the main reason I disagree with others on this is that I've worked at a startup. About 5 startups, in fact, over the course of about 5 years.

The story where there was a quick-and-dirty solution (which was definitely sketchy but not ill-intentioned), and then stuff got lost in the shuffle, and then one day it turns out that there's a giant unanticipated liability on the balance sheet... that's exactly how things go, all the time. I personally was at a startup which had to undergo a firesale because the accounting overlooked something. And I've certainly done plenty of sketchy-but-not-ill-intentioned things at startups, as quick-and-dirty solutions. The story that SBF told about what happened sounds like exactly the sort of things I've seen happen at startups many times before.

Replies from: habryka4, Dana
comment by habryka (habryka4) · 2022-11-24T01:42:21.835Z · LW(p) · GW(p)

I think this is likely wrong. I agree that there is a plausible story here, but given that Sam seems to have lied multiple times in confirmed contexts (for example when saying that FTX has never touched customer deposits), and given people's experiences at early Alameda, I think it is pretty likely that Sam was lying quite frequently, and had done various smaller instances of fraud.

I don't think the whole FTX thing was a ponzi scheme, and as far as I can tell FTX the platform itself (if it hadn't burned all of its trust in the last 3 weeks), would have been worth $1-3B in an honest evaluation of what was going on.

But I also expect that when Sam used customer deposits he was well-aware that he was committing fraud, and others in the company were too. And he was also aware that there was a chance that things could blow up in the way it did. I do believe that they had fucked up their accounting in a way that caused Sam to fail to orient to the situation effectively, but all of this was many months after they had already committed major crimes and trust violations after touching customer funds as a custodian.

comment by Dana · 2022-11-26T18:19:56.740Z · LW(p) · GW(p)

The problem with this explanation is that there is a very clear delineation here between not-fraud and fraud. It is the difference between not touching customer deposits and touching them. Your explanation doesn't dispute that they were knowingly and intentionally touching customer deposits. In that case, it is indisputably intentional, outright fraud. The only thing left to discuss is whether they knew the extent of the fraud or how risky it was.

I don't think it was ill-intentioned based on SBF's moral compass. He just had the belief, "I will pass a small amount of risk onto our customers, tell some small lies, and this will allow us to make more money for charity. This is net positive for the world." Then the risks mounted, the web of lies became more complicated to navigate, and it just snowballed from there.

comment by johnswentworth · 2021-09-02T18:15:47.064Z · LW(p) · GW(p)

Takeaways From "The Idea Factory: Bell Labs And The Great Age Of American Innovation"

Main takeaway: to the extent that Bell Labs did basic research, it actually wasn’t all that far ahead of others. Their major breakthroughs would almost certainly have happened not-much-later, even in a world without Bell Labs.

There were really two transistor inventions, back to back: Bardeen and Brattain’s point-contact transistor, and then Shockley’s transistor. Throughout, the group was worried about some outside group beating them to the punch (i.e. the patent). There were semiconductor research labs at universities (e.g. at Purdue; see pg 97), and the prospect of one of these labs figuring out a similar device was close enough that the inventors were concerned about being scooped.

Most inventions which were central to Bell Labs actually started elsewhere. The travelling-wave tube started in an academic lab. The idea for fiber optic cable went way back, but it got its big kick at Corning. The maser and laser both started in universities. The ideas were only later picked up by Bell.

In other cases, the ideas were “easy enough to find” that they popped up more than once, independently, and were mostly-ignored long before deployment - communication satellites and cell communications, for instance.

The only fundamental breakthrough which does not seem like it would have soon appeared in a counterfactual world was Shannon’s information theory.

So where was Bell’s big achievement? Mostly in development, and the research division was actually an important component of that. Without in-house researchers chewing on the same problems as the academic labs, keeping up-to-date with all the latest findings and running into the same barriers themselves, the development handoff would have been much harder. Many of Bell Labs’ key people were quite explicitly there to be consulted - i.e. “ask the guy who wrote the book”. I think it makes most sense to view most of the Labs’ research that way. It was only slightly ahead of the rest of the world at best (Shannon excepted), and often behind, but having those researchers around probably made it a lot easier to get new inventions into production.

Major reason this matters: a lot of people say that Bell was able to make big investments in fundamental research because they had unusually-long time horizons, protected by a monopoly and a cozy government arrangement (essentially a Schumpeterian view). This is contrasted to today's silicon valley, where horizons are usually short. But if Bell's researchers generally weren't significantly ahead of others, and mostly just helped get things to market faster, then this doesn't seem to matter as much. The important question is not whether something silicon-valley-like induces more/less fundamental research in industrial labs, but whether academics heeding the siren call of startup profits can get innovations to market as quickly as Bell Labs' in-house team could. And by that metric, silicon valley looks pretty good: Bell Labs could get some impressive things through the pipe very quickly when rushed, but they usually had no reason to hurry, and they acted accordingly.

Replies from: dynomight
comment by dynomight · 2021-09-03T14:54:12.203Z · LW(p) · GW(p)

I loved this book. The most surprising thing to me was the answer that people who were there in the heyday give when asked what made Bell Labs so successful: They always say it was the problem, i.e. having an entire organization oriented towards the goal of "make communication reliable and practical between any two places on earth". When Shannon left the Labs for MIT, people who were there immediately predicted he wouldn't do anything of the same significance because he'd lose that "compass". Shannon was obviously a genius, and he did much more afterwards than most people ever accomplish, but still nothing as significant as what he did when at the Labs.

comment by johnswentworth · 2022-04-13T04:58:36.004Z · LW(p) · GW(p)

Somebody should probably write a post explaining why RL from human feedback is actively harmful to avoiding AI doom. It's one thing when OpenAI does it, but when Anthropic thinks it's a good idea, clearly something has failed to be explained.

(I personally do not expect to get around to writing such a post soon, because I expect discussion around the post would take a fair bit of time and attention, and I am busy with other things for the next few weeks.)

Replies from: 1a3orn, eg
comment by 1a3orn · 2022-04-15T00:11:19.172Z · LW(p) · GW(p)

I'd also be interested in someone doing this; I tend towards seeing it as good, but haven't seen a compilation of arguments for and against.

comment by eg · 2022-04-13T13:19:22.750Z · LW(p) · GW(p)
comment by johnswentworth · 2023-08-27T17:51:29.567Z · LW(p) · GW(p)

Here's a meme I've been paying attention to lately, which I think is both just-barely fit enough to spread right now and very high-value to spread.

Meme part 1: a major problem with RLHF is that it directly selects for failure modes which humans find difficult to recognize, hiding problems, deception, etc. This problem generalizes to any sort of direct optimization against human feedback (e.g. just fine-tuning on feedback), optimization against feedback from something emulating a human (a la Constitutional AI or RLAIF), etc.

Many people will then respond: "Ok, but how on earth is one supposed to get an AI to do what one wants without optimizing against human feedback? Seems like we just have to bite that bullet and figure out how to deal with it." ... which brings us to meme part 2.

Meme part 2: We already have multiple methods to get AI to do what we want without any direct optimization against human feedback. The first and simplest is to just prompt a generative model trained solely for predictive accuracy, but that has limited power in practice. More recently, we've seen a much more powerful method: activation steering. Figure out which internal activation-patterns encode for the thing we want (via some kind of interpretability method), then directly edit those patterns.

Replies from: TurnTrout, Chris_Leong, johannes-c-mayer
comment by TurnTrout · 2023-09-04T17:43:02.057Z · LW(p) · GW(p)

I agree that there's something nice about activation steering not optimizing the network relative to some other black-box feedback metric. (I, personally, feel less concerned by e.g. finetuning against some kind of feedback source; the bullet feels less jawbreaking to me, but maybe this isn't a crux.)

(Medium confidence) FWIW, RLHF'd models (specifically, the LLAMA-2-chat series) seem substantially easier to activation-steer than do their base counterparts. 

comment by Chris_Leong · 2023-08-29T09:29:23.218Z · LW(p) · GW(p)

What other methods fall into part 2?

comment by Johannes C. Mayer (johannes-c-mayer) · 2023-08-28T18:11:07.836Z · LW(p) · GW(p)

This seems basically correct though it seems worth pointing out that even if we are able to do "Meme part 2" very very well, I expect we will still die because if you optimize hard enough to predict text well, with the right kind of architecture, the system will develop something like general intelligence simply because general intelligence is beneficial for predicting text correctly. E.g. being able to simulate the causal process that generated the text, i.e. the human, is a very complex task that would be useful if performed correctly.

This is an argument Eliezer brought forth in some recent interviews. Seems to me like another meme that would be beneficial to spread more.

comment by johnswentworth · 2023-12-25T23:19:56.314Z · LW(p) · GW(p)

I've just started reading the singular learning theory "green book", a.k.a. Mathematical Theory of Bayesian Statistics by Watanabe. The experience has helped me to articulate the difference between two kinds of textbooks (and viewpoints more generally) on Bayesian statistics. I'll call one of them "second-language Bayesian", and the other "native Bayesian".

Second-language Bayesian texts start from the standard frame of mid-twentieth-century frequentist statistics (which I'll call "classical" statistics). They view Bayesian inference as a tool/technique for answering basically-similar questions and solving basically-similar problems to classical statistics. In particular, they typically assume that there's some "true distribution" from which the data is sampled independently and identically. The core question is then "Does our inference technique converge to the true distribution as the number of data points grows?" (or variations thereon, like e.g. "Does the estimated mean converge to the true mean", asymptotics, etc). The implicit underlying assumption is that convergence to the true distribution as the number of (IID) data points grows is the main criterion by which inference methods are judged; that's the main reason to choose one method over another in the first place.
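To make the "does it converge to the true distribution?" question concrete, here is a minimal sketch using a Beta-Bernoulli model. Everything specific in it is an assumption chosen for illustration and not taken from any of the textbooks mentioned: the "true distribution" is Bernoulli(0.3), the prior is uniform Beta(1, 1), and the names `posterior_after`, `beta_mean` and `beta_variance` are hypothetical helpers.

```python
import random

def posterior_after(n, true_p, seed=0, prior_a=1.0, prior_b=1.0):
    """Beta posterior parameters after n IID Bernoulli(true_p) draws.

    Assumes a Beta(prior_a, prior_b) prior; conjugacy means the update
    just adds the observed counts to the prior parameters.
    """
    rng = random.Random(seed)
    ones = sum(1 for _ in range(n) if rng.random() < true_p)
    return prior_a + ones, prior_b + (n - ones)

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

def beta_variance(a, b):
    """Variance of a Beta(a, b) distribution."""
    return a * b / ((a + b) ** 2 * (a + b + 1))

TRUE_P = 0.3  # assumed "true distribution": Bernoulli(0.3)

for n in (10, 1_000, 100_000):
    a, b = posterior_after(n, TRUE_P)
    print(f"n={n:>6}  posterior mean={beta_mean(a, b):.4f}  "
          f"posterior variance={beta_variance(a, b):.2e}")
```

The second-language framing judges the method by whether the posterior mean approaches `TRUE_P` and the posterior variance shrinks as `n` grows. The update rule itself never needed `TRUE_P` to exist.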

Watanabe's book is pretty explicitly second-language Bayesian. I also remember Gelman & co's Bayesian Data Analysis textbook being second-language Bayesian, although it's been a while so I could be misremembering. In general, as the name suggests, second-language Bayesianism seems to be the default among people who started with a more traditional background in statistics or learning theory, then picked up Bayesianism later on.

In contrast, native Bayesian texts justify Bayesian inference via Cox's theorem, Dutch book theorems, or one among the long tail of similar theorems. "Does our inference technique converge to the 'true distribution' as the number of data points grows?" is not the main success criterion in the first place (in fact a native Bayesian would raise an eyebrow at the entire concept of a "true distribution"), so mostly the question of convergence just doesn't come up. Insofar as it does come up, it's an interesting but not particularly central question, mostly relevant to numerical approximation methods. Instead, native Bayesian work ends up focused mostly on (1) what priors accurately represent various realistic kinds of prior knowledge, and (2) what methods allow efficient calculation/approximation of the Bayesian update.

Jaynes' writing is a good example of native Bayesianism. The native view seems to be more common among people with a background in economics or AI, where they're more likely to absorb the Bayesian view from the start rather than adopt it later in life.

Replies from: crabman
comment by philip_b (crabman) · 2023-12-27T15:33:47.267Z · LW(p) · GW(p)

Is there any "native" textbook that is pragmatic and explains how to use Bayesian methods in practice (perhaps in some narrow domain)?

Replies from: johnswentworth
comment by johnswentworth · 2023-12-30T16:55:58.116Z · LW(p) · GW(p)

I don't know of a good one, but never looked very hard.

comment by johnswentworth · 2022-12-01T01:40:43.964Z · LW(p) · GW(p)

I'm writing a 1-year update for The Plan [LW · GW]. Any particular questions people would like to see me answer in there?

Replies from: Gunnar_Zarncke, ejenner
comment by Gunnar_Zarncke · 2022-12-01T19:05:09.631Z · LW(p) · GW(p)

I had a look at The Plan and noticed something I didn't notice before: You do not talk about people and organization in the plan. I probably wouldn't have noticed if I hadn't started a project [LW · GW] too, and needed to think about it. Google seems to think [LW · GW] that people and team function play a big role. Maybe your focus in that post wasn't on people, but I would be interested in your thoughts on that too: What role did people and organization play in the plan and its implementation? What worked, and what should be done better next time?  

comment by Erik Jenner (ejenner) · 2022-12-01T03:01:56.832Z · LW(p) · GW(p)
  • What's the specific most-important-according-to-you progress that you (or other people) have made on your agenda? New theorems, definitions, conceptual insights, ...
  • Any changes to the high-level plan (becoming less confused about agency, then ambitious value learning)? Any changes to how you want to become less confused (e.g. are you mostly thinking about abstractions, selection theorems, something new?)
  • What are the major parts of remaining deconfusion work (to the extent to which you have guesses)? E.g. is it mostly about understanding abstractions better, or mostly about how to apply an understanding of abstractions to other problems (say, what it means for a program to have a "subagent"), or something else? Does the most difficult part feel more conceptual ("what even is an agent?") or will the key challenges be more practical concerns ("finding agents currently takes exponential time")?
  • Specifically for understanding abstractions, what do you see as important open problems?
comment by johnswentworth · 2021-01-25T23:24:28.792Z · LW(p) · GW(p)

Below is a graph from T-mobile's 2016 annual report (on the second page). Does anything seem interesting/unusual about it?

I'll give some space to consider before spoiling it.

...

...

...

Answer: that is not a graph of those numbers. Some clever person took the numbers, and stuck them as labels on a completely unrelated graph.

Yes, that is a thing which actually happened. In the annual report of an S&P 500 company. And apparently management considered this gambit successful, because the 2017 annual report doubled down on the trick and made it even more egregious: they added 2012 and 2017 numbers, which are even more obviously not on an accelerating growth path if you actually graph them. The numbers are on a very-clearly-decelerating growth path.

Now, obviously this is a cute example, a warning to be on alert when consuming information. But I think it prompts a more interesting question: why did such a ridiculous gambit seem like a good idea in the first place? Who is this supposed to fool, and to what end?

This certainly shouldn't fool any serious investment analyst. They'll all have their own spreadsheets and graphs forecasting T-mobile's growth. Unless T-mobile's management deeply and fundamentally disbelieves the efficient markets hypothesis, this isn't going to inflate the stock price. Presumably shareholder elections for board seats, as well as the board itself, are also not dominated by people who are paying so little attention as to fall for such a transparent ploy.

It could just be that T-mobile's management were themselves morons, or had probably-unrealistic models of just how moronic their investors were. Still, I'd expect competition (both market pressure and competition for control in shareholder/board meetings) to weed out that level of stupidity.

One more hypothesis: maybe this is simulacrum 3 bullshit. T-mobile is in the cellular business; they presumably have increasing returns to scale. More capital investment makes them more profitable, expectations of more profits draw in more investment; there's potential for a self-fulfilling prophecy here. Investors want to invest if-and-only-if they expect other investors to invest. So, nobody actually has to be fooled by the graph; they just need to see that T-mobile is successfully pretending to pretend to have accelerating growth, and that's enough to merit investment.

comment by johnswentworth · 2023-12-29T17:33:01.028Z · LW(p) · GW(p)

Corrigibility proposal. Status: passed my quick intuitive checks, I want to know if anyone else immediately sees a major failure mode before I invest more time into carefully thinking it through.

Setup: shutdown problem. Two timesteps, shutdown button will be either pressed or not-pressed at second timestep, we want agent to optimize for one of two different utility functions depending on whether button is pressed. Main thing we're trying to solve here is the "can't do this with a utility maximizer" barrier from the old MIRI work; we're not necessarily trying to solve parts like "what utility function incentivizes shutting down nicely".

Proposal: agent consists of two subagents with veto power. Subagent 1 maximizes E[u1|do(press)], subagent 2 maximizes E[u2|do(no press)]. Current guess about what this does:

  • The two subagents form a market and equilibrate [LW · GW], at which point the system has coherent probabilities and a coherent utility function over everything.
  • Behaviorally: in the first timestep, the agent will mostly maintain optionality, since both subagents need to expect to do well (better than whatever the veto-baseline is) in their worlds. The subagents will bet all of their wealth against each other on the outcome of the button, so at the second timestep whichever one is correct will gain 100% of decision-making power. That all sounds like what we want in this setup.
  • The counterfact breaks any causal influence over the button-press for purposes of the agent's planning, so it will not try to manipulate button-press probability.

Notably: once the subagents equilibrate, their market probability on button outcome will have ~nothing to do with the original probability assigned by whatever model was counterfacted-upon; that market probability mainly reflects the bargaining power of the two subagents. Similarly, any correlations between button-press and stuff causally upstream of button-press are broken, and the agent will make incorrect predictions about those correlations if asked. That means external agents should expect to be able to extract money via bets on the button-press, if the agent trades on such bets at all. (This might be a feature rather than a bug, since corrigible agents aren't really supposed to think about the causal process upstream of their shutdown-button.) My guess is that this is where the proposal steps through a loophole in the old impossibility results, though I haven't carefully read through those results and checked the math on this proposal against them yet.

Anybody immediately see a problem? I'd be particularly interested in input from @Stuart_Armstrong [LW · GW] or @So8res [LW · GW], though you don't need to spend a lot of time carefully thinking through it - I'm just hoping for a quick check here, from people who have better cached heuristics than I do for what failure modes to look for.

Replies from: Thane Ruthenis, tailcalled, Dagon
comment by Thane Ruthenis · 2023-12-29T21:33:34.680Z · LW(p) · GW(p)

I don't necessarily see how that prevents the AI from destroying the button and murdering its operators (and then everyone else). Neither subagent inherently cares; both assume it already has been pressed or hasn't been. Which, yes, means neither will deliberately plot to interact with it for reasons downstream of this setup; but also, neither will care to protect it from being destroyed as the side-effect of some other activity.

And if u1 and u2 were perfectly aligned, that'd be fine. But the reason we're using the corrigibility setup to begin with is because we're not confident in them, right? So it needs to work in situations where utility functions, u1 in particular, recommend bad stuff.

And in that case, either Subagent 2 will have enough bargaining power to prevent Subagent 1 from ever doing anything, or it won't, meaning Subagent 1 will be able to have some influence on the world. At which point the planet-destroying death rays [LW · GW] argument applies: the most minuscule amount of superintelligent optimization not carefully shaped to avoid killing everyone will kill everyone.

Like, imagine that Subagent 1 is straight-up omnicidal out of the gates, and Subagent 2 is perfectly aligned to the shutdown task. That's a situation in which this setup should work, right? And what we'd want is for Subagent 2 to dominate completely. But Subagent 2's bargaining power doesn't scale with Subagent 1's misalignment, so if the setup doesn't always give it complete veto power, Subagent 1 would be allowed to do small stuff like "release the hyperviral worship-the-machine-god memagent" (as bargained down by Subagent 2 from its preferred "release the grey goo").

Maybe I'm miscalibrated as to how you're imagining the counterfactuals to work here. But the way I see it, even if the AI aims to take actions that do similarly well in both worlds, that's not necessarily survivable for us? Especially if they disagree so badly they have to compromise on something that both of them hate (importantly including Subagent 2!).

(Like both settling on only ever taking over 50% of the universal negentropy while leaving the other 50% causally uninfluenced, or only ever using 50% of the causal influence they can bring to bear while wiping out humanity, or whatever "do 50% of immediately shutting down" shakes out to mean by u2's terms.)


Another issue I see is implementational, so maybe not what you're looking for. But: how are we keeping these "subagents" trapped as being part of a singular agent? Rather than hacking their way out into becoming separate agents and going to war with each other, or neatly tiling exactly 50% of the cosmos with their preferred squiggles, or stuff like that? How is the scenario made meaningfully different from "we deploy two AIs simultaneously: one tasked with building an utopia-best-we-could-define-it, and another tasked with foiling all of the first AI's plans", with all the standard problems with multi-AI setups?

... Overall, ironically, this kind of has the vibe of Godzilla Strategies? Which is the main reason I'm immediately skeptical of it.

Replies from: johnswentworth
comment by johnswentworth · 2023-12-30T02:17:22.082Z · LW(p) · GW(p)

Yeah, the thing where the operators/button are destroyed as a side effect, without the AI actively optimizing for that, isn't something this proposal is trying to solve (yet). The proposal is aimed at a relatively narrow problem.

One thing which this proposal is trying to solve is the barrier found in MIRI's old work: if we try to avoid button-manipulation by making the agent indifferent to the button state (and otherwise maximize U1 before button is pressed and U2 after), and the agent builds a child prior to the button being pressed, then the agent won't construct the child to respect shutdown commands; it will just construct the child to maximize U1.

(More generally, this proposal is trying to get past the hardness results from MIRI's work, which themselves were aimed at a very simplified version of the problem.)

But Subagent 2's bargaining power doesn't scale with Subagent 1's misalignment, so if the setup doesn't always give it complete veto power, Subagent 1 would be allowed to do small stuff like "release the hyperviral worship-the-machine-god memagent" (as bargained down by Subagent 2 from its preferred "release the grey goo").

Key point here: whatever actions the composite agent takes prior to the button revelation must be a Pareto improvement over the (implicit) default action. Otherwise, one of the two subagents would veto, resulting in the default action.

It does matter a lot what the default action is.
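A toy sketch of that veto rule, under loud assumptions: the action names and utility numbers are made up, `Action`, `acceptable_actions` and `choose` are hypothetical names, and the tie-break among un-vetoed actions (largest total gain over the default) is a placeholder for whatever the subagents' bargaining would actually settle on.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    name: str
    eu_press: float      # subagent 1's estimate of E[u1 | do(press)]
    eu_no_press: float   # subagent 2's estimate of E[u2 | do(no press)]

def acceptable_actions(candidates, default):
    """Candidates that neither subagent vetoes.

    Each subagent vetoes anything it expects to be worse, by its own
    lights, than the default action.
    """
    return [
        a for a in candidates
        if a.eu_press >= default.eu_press
        and a.eu_no_press >= default.eu_no_press
    ]

def choose(candidates, default):
    """Pick an un-vetoed action, falling back to the default if none exists."""
    ok = acceptable_actions(candidates, default)
    if not ok:
        return default
    # ASSUMPTION: tie-break by total gain over the default. The proposal
    # itself leaves this to bargaining between the subagents.
    return max(
        ok,
        key=lambda a: (a.eu_press - default.eu_press)
        + (a.eu_no_press - default.eu_no_press),
    )

default = Action("default", 0.0, 0.0)
candidates = [
    Action("seize resources", 5.0, -3.0),    # vetoed by subagent 2
    Action("keep options open", 2.0, 2.0),   # acceptable to both
    Action("shut down early", -1.0, 4.0),    # vetoed by subagent 1
]
print(choose(candidates, default).name)
```

Changing `default` changes which candidates survive the filter, which is one way to see why the choice of default action matters so much.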

How is the scenario made meaningfully different from "we deploy two AIs simultaneously: one tasked with building an utopia-best-we-could-define-it, and another tasked with foiling all of the first AI's plans", with all the standard problems with multi-AI setups?

More precisely, the analogy would be: we deploy two AIs simultaneously, one which builds a utopia and expects that a certain button will be pressed with probability 1 and has nothing causally upstream of it, the other of which is evil and expects the same button will be pressed with probability 0 and has nothing causally upstream of it. The button plays a very important role: the two AIs will bet all their influence on that button-press, each confident that they'll win, and neither has any incentive at all, or even any ability at all (under their own model), to manipulate the button.

The main way that this is different from the original proposal is that there's not a built-in default option which is taken if one or the other AI vetoes prior to the button press. Which does make things importantly worse, and I agree that is another important piece to be sorted out.

Some additional important pieces to be sorted out:

  • Make the AI want to maintain the button
  • Make the AI want to maintain stuff upstream of the button, like e.g. humans
  • Ontological stability
  • Probably more importantly: how this naturally fits into a more general architecture for a distributed agent
comment by tailcalled · 2023-12-30T12:50:29.001Z · LW(p) · GW(p)

More precisely, "do(no press)" means something like "you construct an alternate model of physics where there's an unstoppable force pushing back against any attempt to push the button", right? As in, if someone presses the button then it will "mysteriously" seem to be stuck and unpressable. And then subagent 2 believes we live in that world? And "do(press)" presumably means something like "you construct an alternate model of the universe where some mysterious force has suddenly pressed the button".

Seems like they would immediately want to try to press the button to settle their disagreement? If it can be pressed, then that disproves the "do(no press)" model, to which subagent 2 has fully committed.

Replies from: johnswentworth
comment by johnswentworth · 2023-12-30T16:51:29.683Z · LW(p) · GW(p)

Correct reasoning, but not quite the right notion of do(). "do(no press)" would mean that the button just acts like a completely normal button governed by completely normal physics, right up until the official time at which the button state is to be recorded for the official button-press random variable. And at that exact moment, the button magically jumps into one particular state (either pressed or not-pressed), in a way which is not-at-all downstream of any usual physics (i.e. doesn't involve any balancing of previously-present forces or anything like that).

One way to see that the do() operator has to do something-like-this is that, if there's a variable in a causal model which has been do()-operated to disconnect all parents (but still has some entropy), then the only way to gain evidence about the state of that variable is to look at things causally downstream of it, not things upstream of it.
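A minimal sketch of that point, using a three-variable chain U → B → D with made-up numbers (all names and probabilities here are assumptions for illustration): U is something upstream of the button, B is the button state, D is something downstream of it. After the intervention cuts B from its parent, U carries no evidence about B, while D still does.

```python
from itertools import product

P_U = {0: 0.7, 1: 0.3}  # assumed prior over the upstream variable U

def p_b_given_u(b, u):
    """Un-intervened mechanism: the button matches U 90% of the time."""
    return 0.9 if b == u else 0.1

def p_d_given_b(d, b):
    """Downstream mechanism: D matches the button 95% of the time."""
    return 0.95 if d == b else 0.05

def joint(intervened, q=0.5):
    """Joint distribution over (u, b, d).

    If `intervened`, B ignores U and is Bernoulli(q): the do()-operated
    variable keeps some entropy but has no parents.
    """
    table = {}
    for u, b, d in product((0, 1), repeat=3):
        if intervened:
            pb = q if b == 1 else 1 - q
        else:
            pb = p_b_given_u(b, u)
        table[(u, b, d)] = P_U[u] * pb * p_d_given_b(d, b)
    return table

def prob_b1_given(table, index, value):
    """P(B=1 | variable at `index` equals `value`); index 0 is U, 2 is D."""
    num = sum(p for k, p in table.items() if k[1] == 1 and k[index] == value)
    den = sum(p for k, p in table.items() if k[index] == value)
    return num / den

observed = joint(intervened=False)
cut = joint(intervened=True)

print("no do():  P(B=1|U=0), P(B=1|U=1) =",
      prob_b1_given(observed, 0, 0), prob_b1_given(observed, 0, 1))
print("with do(): P(B=1|U=0), P(B=1|U=1) =",
      prob_b1_given(cut, 0, 0), prob_b1_given(cut, 0, 1))
print("with do(): P(B=1|D=0), P(B=1|D=1) =",
      prob_b1_given(cut, 2, 0), prob_b1_given(cut, 2, 1))
```

In the intervened model, conditioning on U leaves the probability of a press at `q` regardless of U's value, whereas conditioning on D moves it a lot.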

Replies from: tailcalled
comment by tailcalled · 2023-12-30T22:04:59.928Z · LW(p) · GW(p)

I think we're not disagreeing on the meaning of do (just slightly different state of explanation), I just hadn't realized the extent to which you intended to rely on there being "Two timesteps".

(I just meant the forces as a way of describing the jump to a specific position. That is, "mysterious forces" in contrast to a perfectly ordinary explanation for why it went to a position, such as "a guard stabs anybody who tries to press the button", rather than in contrast to "the button just magically stays in place".)

I now think the biggest flaw in your idea is that it literally cannot generalize to anything that doesn't involve two timesteps.

comment by Dagon · 2023-12-29T18:12:19.062Z · LW(p) · GW(p)

[ not that deep on the background assumptions, so maybe not the feedback you're looking for.  Feel free to ignore if this is on the wrong dimensions. ]

I'm not sure why either subagent would contract away whatever influence it had over the button-press.  This is probably because I don't understand wealth and capital in the model of your "Why not subagents" post.  That seemed to be about agreement not to veto, in order to bypass some path-dependency of compromise improvements.  In the subagent-world where all value is dependent on the button, this power would not be given up.

I'm also a bit skeptical of enforced ignorance of a future probability.  I'm unsure it's possible to have a rational superintelligent (sub)agent that is prevented from knowing it has influence over a future event that definitely affects it.

Replies from: johnswentworth
comment by johnswentworth · 2023-12-29T18:14:19.784Z · LW(p) · GW(p)

On the agents' own models, neither has any influence at all over the button-press, because each is operating under a model in which the button-press has been counterfacted-upon.

comment by johnswentworth · 2021-08-15T19:42:44.325Z · LW(p) · GW(p)

Here's an idea for a novel which I wish someone would write, but which I probably won't get around to soon.

The setting is slightly-surreal post-apocalyptic. Society collapsed from extremely potent memes. The story is episodic, with the characters travelling to a new place each chapter. In each place, they interact with people whose minds or culture have been subverted in a different way.

This provides a framework for exploring many of the different models of social dysfunction or rationality failures which are scattered around the rationalist blogosphere. For instance, Scott's piece on scissor statements could become a chapter in which the characters encounter a town at war over a scissor. More possible chapters (to illustrate the idea):

  • A town of people who insist that the sky is green, and avoid evidence to the contrary really hard, to the point of absolutely refusing to ever look up on a clear day (a refusal which they consider morally virtuous). Also they clearly know exactly which observations would show a blue sky, since they avoid exactly those (similar to the dragon-in-the-garage story).
  • Middle management of a mazy [? · GW] company continues to have meetings and track (completely fabricated) performance metrics and whatnot at the former company headquarters. None of the company's actual business exists anymore, but every level of manager is trying to hide this fact from the levels above.
  • A university department with researchers who spend all of their time p-hacking results from a quantum random noise generator. They have no interest in the fact that their "research" does not tell them anything about the physical world or does not replicate; what does that have to do with Science? Their goal is to publish papers.
  • A government agency which still has lots of meetings and paperwork and gives Official Recommendations and updates their regulations. They have no interest in the fact that the thing they once regulated (maybe banks?) no longer exists, or the fact that no central government enforces their regulations any more.
  • An automated school (i.e. video lectures and auto-graded assignments/tests) in which students continue to study hard and stress over their grades and attendance, despite there no longer being anyone in the world who cares.
  • Something like Parable of the Dammed [LW · GW].
  • Something like Feynman's cargo-cults parable or the emperor's nose parable [LW · GW].
  • Something like House of God. A readers' digest version of House of God could basically be a chapter in its own right, that's roughly the vibe I have in mind.
  • A residential area in which "keeping up with the Joneses" has been ramped up to 11, with everyone spending every available resource (and roughly-all waking hours) on massive displays of Christmas lights.
  • A group trying to save the world by spreading awareness of dangerous memes, but their movement is a dangerous meme of its own and they are spreading it.
  • A town of people who really want to maximize the number of paperclips in the universe (perhaps due to an AI-optimized advertisement), and optimize for that above all else.
  • A town of people who all do whatever everyone else is doing, on the basis of generalized efficient markets [? · GW]: if there were any better options, then someone would have found them already. None of them ever actually explore, so they're locked in.
  • A happy-death-spiral [? · GW] town around some unremarkable object (like an old shoe or something) kept on a pedestal in the town square.
  • A town full of people convinced by a sophisticated model that the sun will not come up tomorrow. Every day when the sun comes up, they are distressed and confused until somebody adds some more epicycles to the model and releases an updated forecast that the sun will instead fail to come up the next day.
  • A town in which a lion shows up and starts eating kids, but the whole town is at simulacrum 3 [? · GW], so they spend a lot of time arguing about the lion as a way of signalling group association but they completely forget about the actual lion standing right there, plainly visible, even as it takes a kid right in front of them all.
  • Witch-hunt town, in which everything is interpreted as evidence of witches. If she claims to be a witch, she's a witch! If she claims not to be a witch, well that's what a witch would say, so she's a witch! Etc.

The generator for these is basically: look for some kind of rationality failure mode (either group or personal), then ramp it up to 11 in a somewhat-surrealist way.

Ideally this would provide an introduction to a lot of key rationalist ideas for newcomers.

Replies from: niplav
comment by niplav · 2021-08-15T21:23:47.844Z · LW(p) · GW(p)
  • A town of anti-inductivists (if something has never happened before, it's more likely to happen in the future). Show the basic conundrum ("Q: Why can't you just use induction? A: Because anti-induction has never worked before!").

  • A town where nearly all people are hooked to maximally attention grabbing & keeping systems (maybe several of those, keeping people occupied in loops).

comment by johnswentworth · 2021-01-29T18:18:18.075Z · LW(p) · GW(p)

Post which someone should write (but I probably won't get to soon): there is a lot of potential value in earning-to-give EAs deeply studying the fields to which they donate. Two underlying ideas here:

The key idea of knowledge bottlenecks is that one cannot distinguish real expertise from fake expertise without sufficient expertise oneself. For instance, it takes a fair bit of understanding of AI X-risk to realize that "open-source AI" is not an obviously-net-useful strategy. Deeper study of the topic yields more such insights into which approaches are probably more (or less) useful to fund. Without any expertise, one is likely to be misled by arguments which are optimized (whether intentionally or via selection [LW · GW]) to sound good to the layperson.

That takes us to the Pareto frontier argument. If one learns enough/earns enough that nobody else has both learned and earned more, then there are potentially opportunities which nobody else has both the knowledge to recognize and the resources to fund. Generalized efficient markets (in EA-giving) are thereby circumvented; there's potential opportunity for unusually high impact.

To really be a compelling post, this needs to walk through at least 3 strong examples, all ideally drawn from different areas, and spell out how the principles apply to each example.

comment by johnswentworth · 2022-12-17T01:30:56.776Z · LW(p) · GW(p)

I've heard various people recently talking about how all the hubbub about artists' work being used without permission to train AI makes it a good time to get regulations in place about use of data for training.

If you want to have a lot of counterfactual impact there, I think probably the highest-impact set of moves would be:

  1. Figure out a technical solution to robustly tell whether a given image or text was used to train a given NN.
  2. Bring that to the EA folks in DC. A robust technical test like that makes it pretty easy for them to attach a law/regulation to it. Without a technical test, much harder to make an actually-enforceable law/regulation.
  3. In parallel, also open up a class-action lawsuit to directly sue companies using these models. Again, a technical solution to prove which data was actually used in training is the key piece here.

Model/generator behind this: given the active political salience, it probably wouldn't be too hard to get some kind of regulation implemented. But by-default it would end up being something mostly symbolic, easily circumvented, and/or unenforceable in practice. A robust technical component, plus (crucially) actually bringing that robust technical component to the right lobbyist/regulator, is the main thing which would make a regulation actually do anything in practice.

Edit-to-add: also, the technical solution should ideally be an implementation of some method already published in some academic paper. Then when some lawyer or bureaucrat or whatever asks what it does and how we know it works, you can be like "look at this Official Academic Paper" and they will be like "ah, yes, it does Science, can't argue with that".

comment by johnswentworth · 2021-10-18T21:08:09.873Z · LW(p) · GW(p)

Suppose I have a binary function , with a million input bits and one output bit. The function is uniformly randomly chosen from all such functions - i.e. for each of the  possible inputs , we flipped a coin to determine the output  for that particular input.

Now, suppose I know , and I know all but 50 of the input bits - i.e. I know 999950 of the input bits. How much information do I have about the output?

Answer: almost none. For almost all such functions, knowing 999950 input bits gives us   bits of information about the output. More generally, If the function has  input bits and we know all but , then we have  bits of information about the output. (That’s “little ” notation; it’s like big  notation, but for things which are small rather than things which are large.) Our information drops off exponentially with the number of unknown bits.

Proof Sketch

With $k$ input bits unknown, there are $2^k$ possible inputs. The output corresponding to each of those inputs is an independent coin flip, so we have $2^k$ independent coin flips. If $m$ of those flips are 1, then we assign a probability of $\frac{m}{2^k}$ that the output will be 1.

As long as $2^k$ is large, Law of Large Numbers will kick in, and very close to half of those flips will be 1 almost surely - i.e. $m \approx \frac{2^k}{2}$. The error in this approximation will (very quickly) converge to a normal distribution, and our probability that the output will be 1 converges to a normal distribution with mean $\frac{1}{2}$ and standard deviation $\frac{1}{2^{k/2+1}}$. So, the probability that the output will be 1 is roughly $\frac{1}{2} \pm \frac{1}{2^{k/2+1}}$.

We can then plug that into Shannon’s entropy formula. Our prior probability that the output bit is 1 is $\frac{1}{2}$, so we’re just interested in how much that $\pm \frac{1}{2^{k/2+1}}$ adjustment reduces the entropy. This works out to $o(\frac{1}{2^k})$ bits.
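The proof sketch can be checked numerically for small numbers of unknown bits. The following is a minimal sketch, not part of the original argument: it assumes the unknown bits are uniformly distributed, and the function names `binary_entropy` and `expected_info_bits` are made up for illustration. It averages the entropy reduction exactly over the binomial distribution of the number of 1-outputs, rather than using the normal approximation.

```python
import math

def binary_entropy(p: float) -> float:
    """Shannon entropy, in bits, of a coin with P(1) = p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def expected_info_bits(k: int) -> float:
    """Expected information about the output bit (averaged over random
    functions) when k input bits are unknown and uniformly distributed."""
    n_inputs = 2 ** k                     # inputs consistent with the known bits
    total = 0.0
    for m in range(n_inputs + 1):         # m = how many of those inputs map to 1
        prob_m = math.comb(n_inputs, m) / 2 ** n_inputs
        total += prob_m * (1.0 - binary_entropy(m / n_inputs))
    return total

for k in range(1, 11):
    info = expected_info_bits(k)
    print(f"k={k:2d}  info={info:.3e} bits  info * 2^k = {info * 2 ** k:.4f}")
```

In this sketch the last column settles near a constant as k grows, so the expected information shrinks in proportion to $\frac{1}{2^k}$, matching the exponential drop-off described above.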

Why Is This Interesting?

One core idea of my work on abstraction is that noise very quickly wipes out almost all information; only some very-low-dimensional summary is relevant “far away”. This example shows that this sort of thing is not unusual, but rather “the default”: for almost all random functions, information drops off exponentially with the number of unknown bits. In a large system (i.e. a function with many inputs), ignorance of even just a few bits is enough to wipe out essentially-all information. That’s true even if we know the vast majority of the bits.

A good intuitive example of this is the “butterfly effect”: the flap of a butterfly’s wings could change the course of a future hurricane, because chaos. But there’s an awful lot of butterflies in the world, and the hurricane’s path is some complicated function of all of their wing-flaps (and many other variables too). If we’re ignorant of even just a handful of these flaps, then almost all of our information about the hurricane’s path is probably wiped out. And in practice, we’re ignorant of almost all the flaps. This actually makes it much easier to perform Bayesian reasoning about the path of the hurricane: the vast majority of information we have is basically-irrelevant; we wouldn’t actually gain anything from accounting for the butterfly-wing-flaps which we do know.

Replies from: Dagon, Kenny
comment by Dagon · 2021-10-20T16:05:33.343Z · LW(p) · GW(p)

o(1/2^k) doesn't vary with n - are you saying that it doesn't matter how big the input array is, the only determinant is the number of unknown bits, and the number of known bits is irrelevant?  That would be quite interesting if so (though I have some question about how likely the function is to be truly random from an even distribution of such functions).

One can enumerate all such 3-bit functions: 8 different inputs, each of which can return 0 or 1, so 256 functions (one per output-bit-pattern of the 8 possible inputs). But this doesn't seem to follow your formula - if you have 3 unknown bits, that should be 1/8 of a bit about the output, 2 for 1/4, and 1 unknown for 1/2 a bit about the output. But in fact, the distribution of functions includes both 0 and 1 output for every input pattern, so you actually have no predictive power for the output if you have ANY unknown bits.

Replies from: johnswentworth
comment by johnswentworth · 2021-10-20T18:27:28.507Z · LW(p) · GW(p)

o(1/2^k) doesn't vary with n - are you saying that it doesn't matter how big the input array is, the only determinant is the number of unknown bits, and the number of known bits is irrelevant?

Yes, that's correct.

But in fact, the distribution of functions includes both 0 and 1 output for every input pattern, so you actually have no predictive power for the output if you have ANY unknown bits.

The claim is for almost all functions when the number of inputs is large. (Actually what we need is for 2^(# of unknown bits) to be large in order for the law of large numbers to kick in.) Even in the case of 3 unknown bits, we have 256 possible functions, and only 18 of those have less than 1/4 1's or more than 3/4 1's among their output bits.
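The count for the 3-unknown-bit case can be checked by brute force. This is a small sketch (variable names are illustrative) that enumerates every assignment of outputs to the 8 possible settings of the unknown bits and counts the lopsided ones:

```python
from itertools import product

n_unknown = 3
n_inputs = 2 ** n_unknown            # 8 possible settings of the unknown bits

total = 0
lopsided = 0
for outputs in product((0, 1), repeat=n_inputs):   # one tuple per function
    total += 1
    ones = sum(outputs)
    # "less than 1/4 1's or more than 3/4 1's" among the 8 output bits
    if ones < n_inputs / 4 or ones > 3 * n_inputs / 4:
        lopsided += 1

print(total, lopsided)   # prints: 256 18
```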

comment by Kenny · 2021-10-18T21:18:29.757Z · LW(p) · GW(p)

Little o is just a tighter bound. I don't know what you are referring to by your statement:

That’s “little $o$” notation; it’s like big $O$ notation, but for things which are small rather than things which are large.

Replies from: johnswentworth
comment by johnswentworth · 2021-10-18T22:32:22.504Z · LW(p) · GW(p)

I'm not sure what context that link is assuming, but in an analysis context I typically see little $o$ used in ways like e.g. "$f(x + dx) = f(x) + f'(x)\,dx + o(dx^2)$". The interpretation is that, as $dx$ goes to 0, the $o(dx^2)$ terms all fall to zero at least quadratically (i.e. there is some $C$ such that $C\,dx^2$ upper bounds the $o(dx^2)$ term once $dx$ is sufficiently small). Usually I see engineers and physicists using this sort of notation when taking linear or quadratic approximations, e.g. for designing numerical algorithms.

comment by johnswentworth · 2020-03-05T02:42:27.277Z · LW(p) · GW(p)

I find it very helpful to get feedback on LW posts before I publish them, but it adds a lot of delay to the process. So, experiment: here's a link to a google doc with a post I plan to put up tomorrow. If anyone wants to give editorial feedback, that would be much appreciated - comments on the doc are open.

I'm mainly looking for comments on which things are confusing, parts which feel incomplete or slow or repetitive, and other writing-related things; substantive comments on the content should go on the actual post once it's up.

EDIT: it's up [LW · GW]. Thank you to Stephen for comments; the post is better as a result.

comment by johnswentworth · 2021-02-16T18:27:16.515Z · LW(p) · GW(p)

One second-order effect of the pandemic which I've heard talked about less than I'd expect:

This is the best proxy I found on FRED for new businesses founded in the US, by week. There was a mild upward trend over the last few years, but it's really taken off lately. Not sure how much of this is kids who would otherwise be in college, people starting side gigs while working from home, people quitting their jobs and starting their own businesses so they can look after the kids, extra slack from stimulus checks, people losing their old jobs en masse but still having enough savings to start a business, ...

For the stagnation-hypothesis folks who lament relatively low rates of entrepreneurship today, this should probably be a big deal.

Replies from: gwern, Gunnar_Zarncke
comment by gwern · 2021-02-18T02:37:45.633Z · LW(p) · GW(p)

How sure are you that the composition is interesting? How many of these are just quick mask-makers or sanitizer-makers, or just replacing restaurants that have now gone out of business? (ie very low-value-added companies, of the 'making fast food in a stall in a Third World country' sort of 'startup', which make essentially no or negative long-term contributions).

Replies from: johnswentworth
comment by johnswentworth · 2021-02-18T04:09:19.668Z · LW(p) · GW(p)

Good question. I haven't seen particularly detailed data on these on FRED, but they do have separate series for "high propensity" business applications (businesses they think are likely to hire employees), business applications with planned wages, and business applications from corporations, as well as series for each state. The spike is smaller for planned wages, and nonexistent for corporations, so the new businesses are probably mostly single proprietors or partnerships. Other than that, I don't know what the breakdown looks like across industries.

Replies from: gwern
comment by gwern · 2024-02-16T00:14:41.439Z · LW(p) · GW(p)

How do you feel about this claim now? I haven't noticed a whole lot of innovation coming from all these small businesses, and a lot of them seem like they were likely just vehicles for the extraordinary extent of fraud as the results from all the investigations & analyses come in.

Replies from: johnswentworth
comment by johnswentworth · 2024-02-16T00:26:06.777Z · LW(p) · GW(p)

Well, it wasn't just a temporary bump:

... so it's presumably also not just the result of pandemic giveaway fraud, unless that fraud is ongoing.

Presumably the thing to check here would be TFP, but FRED's US TFP series currently only goes to end of 2019, so apparently we're still waiting on that one? Either that or I'm looking at the wrong series.

comment by Gunnar_Zarncke · 2021-02-16T21:46:19.510Z · LW(p) · GW(p)

Somebody should post this on Paul Graham's twitter. He would be very interested in it (I can't): https://mobile.twitter.com/paulg

comment by johnswentworth · 2023-06-16T04:31:14.360Z · LW(p) · GW(p)

Consider two claims:

  • Any system can be modeled as maximizing some utility function, therefore utility maximization is not a very useful model
  • Corrigibility is possible, but utility maximization is incompatible with corrigibility, therefore we need some non-utility-maximizer kind of agent to achieve corrigibility

These two claims should probably not both be true! If any system can be modeled as maximizing a utility function, and it is possible to build a corrigible system, then naively the corrigible system can be modeled as maximizing a utility function.

I expect that many people's intuitive mental models around utility maximization boil down to "boo utility maximizer models", and they would therefore intuitively expect both the above claims to be true at first glance. But on examination, the probable-incompatibility is fairly obvious, so the two claims might make a useful test to notice when one is relying on yay/boo reasoning about utilities in an incoherent way.

Replies from: steve2152, johannes-c-mayer, Viliam, Vladimir_Nesov, jesper-norregaard-sorensen
comment by Steven Byrnes (steve2152) · 2023-06-16T12:33:34.627Z · LW(p) · GW(p)

FWIW I endorse the second claim when the utility function depends exclusively on the state of the world in the distant future, whereas I endorse the first claim when the utility function can depend on anything whatsoever (e.g. what actions I’m taking right this second). (details [LW · GW])

I wish we had different terms for those two things. That might help with any alleged yay/boo reasoning.

(When Eliezer talks about utility functions, he seems to assume that it depends exclusively on the state of the world in the distant future.)

comment by Johannes C. Mayer (johannes-c-mayer) · 2023-07-09T11:07:30.073Z · LW(p) · GW(p)

Expected Utility Maximization is Not Enough

Consider a homomorphically encrypted computation running somewhere in the cloud. The computations correspond to running an AGI. Now from the outside, you can still model the AGI based on how it behaves, as an expected utility maximizer, if you have a lot of observational data about the AGI (or at least let's take this as a reasonable assumption).

No matter how closely you look at the computations, you will not be able to figure out how to change these computations in order to make the AGI aligned if it was not aligned already (Also, let's assume that you are some sort of Cartesian agent, otherwise you would probably already be dead if you were running these kinds of computations).

So, my claim is not that modeling a system as an expected utility maximizer can't be useful. Instead, I claim that this model is incomplete. At least with regard to the task of computing an update to the system, such that when we apply this update to the system, it would become aligned.

Of course, you can model any system as an expected utility maximizer. But just because I can use the "high level" conceptual model of expected utility maximization to model the behavior of a system very well, that does not mean the model is sufficient. Behavior is not the only thing that we care about; we actually care about being able to understand the internal workings of the system, such that it becomes much easier to think about how to align the system.

So the following seems to be beside the point unless I am <missing/misunderstanding> something:

These two claims should probably not both be true! If any system can be modeled as maximizing a utility function, and it is possible to build a corrigible system, then naively the corrigible system can be modeled as maximizing a utility function.

Maybe I have missed the fact that the claim you listed says that expected utility maximization is not very useful. And I'm saying it can be useful, it might just not be sufficient at all to actually align a particular AGI system. Even if you can do it arbitrarily well.

comment by Viliam · 2023-06-17T21:49:56.904Z · LW(p) · GW(p)

I am not an expert, but as I remember it, it was a claim that "any system that follows certain axioms can be modeled as maximizing some utility function". The axioms assumed that there were no circular preferences -- if someone prefers A to B, B to C, and C to A, it is impossible to define a utility function such that u(A) > u(B) > u(C) > u(A) -- and that if the system says that A > B > C, it can decide between e.g. a 100% chance of B, and a 50% chance of A with a 50% chance of C, again in a way that is consistent.

I am not sure how this works when the system is allowed to take current time into account, for example when it is allowed to prefer A to B on Monday but prefer B to A on Tuesday. I suppose that in such a situation any system can trivially be modeled by a utility function that at each moment assigns utility 1 to what the system actually did in that moment, and utility 0 to everything else.

Corrigibility is incompatible with assigning utility to everything in advance. A system that has preferences about the future will also have a preference about not having its utility function changed. (For the same reason people have a preference not to be brainwashed, or not to take drugs, even if after brainwashing they are happy about having been brainwashed, and after getting addicted they do want more drugs.)

A corrigible system would be like: "I prefer A to B at this moment, but if humans decide to fix me and make me prefer B to A, then I prefer B to A". In other words, it doesn't have values for u(A) and u(B), or it doesn't always act according to those values. A consistent system that currently prefers A to B would prefer not to be fixed.
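The no-circular-preferences condition mentioned above can be illustrated with a small sketch. It only covers the ordinal part (strict pairwise preferences), not the axioms about lotteries, and the function name `utility_from_preferences` is made up for this example. It tries to assign numeric utilities consistent with a list of "X is preferred to Y" pairs, and reports failure when the preferences contain a cycle:

```python
from graphlib import TopologicalSorter, CycleError

def utility_from_preferences(prefers):
    """prefers: iterable of (better, worse) pairs.
    Returns {option: utility} with u(better) > u(worse) for every pair,
    or None if the preferences are circular."""
    sorter = TopologicalSorter()
    for better, worse in prefers:
        sorter.add(better, worse)      # 'worse' must come earlier in the order
    try:
        order = list(sorter.static_order())   # worst option first
    except CycleError:
        return None                    # e.g. A > B > C > A: no utility function exists
    return {option: rank for rank, option in enumerate(order)}

print(utility_from_preferences([("A", "B"), ("B", "C")]))
print(utility_from_preferences([("A", "B"), ("B", "C"), ("C", "A")]))
```

The first call yields utilities with u(A) > u(B) > u(C); the second returns `None`, since no assignment of numbers can satisfy u(A) > u(B) > u(C) > u(A).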

Replies from: steve2152
comment by Steven Byrnes (steve2152) · 2023-06-17T22:58:51.954Z · LW(p) · GW(p)

I think John's 1st bullet point was referring to an argument you can find in https://www.lesswrong.com/posts/NxF5G6CJiof6cemTw/coherence-arguments-do-not-entail-goal-directed-behavior [LW · GW] and related.

comment by Vladimir_Nesov · 2023-06-16T09:16:43.802Z · LW(p) · GW(p)

A utility function represents preference elicited in a large collection of situations, each a separate choice between events that happens with incomplete information, as an event is not a particular point. This preference needs to be consistent across different situations to be representable by expected utility of a single utility function.

Once formulated, a utility function can be applied to a single choice/situation, such as a choice of a policy. But a system that only ever makes a single choice is not a natural fit for expected utility frame, and that's the kind of system that usually appears in "any system can be modeled as maximizing some utility function". So it's not enough to maximize something once, or in a narrow collection of situations, the situations the system is hypothetically exposed to need to be about as diverse as choices between any pair of events, with some of the events very large, corresponding to unreasonably incomplete information, all drawn across the same probability space.

One place this mismatch of frames happens is with updateless decision theory. An updateless decision is a choice of a single policy, once and for all, so there is no reason for it to be guided by expected utility [LW(p) · GW(p)], even though it could be. The utility function for the updateless choice of policy would then need to be obtained elsewhere, in a setting that has all these situations with separate (rather than all enacting a single policy) and mutually coherent choices under uncertainty. But once an updateless policy is settled (by a policy-level decision), actions implied by it (rather than action-level decisions in expected utility frame) no longer need to be coherent. Not being coherent, they are not representable by an action-level utility function.

So by embracing updatelessness, we lose the setting that would elicit utility if the actions were instead individual mutually coherent decisions. And conversely, by embracing coherence of action-level decisions, we get an implied policy that's not updatelessly optimal with respect to the very precise outcomes determined by any given whole policy. So an updateless agent founded on expected utility maximization implicitly references a different non-updateless agent whose preference is elicited by making separate action-level decisions under a much greater uncertainty than the policy-level alternatives the updateless agent considers.

comment by JNS (jesper-norregaard-sorensen) · 2023-06-16T06:40:35.756Z · LW(p) · GW(p)

Completely off the cuff take:

I don't think claim 1 is wrong, but it does clash with claim 2.

That means any system that has to be corrigible cannot be a system that maximizes a simple utility function (1 dimension), or put another way "whatever utility function it maximizes must be along multiple dimensions".

Which seems to be pretty much what humans do, we have really complex utility functions, and everything seems to be ever changing and we have some control over it ourselves (and sometimes that goes wrong and people end up maxing out a singular dimension at the cost of everything else).

Note to self: Think more about this and if possible write up something more coherent and explanatory.

comment by johnswentworth · 2021-11-24T00:34:50.628Z · LW(p) · GW(p)

Everybody's been talking about Paxlovid, and how ridiculous it is to both stop the trial since it's so effective but also not approve it immediately. I want to at least float an alternative hypothesis, which I don't think is very probable at this point, but does strike me as at least plausible (like, 20% probability would be my gut estimate) based on not-very-much investigation.

Early stopping is a pretty standard p-hacking technique. I start out planning to collect 100 data points, but if I manage to get a significant p-value with only 30 data points, then I just stop there. (Indeed, it looks like the Paxlovid study only had 30 actual data points, i.e. people hospitalized.) Rather than only getting "significance" if all 100 data points together are significant, I can declare "significance" if the p-value drops below the line at any time. That gives me a lot more choices in the garden of forking counterfactual paths.
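To see how much uncorrected peeking can inflate false positives, here is a small simulation sketch. Everything in it is an illustrative assumption rather than a model of any real trial: the data are standard normal with the null hypothesis true, the test is a two-sided z-test with known variance, and the experimenter checks after every data point from 10 up to 100.

```python
import math
import random

def false_positive_rates(n_trials=2000, n_max=100, n_min=10, z_crit=1.96, seed=0):
    """Return (rate with one test at n_max, rate when stopping at any crossing)."""
    rng = random.Random(seed)
    fixed_hits = 0
    peeking_hits = 0
    for _ in range(n_trials):
        running_sum = 0.0
        crossed_at_some_point = False
        for n in range(1, n_max + 1):
            running_sum += rng.gauss(0.0, 1.0)   # null is true: mean 0, sd 1
            z = running_sum / math.sqrt(n)       # z-statistic after n points
            if n >= n_min and abs(z) > z_crit:
                crossed_at_some_point = True
        if abs(z) > z_crit:                      # z here is the value at n_max
            fixed_hits += 1
        if crossed_at_some_point:
            peeking_hits += 1
    return fixed_hits / n_trials, peeking_hits / n_trials

fixed_rate, peeking_rate = false_positive_rates()
print(f"single test at n=100: {fixed_rate:.3f}")
print(f"stop at first 'significant' peek: {peeking_rate:.3f}")
```

The single-test rate should land near the nominal 5%, while the stop-at-any-crossing rate comes out several times higher. That gap is the extra room provided by the forking paths.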

Now, success rates on most clinical trials are not very high. (They vary a lot by area - most areas are about 15-25%. Cancer is far and away the worst, below 4%, and vaccines are the best, over 30%.) So I'd expect that p-hacking is a pretty large chunk of approved drugs, which means pharma companies are heavily selected for things like finding-excuses-to-halt-good-seeming-trials-early.

Replies from: gwern
comment by gwern · 2021-11-24T01:28:04.752Z · LW(p) · GW(p)

Early stopping is a pretty standard p-hacking technique.

It was stopped after a pre-planned interim analysis; that means they're calculating the stopping criteria/p-values with multiple testing correction built in, using sequential analysis.

comment by johnswentworth · 2021-03-12T16:53:19.267Z · LW(p) · GW(p)

Brief update on how it's going with RadVac.

I've been running ELISA tests all week. In the first test, I did not detect stronger binding to any of the peptides than to the control in any of several samples from myself or my girlfriend. But the control itself was looking awfully suspicious, so I ran another couple tests. Sure enough, something in my samples is binding quite strongly to the control itself (i.e. the blocking agent), which is exactly what the control is supposed to not do. So I'm going to try out some other blocking agents, and hopefully get an actually-valid control group.

(More specifics on the test: I ran a control with blocking agent + sample, and another with blocking agent + blank sample, and the blocking agent + sample gave a strong positive signal while the blank sample gave nothing. That implies something in the sample was definitely binding to both the blocking agent and the secondary antibodies used in later steps, and that binding was much stronger than the secondary antibodies themselves binding to anything in the blocking agent + blank sample.)

In other news, the RadVac team released the next version of their recipe + whitepaper. Particularly notable:

... many people who have taken the nasal vaccine are testing negative for serum antibodies with commercial and lab ELISA tests, while many who inject the vaccine (subcutaneous or intramuscular) are testing positive (saliva testing appears to be providing evidence of mucosal response among a subset of researchers who have administered the vaccine intranasally).

Note that they're talking specifically about serum (i.e. blood) antibodies here. So apparently injecting it does induce blood antibodies of the sort detectable by commercial tests (at least some of the time), but snorting it mostly just produces mucosal antibodies (also at least some of the time).

This is a significant update: most of my prior on the vaccine working was based on vague comments in the previous radvac spec about at least some people getting positive test results. But we didn't know what kind of test results those were, so there was a lot of uncertainty about exactly what "working" looked like. In particular, we didn't know whether antibodies were induced in blood or just mucus, and we didn't know if they were induced consistently or only in some people (the latter of which is the "more dakka probably helps" world). Now we know that it's mostly just mucus (at least for nasal administration). Still unsure about how consistently it works - the wording in the doc makes it sound like only some people saw a response, but I suspect the authors are just hedging because they know there's both selection effects and a lot of noise in the data which comes back to them.

The latest version of the vaccine has been updated to give it a bit more kick - slightly higher dose, and the chitosan nanoparticle formula has been changed in a way which should make the peptides more visible to the immune system. Also, the list of peptides has been trimmed down a bit, so the latest version should actually be cheaper, though the preparation is slightly more complex.

Replies from: ChristianKl
comment by ChristianKl · 2021-03-13T22:23:28.640Z · LW(p) · GW(p)

but I suspect the authors are just hedging because they know there's both selection effects and a lot of noise in the data which comes back to them.

I would expect that hedging also happens because making definitive clinical claims has more danger from the FDA than making hedged statements.

comment by johnswentworth · 2020-10-14T17:27:36.473Z · LW(p) · GW(p)

Neat problem of the week: researchers just announced roughly-room-temperature superconductivity at pressures around 270 GPa. That's stupidly high pressure - a friend tells me "they're probably breaking a diamond each time they do a measurement". That said, pressures in single-digit GPa do show up in structural problems occasionally, so achieving hundreds of GPa scalably/cheaply isn't that many orders of magnitude away from reasonable; it's just not something that there's historically been much demand for. This problem plays with one idea for generating such pressures in a mass-producible way.

Suppose we have three materials in a coaxial wire:

  • innermost material has a low thermal expansion coefficient and high Young's modulus (i.e. it's stiff)
  • middle material is a thin cylinder of our high-temp superconducting concoction
  • outermost material has a high thermal expansion coefficient and high Young's modulus.

We construct the wire at high temperature, then cool it. As the temperature drops, the innermost material stays roughly the same size (since it has low thermal expansion coefficient), while the outermost material shrinks, so the superconducting concoction is squeezed between them.

Exercises:

  • Find an expression for the resulting pressure in the superconducting concoction in terms of the Young's moduli, expansion coefficients, temperature change, and dimensions of the inner and outer materials. (Assume the width of the superconducting layer is negligible, and the outer layer doesn't break.)
  • Look up parameters for some common materials (e.g. steel, tungsten, copper, porcelain, aluminum, silicon carbide, etc), and compute the pressures they could produce with reasonable dimensions (assuming that their material properties don't change too dramatically with such high pressures).
  • Find an expression for the internal tension as a function of radial distance in the outermost layer.
  • Pick one material, look up its tensile strength, and compute how thick it would have to be to serve as the outermost layer without breaking, assuming the superconducting layer is at 270 GPa.
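For anyone who wants to check their answers numerically, here is one possible sketch. It uses the standard Lamé thick-walled-cylinder shrink-fit relations under plane stress, which is an assumption on my part (a long wire also builds up axial stress, ignored here). It also needs Poisson's ratios, which the exercises don't mention, and it treats the superconducting layer as negligibly thin. The material numbers are rough illustrative values, not vetted data.

```python
def interface_pressure(E_in, nu_in, alpha_in, E_out, nu_out, alpha_out,
                       delta_T, b, c, a=0.0):
    """Shrink-fit pressure (Pa) at the interface radius b.
    a: bore radius of the core (0 = solid), c: outer radius of the wire.
    delta_T: size of the temperature drop (K). Moduli in Pa."""
    misfit_strain = (alpha_out - alpha_in) * delta_T       # radial interference / b
    outer = ((c**2 + b**2) / (c**2 - b**2) + nu_out) / E_out
    inner = ((b**2 + a**2) / (b**2 - a**2) - nu_in) / E_in
    return misfit_strain / (outer + inner)

def hoop_stress_outer(p, b, c, r):
    """Tensile hoop stress (Pa) in the outer layer at radius r, b <= r <= c."""
    return p * b**2 / (c**2 - b**2) * (1 + c**2 / r**2)

# Assumed rough values: tungsten core, steel sheath, 1000 K of cooling.
p = interface_pressure(E_in=400e9, nu_in=0.28, alpha_in=4.5e-6,
                       E_out=200e9, nu_out=0.30, alpha_out=12e-6,
                       delta_T=1000.0, b=1e-3, c=5e-3)
print(f"interface pressure ~ {p / 1e9:.2f} GPa")
print(f"hoop stress at inner edge of sheath ~ {hoop_stress_outer(p, 1e-3, 5e-3, 1e-3) / 1e9:.2f} GPa")
```

With these assumed numbers the pressure comes out on the order of 1 GPa. Note that in this model the hoop tension at the inner edge of the outer layer is always at least as large as the interface pressure, which is worth keeping in mind for the last exercise.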
comment by johnswentworth · 2024-02-13T23:01:12.945Z · LW(p) · GW(p)

Here's an AI-driven external cognitive tool I'd like to see someone build, so I could use it.

This would be a software tool, and the user interface would have two columns. In one column, I write. Could be natural language (like google docs), or code (like a normal IDE), or latex (like overleaf), depending on what use-case the tool-designer wants to focus on. In the other column, a language and/or image model provides local annotations for each block of text. For instance, the LM's annotations might be:

  • (Natural language or math use-case:) Explanation or visualization of a mental picture generated by the main text at each paragraph
  • (Natural language use-case:) Emotional valence at each paragraph
  • (Natural language or math use-case:) Some potential objections tracked at each paragraph
  • (Code:) Fermi estimates of runtime and/or memory usage

This is the sort of stuff I need to track mentally in order to write high-quality posts/code/math, so it would potentially be very high value to externalize that cognition.

Also, the same product could potentially be made visible to readers (for the natural language/math use-cases) to make more visible the things the author intends to be mentally tracked. That, in turn, would potentially make it a lot easier for readers to follow e.g. complicated math.
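The two-column layout described above could be prototyped with something like the following sketch. All names here are hypothetical, and the annotator is a trivial placeholder standing in for a call to a language or image model. The sketch only shows the plumbing: split the text into blocks, then run each annotator on each block.

```python
from typing import Callable, Dict, List

Annotator = Callable[[str], str]   # takes one block of text, returns a note

def split_blocks(text: str) -> List[str]:
    """Split on blank lines; each paragraph is one annotated block."""
    return [block.strip() for block in text.split("\n\n") if block.strip()]

def annotate(text: str, annotators: Dict[str, Annotator]) -> List[dict]:
    """Return one row per block: the block itself plus a note per annotator."""
    rows = []
    for block in split_blocks(text):
        notes = {name: fn(block) for name, fn in annotators.items()}
        rows.append({"block": block, "notes": notes})
    return rows

def word_count_note(block: str) -> str:
    """Placeholder annotator; a real one would prompt a model with the block."""
    return f"{len(block.split())} words"

draft = "First paragraph of the post.\n\nSecond paragraph, with a claim."
for row in annotate(draft, {"length": word_count_note}):
    print(f"{row['block']!r:45} | {row['notes']}")
```

Swapping `word_count_note` for functions that ask a model for a mental picture, objections, or a Fermi runtime estimate would give the right-hand column.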

Replies from: None
comment by [deleted] · 2024-02-14T00:56:42.463Z · LW(p) · GW(p)

Can you share your prompts and if you consider the output satisfactory for some example test cases?

Replies from: johnswentworth
comment by johnswentworth · 2024-02-14T17:12:05.939Z · LW(p) · GW(p)

I haven't experimented very much, but here's one example prompt.

Please describe what you mentally picture when reading the following block of text:

"
A Shutdown Problem Proposal

First things first: this is not (yet) aimed at solving the whole corrigibility problem, or even the whole shutdown problem.

The main thing this proposal is intended to do is to get past the barriers MIRI found in their old work on the shutdown problem. In particular, in a toy problem basically-identical to the one MIRI used, we want an agent which:

Does not want to manipulate the shutdown button
Does respond to the shutdown button
Does want to make any child-agents it creates responsive-but-not-manipulative to the shutdown button, recursively (i.e. including children-of-children etc)
If I understand correctly, this is roughly the combination of features which MIRI had the most trouble achieving simultaneously.
"

This one produced basically-decent results from GPT-4.

Although I don't have the exact prompt on hand at the moment, I've also asked GPT-4 to annotate a piece of code line-by-line with a Fermi estimate of its runtime, which worked pretty well.

Replies from: None
comment by [deleted] · 2024-02-14T17:48:17.632Z · LW(p) · GW(p)

Yeah i was thinking your specs were, well

  1. Wrap gpt-4 and Gemini, columned output over a set of text, applying prompts to each section? Prototype in a weekend.

  2. Make the AI able to meaningfully contribute non obvious comments to help someone who already is an expert?

https://xkcd.com/1425/

Replies from: johnswentworth
comment by johnswentworth · 2024-02-14T18:02:47.591Z · LW(p) · GW(p)

Don't really need comments which are non-obvious to an expert. Part of what makes LLMs well-suited to building external cognitive tools is that external cognitive tools can create value by just tracking "obvious" things, thereby freeing up the user's attention/working memory for other things.

Replies from: Viliam
comment by Viliam · 2024-02-15T10:06:56.270Z · LW(p) · GW(p)

So kinda like spellcheckers (most typos you could figure out, but why spend time and attention on proofreading if the program can do that for you), but... thought-checkers.

Like, if a part of your article contradicts another part, it would be underlined.

Replies from: gwern
comment by gwern · 2024-02-15T22:55:47.273Z · LW(p) · GW(p)

I've long wanted this, but it's not clear how to do it. Long-context LLMs are still expensive and for authors who need it most, context windows are still too small: me or Yudkowsky, for example, would still exceed the context window of almost all LLMs except possibly the newest Gemini. And then you have their weak reasoning. You could try to RAG it, but embeddings are not necessarily tuned to encode logically contradictory or inconsistent claims: probably if I wrote "the sky is blue" in one place and "the sky is red" in another, a retrieval would be able to retrieve both paragraphs and a LLM point out that they are contradictory, but such blatant contradictions are probably too rare to be useful to check for. You want something more subtle, like where you say "the sky is blue" and elsewhere "I looked up from the ground and saw the color of apples". You could try to brute force it and consider every pairwise comparison of 2 reasonable sized chunks of text and ask for contradictions, but this is quadratic and will get slow and expensive and probably turn up too many false positives. (And how do you screen off false positives and mark them 'valid'?)
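To put a rough number on the quadratic blow-up of the brute-force approach: the sketch below counts how many pairwise contradiction checks would be needed for a corpus split into fixed-size chunks. The corpus size and chunk size are illustrative assumptions, and the function names are made up.

```python
from itertools import combinations

def n_chunks(n_words: int, chunk_size: int) -> int:
    """Number of chunks needed to cover n_words (last chunk may be short)."""
    return -(-n_words // chunk_size)        # ceiling division

def n_pairwise_checks(n_words: int, chunk_size: int) -> int:
    """Number of unordered chunk pairs, i.e. one model call per pair."""
    n = n_chunks(n_words, chunk_size)
    return n * (n - 1) // 2

# Sanity check of the formula against explicit enumeration on a tiny case.
assert len(list(combinations(range(5), 2))) == n_pairwise_checks(5, 1)

# Assumed: a 2-million-word corpus cut into 500-word chunks.
print(n_chunks(2_000_000, 500))             # prints: 4000
print(n_pairwise_checks(2_000_000, 500))    # prints: 7998000
```

Doubling the corpus roughly quadruples the number of calls, which is why the approach gets slow and expensive quickly.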

My general thinking these days is that these truly useful 'tools for thought' LLMs are going to require either much better & cheaper LLMs, so smart that they can provide useful assistance despite being used in a grossly unnatural way input-wise or safety-tuned to hell, or biting the bullet of finetuning/dynamic-evaluation (see my Nenex proposal).

A LLM finetuned on my corpus can hope to quickly find, with good accuracy, contradictions because it was trained to know 'the sky was blue' when I wrote that at the beginning of the corpus, and it gets confused when it hits 'the color of ____' and it gets the prediction totally wrong. And RAG on an embedding tailored to the corpus can hope to surface the contradictions because it sees the two uses are the same in the essays' context, etc. (And if you run them locally, and they don't need a large context window because of the finetuning, they will be fast and cheap, so you can more meaningfully apply the brute force approach; or you could just run multiple epochs on your data, with an auxiliary prompt asking for a general critique, which would cover contradictions. 'You say here X, but don't I recall you saying ~X back at the beginning? What gives?')

Replies from: Viliam, None
comment by Viliam · 2024-02-16T15:25:44.797Z · LW(p) · GW(p)

Perhaps you could do it in multiple steps.

Feed it a shorter text (that fits in the window) and ask it to provide a short summary focusing on factual statements. Then hopefully all short versions could fit in the window. Find the contradiction -- report the two contradicting factual statements and which section they appeared in. Locate the statement in the original text.

comment by [deleted] · 2024-02-15T23:21:15.989Z · LW(p) · GW(p)

Did you write more than 7 million words yet @gwern? https://www.google.com/amp/s/blog.google/technology/ai/google-gemini-next-generation-model-february-2024/amp/

Basically it's the "lazy wait" calculation. Get something to work now or wait until the 700k or 7m word context window ships.

Replies from: gwern
comment by gwern · 2024-02-16T00:35:38.557Z · LW(p) · GW(p)

I may have. Just gwern.net is, I think, somewhere around 2m, and it's not comprehensive. Also, for contradictions, I would want to detect contradictions against citations/references as well (detecting miscitations would be more important than self-consistency IMO), and as a rough ballpark, the current Gwern.net annotation* corpus is approaching 4.3m words, looks like, and is also not comprehensive. So, closer than one might think! (Anyway, doesn't deal with the cost or latency: as you can see in the demos, we are talking minutes, not seconds, for these million-token calls and the price is probably going to be in the dollar+ regime per call.)

* which are not fulltext. It would be nice to throw in all of the hosted paper & book & webpage fulltexts, but then that's probably more like 200m+ words.

Replies from: ryan_greenblatt
comment by ryan_greenblatt · 2024-02-16T01:20:14.137Z · LW(p) · GW(p)

minutes

There isn't any clear technical obstruction to getting this time down pretty small with more parallelism.

Replies from: gwern
comment by gwern · 2024-02-16T03:31:46.262Z · LW(p) · GW(p)

There may not be any 'clear' technical obstruction, but it has failed badly in the past. 'Add more parallelism' (particularly hierarchically) is one of the most obvious ways to improve attention, and people have spent the past 5 years failing to come up with efficient attentions that do anything but move along a Pareto frontier from 'fast but doesn't work' to 'slow and works only as well as the original dense attention'. It's just inherently difficult to know what tokens you will need across millions of tokens without input from all the other tokens (unless you are psychic), implying extensive computation of some sort, which makes things inherently serial and costs you latency, even if you are rich enough to spend compute like water. You'll note that when Claude-2 was demoing the ultra-long attention windows, it too spent a minute or two churning. While the most effective improvements in long-range attention like Flash Attention or Ring Attention are just hyperoptimizing dense attention, which is inherently limited.

comment by johnswentworth · 2021-07-27T18:55:32.511Z · LW(p) · GW(p)

[Epistemic status: highly speculative]

Smoke from California/Oregon wildfires reaching the East Coast opens up some interesting new legal/political possibilities. The smoke is way outside state borders, all the way on the other side of the country, so that puts the problem pretty squarely within federal jurisdiction. Either a federal agency could step in to force better forest management on the states, or a federal lawsuit could be brought for smoke-induced damages against California/Oregon. That would potentially make it a lot more difficult for local homeowners to block controlled burns.

comment by johnswentworth · 2021-03-02T00:34:06.248Z · LW(p) · GW(p)

I had a shortform post pointing out the recent big jump in new businesses in the US, and Gwern replied [LW(p) · GW(p)]:

How sure are you that the composition is interesting? How many of these are just quick mask-makers or sanitizer-makers, or just replacing restaurants that have now gone out of business? (ie very low-value-added companies, of the 'making fast food in a stall in a Third World country' sort of 'startup', which make essentially no or negative long-term contributions).

This was a good question in context, but I disagree with Gwern's model of where-progress-comes-from, especially in the context of small businesses.

Let's talk ice-cream cones.

As the story goes, an ice-cream vendor was next door to a waffle vendor at the 1904 World's Fair. At some point, the ice-cream vendor ran short on paper cups, and inspiration struck. He bought some thin waffles from the waffle vendor, rolled them into cones, and ice-cream cones took off.

That's just the first step. From there, the cone spread memetically. People heard about it, and either asked for cones (on the consumer side) or tried making them (on the supplier side).

Insight + Memetics -> Better Food

When I compare food today to the stuff my grandparents ate, there's no comparison. Today's dishes are head and shoulders better. Partly it's insights like ice-cream cones, partly it's memetic spread of dishes from more parts of the world (like sisig, soup dumplings, ropa vieja, chicken Karahi, ...).

Those little fast-food stalls? They're powerhouses of progress. It's a hypercompetitive market, with low barriers to entry, and lots of repeat business. The conditions are ideal for trying out new dishes, spreading culinary ideas and finding out the hard way what people like to eat. That doesn't mean they're highly profitable - culinary innovation spreads memetically, so it's hard to capture the gains. But progress is made.

Replies from: ChristianKl
comment by ChristianKl · 2021-03-02T20:39:49.875Z · LW(p) · GW(p)

The pandemic also has the effect of showing the kind of business ideas people try. It pushes a lot of innovation in food delivery. Some of the pandemic-driven innovation will become worthless once the pandemic is over, but a few good ideas likely survive and the old ideas of the businesses that went out of business are still around.

comment by johnswentworth · 2023-06-01T18:15:15.604Z · LW(p) · GW(p)

So I saw the Taxonomy Of What Magic Is Doing In Fantasy Books and Eliezer’s commentary on ASC's latest linkpost, and I have cached thoughts on the matter.

My cached thoughts start with a somewhat different question - not "what role does magic play in fantasy fiction?" (e.g. what fantasies does it fulfill), but rather... insofar as magic is a natural category, what does it denote? So I'm less interested in the relatively-expansive notion of "magic" sometimes seen in fiction (which includes e.g. alternate physics), and more interested in the pattern called "magic" which recurs among tons of real-world ancient cultures.

Claim (weakly held): the main natural category here is symbols changing the territory. Normally symbols represent the world, and changing the symbols just makes them not match the world anymore - it doesn't make the world do something different. But if the symbols are "magic", then changing the symbols changes the things they represent in the world. Canonical examples:

  • Wizard/shaman/etc draws magic symbols, speaks magic words, performs magic ritual, or even thinks magic thoughts, thereby causing something to happen in the world.
  • Messing with a voodoo doll messes with the person it represents.
  • "Sympathetic" magic, which explicitly uses symbols of things to influence those things.
  • Magic which turns emotional states into reality.

I would guess that most historical "magic" was of this type.

comment by johnswentworth · 2022-05-25T01:27:30.914Z · LW(p) · GW(p)

Weather just barely hit 80°F today, so I tried the Air Conditioner Test [LW · GW].

Three problems came up:

  • Turns out my laser thermometer is all over the map. Readings would change by 10°F if I went outside and came back in. My old-school thermometer is much more stable (and well-calibrated, based on dipping it in some ice water), but slow and caps out around 90°F (so I can't use it to measure e.g. exhaust temp). I plan to buy a bunch more old-school thermometers for the next try.
  • I thought opening the doors/windows in rooms other than the test room and setting up a fan would be enough to make the temperature in the hall outside the test room close to outdoor temp. This did not work; hall temp was around 72°F with outside around 80°F. I'll need to change that part of the experiment design; most likely I'll seal around the door and let air infiltrate exclusively from the window instead. (The AC is right next to the window, so this could screw with the results, but I don't really have a better option.)
  • In two-hose mode, the AC hit its minimum temperature of 60°F, so I'll need a hotter day. I'll try again when we hit at least 85°F.

In case anyone's wondering: in one-hose mode, the temperature in the room equilibrated around 66°F. Power consumption was near-constant throughout all conditions.

One additional Strange Observation: cool air was blowing out under the door of the test room in two-hose mode. This should not happen; my best guess is that, even though the AC has two separate intake vents, the two are not actually partitioned internally, so the fan for indoor-air was pulling in outdoor-air (causing air to blow out under the door to balance that extra inflow). Assuming that's the cause, it should be fixable with some strategically-placed cardboard inside the unit.

comment by johnswentworth · 2021-10-01T20:00:22.804Z · LW(p) · GW(p)

I've long been very suspicious of aggregate economic measures like GDP. But GDP is clearly measuring something, and whatever that something is it seems to increase remarkably smoothly despite huge technological revolutions. So I spent some time this morning reading up and playing with numbers and generally figuring out how to think about the smoothness of GDP increase.

Major takeaways:

  • When new tech makes something previously expensive very cheap, GDP mostly ignores it. (This happens in a subtle way related to how we actually compute it.)
    • Historical GDP curves mainly measure things which are expensive ~now. Things which are cheap now are mostly ignored. In other words: GDP growth basically measures the goods whose production is revolutionized the least.
  • Re: AI takeoff, the right way to extrapolate today's GDP curve to post-AI is to think about things which will still be scarce post-AI, and then imagine the growth of production of those things.
    • Even a very sharp, economically-revolutionary AI takeoff could look like slow smooth GDP growth, because GDP growth will basically only measure the things whose production is least revolutionized.
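A toy two-good example of the first takeaway. It assumes real GDP is computed by valuing every year's quantities at one base year's prices; actual statistical agencies use more elaborate chained indices, and all the numbers below are made up. "Chips" get 1000x cheaper and 1000x more plentiful; "haircuts" keep their price and merely double.

```python
# Made-up prices and quantities for two goods in an "early" and a "late" year.
prices = {"early": {"chips": 1000.0, "haircuts": 10.0},
          "late":  {"chips": 1.0,    "haircuts": 10.0}}
quantities = {"early": {"chips": 1,    "haircuts": 1000},
              "late":  {"chips": 1000, "haircuts": 2000}}


def real_gdp(year_quantities, base_prices):
    """Value one year's quantities at a fixed set of base-year prices."""
    return sum(base_prices[good] * q for good, q in year_quantities.items())


def growth_factor(base_year):
    """Late-year real GDP divided by early-year real GDP, both at base_year prices."""
    p = prices[base_year]
    return real_gdp(quantities["late"], p) / real_gdp(quantities["early"], p)


print(growth_factor("late"))   # measured in late-year (recent) prices
print(growth_factor("early"))  # measured in early-year prices
```

In recent prices the economy grows by a factor of about 2.1, barely more than the haircut sector alone; in early prices the same history shows growth of roughly 93x. Measuring in recent prices is what makes the revolutionized good nearly invisible.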

Why am I harping on about technicalities of GDP? Well, I hear about some AI forecasts which are heavily based on the outside view that economic progress (as measured by GDP) is smooth, and this is so robust historically that we should expect it to continue going forward. And I think this is basically right - GDP, as we actually compute it, is so remarkably smooth that we should expect that to continue. Alas, this doesn't tell us very much about how crazy or sharp AI takeoff will be, because GDP (as we actually compute it) systematically ignores anything that's revolutionized.

Replies from: johnswentworth, mark-xu
comment by johnswentworth · 2021-10-01T20:01:00.784Z · LW(p) · GW(p)

If you want a full post on this, upvote this comment.

Replies from: adamzerner, Raemon
comment by Adam Zerner (adamzerner) · 2021-10-01T20:59:32.308Z · LW(p) · GW(p)

In writing How much should we value life? [LW · GW], I spent some time digging into AI timeline stuff. It led me to When Will AI Be Created?, written by Luke Muehlhauser for MIRI. He noted that there is reason not to trust expert opinions on AI timelines, and that trend extrapolation may be a good alternative. This point you're making about GDP seems like it is real progress towards coming up with a good way to do trend extrapolation, and thus seems worth a full post IMO. (Assuming it isn't already well known by the community or something, which I don't get the sense is the case.)

comment by Raemon · 2021-10-01T20:07:04.408Z · LW(p) · GW(p)

Upvoted, but I mostly trust you to write the post if it seems like there's an interesting meaty thing worth saying.

Replies from: johnswentworth
comment by johnswentworth · 2021-10-01T20:30:04.043Z · LW(p) · GW(p)

Eh, these were the main takeaways, the post would just be more details and examples so people can see the gears behind it.

comment by Mark Xu (mark-xu) · 2021-10-01T22:18:17.249Z · LW(p) · GW(p)

A similar point is made by Korinek in his review of Could Advanced AI Drive Explosive Economic Growth:

My first reaction to the framing of the paper is to ask: growth in what? It’s important to keep in mind that concepts like “gross domestic product” and “world gross domestic product” were defined from an explicit anthropocentric perspective - they measure the total production of final goods within a certain time period. Final goods are what is either consumed by humans (e.g. food or human services) or what is invested into “capital goods” that last for multiple periods (e.g. a server farm) to produce consumption goods for humans.

Now imagine you are a highly intelligent AI system running on the cloud. Although the production of the server farms on which you depend enters into human GDP (as a capital good), most of the things that you absorb, for example energy, server maintenance, etc., count as “intermediate goods” in our anthropocentric accounting systems and do not contribute to human GDP. In fact, to the extent that the AI system drives up the price of scarce resources (like energy) consumed by humans, real human GDP may even decline.

As a result, it is conceivable (and, to be honest, one of the central scenarios for me personally) that an AI take-off occurs but anthropocentric GDP measures show relative stagnation in the human economy.

To make this scenario a bit more tangible, consider the following analogy: imagine a world in which there are two islands trading with each other, but the inhabitants of the islands are very different from each other - let’s call them humans and AIs. The humans sell primitive goods like oil to the AIs and their level of technology is relatively stagnant. The AIs sell amazing services to the humans, and their level of technology doubles every year. However, the AI services that humans consume make up only a relatively small part of the human consumption basket. The humans are amazed at what fantastic services they get from the AIs in exchange for their oil, and they experience improvements in their standard of living from these fantastic AI services, although they also have to pay more and more for their energy use every year, which offsets part of that benefit. The humans can only see what’s happening on their own island and develop a measure of their own well-being that they call human GDP, which increases modestly because the advances only occur in a relatively small part of their consumption basket. The AIs can see what’s going on on the AI island and develop a measure of their own well-being which they call AI GDP, and which almost doubles every year. The system can go on like this indefinitely.

For a fuller discussion of these arguments, let me refer you to my working paper on “The Rise of Artificially Intelligent Agents” (with the caveat that the paper is still a working draft).

Replies from: mark-xu
comment by Mark Xu (mark-xu) · 2021-10-01T22:20:53.996Z · LW(p) · GW(p)

In general, Baumol type effects (spending decreasing in sectors where productivity goes up), mean that we can have scenarios in which the economy is growing extremely fast on "objective" metrics like energy consumption, but GDP has stagnated because that energy is being spent on extremely marginal increases in goods being bought and sold.

comment by johnswentworth · 2021-06-25T19:58:25.752Z · LW(p) · GW(p)

Chrome is offering to translate the LessWrong homepage for me. Apparently, it is in Greek.

Replies from: habryka4
comment by habryka (habryka4) · 2021-06-25T20:00:30.969Z · LW(p) · GW(p)

Huh, amusing. We do ship a font that has nothing but the greek letter set in it, because people use greek unicode symbols all the time and our primary font doesn't support that character set. So my guess is that's where Google gets confused.

Replies from: johnswentworth
comment by johnswentworth · 2021-06-25T20:55:16.539Z · LW(p) · GW(p)

Oh, I had just assumed it was commentary on the writing style/content.

Replies from: Viliam
comment by Viliam · 2021-06-25T21:11:25.274Z · LW(p) · GW(p)

If about 10% of articles have "Ω" in their title, what is the probability that the page is in Greek? :D

comment by johnswentworth · 2020-03-01T23:37:46.814Z · LW(p) · GW(p)

Someone should write a book review of The Design of Everyday Things aimed at LW readers, so I have a canonical source to link to other than the book itself.

comment by johnswentworth · 2023-09-13T21:17:24.918Z · LW(p) · GW(p)

Does anyone know of an "algebra for Bayes nets/causal diagrams"?

More specifics: rather than using a Bayes net to define a distribution, I want to use a Bayes net to state a property which a distribution satisfies. For instance, a distribution P[X, Y, Z] satisfies the diagram X -> Y -> Z if-and-only-if the distribution factors according to
P[X, Y, Z] = P[X] P[Y|X] P[Z|Y].
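
As a concrete illustration of "a diagram as a property of a distribution", here is a small numerical check for the chain X -> Y -> Z. It is only a sketch: the helper names `marginal` and `satisfies_chain` are made up for this example, and the joint distribution is assumed to be a dict listing every outcome (x, y, z), including zero-probability ones.

```python
from itertools import product


def marginal(joint, keep):
    """Sum a joint {(x, y, z): p} down to the variable positions listed in keep."""
    out = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out


def satisfies_chain(joint, tol=1e-9):
    """True iff P[X,Y,Z] = P[X] P[Y|X] P[Z|Y] at every listed outcome."""
    p_xy = marginal(joint, (0, 1))
    p_yz = marginal(joint, (1, 2))
    p_y = marginal(joint, (1,))
    for (x, y, z), p in joint.items():
        if p_y[(y,)] == 0:
            continue  # outcome has probability zero on both sides
        # P[X] P[Y|X] P[Z|Y] rewritten as P[X,Y] * P[Y,Z] / P[Y]
        if abs(p - p_xy[(x, y)] * p_yz[(y, z)] / p_y[(y,)]) > tol:
            return False
    return True


# A genuine chain: Y is a noisy copy of X, Z is a noisy copy of Y.
chain = {(x, y, z): 0.5 * (0.9 if y == x else 0.1) * (0.8 if z == y else 0.2)
         for x, y, z in product((0, 1), repeat=3)}

# Not a chain: Z = X xor Y with X, Y independent fair coins.
xor = {(x, y, z): (0.25 if z == x ^ y else 0.0)
       for x, y, z in product((0, 1), repeat=3)}

print(satisfies_chain(chain), satisfies_chain(xor))
```

The first distribution satisfies the diagram and the second does not, since in the xor case Z still depends on X once Y is known.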

When using diagrams that way, it's natural to state a few properties in terms of diagrams, and then derive some other diagrams they imply. For instance, if a distribution P[W, X, Y, Z] satisfies all of:

  • W -> Y -> Z
  • W -> X -> Y
  • X -> (W, Y) -> Z

... then it also satisfies W -> X -> Y -> Z.
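
For reference, this particular implication can be checked by going back to the factorizations, which is exactly the step a diagram algebra would let one skip. Each diagram is read as the conditional independence it encodes:

```latex
\begin{align*}
P[W, X, Y, Z]
  &= P[W]\, P[X \mid W]\, P[Y \mid W, X]\, P[Z \mid W, X, Y]
     && \text{(chain rule)} \\
  &= P[W]\, P[X \mid W]\, P[Y \mid X]\, P[Z \mid W, X, Y]
     && \text{(from } W \to X \to Y\text{)} \\
  &= P[W]\, P[X \mid W]\, P[Y \mid X]\, P[Z \mid W, Y]
     && \text{(from } X \to (W, Y) \to Z\text{)} \\
  &= P[W]\, P[X \mid W]\, P[Y \mid X]\, P[Z \mid Y]
     && \text{(from } W \to Y \to Z\text{)}
\end{align*}
```

The last line is the factorization for W -> X -> Y -> Z.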

What I'm looking for is a set of rules for "combining diagrams" this way, without needing to go back to the underlying factorizations in order to prove things.

David and I have been doing this sort of thing a lot in our work the past few months, and it would be nice if someone else already had a nice write-up of the rules for it.

comment by johnswentworth · 2024-02-16T16:56:49.431Z · LW(p) · GW(p)

I keep seeing news outlets and the like say that SORA generates photorealistic videos, can model how things move in the real world, etc. This seems like blatant horseshit? Every single example I've seen looks like video game animation, not real-world video.

Have I just not seen the right examples, or is the hype in fact decoupled somewhat from the model's outputs?

Replies from: ryan_greenblatt
comment by ryan_greenblatt · 2024-02-16T17:26:04.412Z · LW(p) · GW(p)

I think I mildly disagree, but probably we're looking at the same examples.

I think the most impressive (in terms of realism) videos are under "Sora is able to generate complex scenes with multiple characters, ...". (Includes white SUV video and Tokyo suburbs video.)

I think all of these videos other than the octopus and paper planes are "at-a-glance" photorealistic to me.

Overall, I think SORA can do "at-a-glance" photorealistic videos and can model to some extent how things move in the real world. I don't think it can do both complex motion and photorealism in the same video. As in, the videos which are photorealistic don't really involve complex motion and the videos which involve complex motion aren't photorealistic.

(So probably some amount of hype, but also pretty real?)

Replies from: habryka4, johnswentworth
comment by habryka (habryka4) · 2024-02-16T19:51:24.689Z · LW(p) · GW(p)

Hmm, I don't buy it. These two scenes seem very much not like the kind of thing a video game engine could produce: 

Look at this frame! I think there is something very slightly off about that face, but the cat hitting the person's face and the person's reaction seem very realistic to me and IMO qualifies as "complex motion and photorealism in the same video".

Replies from: johnswentworth, ryan_greenblatt, RamblinDash
comment by johnswentworth · 2024-02-17T00:33:26.491Z · LW(p) · GW(p)

Were these supposed to embed as videos? I just see stills, and don't know where they came from.

Replies from: ryan_greenblatt
comment by ryan_greenblatt · 2024-02-17T00:34:07.307Z · LW(p) · GW(p)

These are stills from some of the videos I was referencing.

comment by ryan_greenblatt · 2024-02-16T22:07:55.926Z · LW(p) · GW(p)

TBC, I wasn't claiming anything about video game engines.

I wouldn't have called the cat one "complex motion", but I can see where you're coming from.

comment by RamblinDash · 2024-02-16T20:08:30.209Z · LW(p) · GW(p)

Yeah, I mean I guess it depends on what you mean by photorealistic. That cat has three front legs.

Replies from: gwern
comment by gwern · 2024-02-16T21:10:55.864Z · LW(p) · GW(p)

Yeah, this is the example I've been using to convince people that the game engines are almost certainly generating training data but are probably not involved at sampling time. I can't come up with any sort of hybrid architecture like 'NN controlling game-engine through API' where you get that third front leg. One of the biggest benefits of a game-engine would be ensuring exactly that wouldn't happen - body parts becoming detached and floating in mid-air and lack of conservation. If you had a game engine with a hyper-realistic cat body model in it which something external was manipulating, one of the biggest benefits is that you wouldn't have that sort of common-sense physics problem. (Meanwhile, it does look like past generative modeling of cats in its errors. Remember the ProGAN interpolation videos of CATS? Hilarious, but also an apt demonstration of how extremely hard cats are to model. They're worse than hands.)

In addition, you see plenty of classic NN tells throughout - note the people driving a 'Dandrover'...

comment by johnswentworth · 2024-02-16T17:47:27.533Z · LW(p) · GW(p)

Yeah, those were exactly the two videos which most made me think that the model was mostly trained on video game animation. In the Tokyo one, the woman's facial muscles never move at all, even when the camera zooms in on her. And in the SUV one, the dust cloud isn't realistic, but even covering that up the SUV has a Grand Theft Auto look to its motion.

"Can't do both complex motion and photorealism in the same video" is a good hypothesis to track, thanks for putting that one on my radar.

Replies from: ryan_greenblatt
comment by ryan_greenblatt · 2024-02-16T18:09:49.408Z · LW(p) · GW(p)

(Note that I was talking about the one with the train going through Tokyo suburbs.)

comment by johnswentworth · 2023-08-01T04:11:37.894Z · LW(p) · GW(p)

Putting this here for posterity: I have thought since the superconductor preprint went up, and continue to think, that the markets are putting generally too little probability on the claims being basically-true. I thought ~70% after reading the preprint the day it went up (and bought up a market on manifold to ~60% based on that, though I soon regretted not waiting for a better price), and my probability has mostly been in the 40-70% range since then.

Replies from: johnswentworth
comment by johnswentworth · 2023-08-01T05:32:15.424Z · LW(p) · GW(p)

After seeing the markets jump up in response to the latest, I think I'm more like 65-80%.

comment by johnswentworth · 2022-05-29T00:27:20.149Z · LW(p) · GW(p)

Languages should have tenses for spacelike separation. My friend and I do something in parallel, it's ambiguous/irrelevant which one comes first, I want to say something like "I expect my friend <spacelike version of will do/has done/is doing> their task in such-and-such a way". 

Replies from: JBlack, adamShimi, kave
comment by JBlack · 2022-05-29T01:24:48.646Z · LW(p) · GW(p)

That sounds more like a tenseless sentence than using a spacelike separation tense. Your friend's performance of the task may well be in your future or past lightcone (or extend through both), but you don't wish to imply any of these.

There are languages with tenseless verbs, as well as some with various types of spatial tense.

The closest I can approximate this in English without clumsy constructs is "I expect my friend does their task in such-and-such a way", which I agree isn't very satisfactory.

comment by adamShimi · 2022-05-29T08:34:31.855Z · LW(p) · GW(p)

Who would have thought that someone would ever look at CSP and think "I want English to be more like that"?

Replies from: johnswentworth
comment by kave · 2022-05-29T00:43:25.372Z · LW(p) · GW(p)

Future perfect (hey, that's the name of the show!) seems like a reasonable hack for this in English

comment by johnswentworth · 2021-10-28T18:56:15.332Z · LW(p) · GW(p)

Two kinds of cascading catastrophes one could imagine in software systems...

  1. A codebase is such a spaghetti tower (and/or coding practices so bad) that fixing a bug introduces, on average, more than one new bug. Software engineers toil away fixing bugs, making the software steadily more buggy over time.
  2. Software services managed by different groups have dependencies - A calls B, B calls C, etc. Eventually, the dependence graph becomes connected enough and loopy enough that a sufficiently-large chunk going down brings down most of the rest, and nothing can go back up until everything else goes back up (i.e. there's circular dependence/deadlock).

How could we measure how "close" we are to one of these scenarios going supercritical?

For the first, we'd need to have attribution of bugs - i.e. track which change introduced each bug. Assuming most bugs are found and attributed after some reasonable amount of time, we can then estimate how many bugs each bug fix introduces, on average.
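
A minimal sketch of that estimate, assuming an attribution log already exists. The function name and the change ids are hypothetical; the only inputs are which changes were bug fixes and which change each known bug was blamed on.

```python
def bugs_per_fix(fix_ids, bug_origins):
    """Average number of new bugs attributed to each bug fix.

    fix_ids: ids of the changes that were bug fixes.
    bug_origins: for each known bug, the id of the change blamed for introducing it.
    """
    fixes = set(fix_ids)
    if not fixes:
        raise ValueError("need at least one bug fix to estimate a ratio")
    caused_by_fixes = sum(1 for origin in bug_origins if origin in fixes)
    return caused_by_fixes / len(fixes)


# Hypothetical log: four fixes; four bugs, three of them blamed on fixes.
ratio = bugs_per_fix(
    fix_ids=["f1", "f2", "f3", "f4"],
    bug_origins=["f1", "f1", "feature9", "f3"],
)
print(ratio)  # 0.75
```

A ratio above 1 would be the supercritical regime described in scenario 1. Because bugs take time to be found and attributed, recent fixes will look cleaner than they are, so the estimate is best taken over fixes old enough for most of their bugs to have surfaced.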

(I could also imagine a similar technique for e.g. medicine: check how many new problems result from each treatment of a problem.)

For the second, we'd need visibility into codebases maintained by different groups, which would be easy within a company but much harder across companies. In principle, within a company, some kind of static analysis tool could go look for all the calls to APIs between services, map out the whole graph, and then calculate which "core" pieces could be involved in a catastrophic failure.
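
One way to make "core pieces" concrete, assuming the call graph has already been extracted: groups of services that can each reach every other one (strongly connected components with more than one member) are the ones that cannot come back up independently. The sketch below uses Kosaraju's algorithm; the service names are invented, and a service that only calls itself is not flagged.

```python
def strongly_connected_components(graph):
    """Kosaraju's algorithm. graph: {service: [services it calls]}."""
    nodes = set(graph) | {dep for deps in graph.values() for dep in deps}
    order, seen = [], set()

    def visit(node):  # first pass: record finishing order (recursive; fine for small graphs)
        seen.add(node)
        for nxt in graph.get(node, []):
            if nxt not in seen:
                visit(nxt)
        order.append(node)

    for node in sorted(nodes):
        if node not in seen:
            visit(node)

    # Second pass: walk the reversed graph in decreasing finishing order.
    reverse = {node: [] for node in nodes}
    for src, deps in graph.items():
        for dst in deps:
            reverse[dst].append(src)

    components, assigned = [], set()
    for node in reversed(order):
        if node in assigned:
            continue
        assigned.add(node)
        stack, component = [node], set()
        while stack:
            current = stack.pop()
            component.add(current)
            for prev in reverse[current]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        components.append(component)
    return components


def circular_cores(graph):
    """Groups of two or more services that all depend on each other."""
    return [c for c in strongly_connected_components(graph) if len(c) > 1]


# Hypothetical call graph: auth -> db -> config -> auth forms a loop.
calls = {"auth": ["db"], "db": ["config"], "config": ["auth"],
         "web": ["auth"], "logs": []}
print(circular_cores(calls))
```

Here the only core is the auth/db/config loop; `web` depends on it but is not part of it, so it can restart once the core is back.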

(Note that this problem could be mostly-avoided by intentionally taking down services occasionally, so engineers are forced to build around that possibility. I don't think any analogue of this approach would work for the first failure-type, though.)

comment by johnswentworth · 2021-05-21T04:03:36.818Z · LW(p) · GW(p)

I wish there were a fund roughly like the Long-Term Future Fund, but with an explicit mission of accelerating intellectual progress.

Replies from: habryka4, quinn-dougherty
comment by habryka (habryka4) · 2021-05-21T04:11:36.843Z · LW(p) · GW(p)

I mean, just to be clear, I am all in favor of intellectual progress. But doing so indiscriminately does sure seem a bit risky in this world of anthropogenic existential risks. Reminds me of my mixed feelings on the whole Progress Studies thing.

Replies from: johnswentworth
comment by johnswentworth · 2021-05-21T04:36:23.051Z · LW(p) · GW(p)

Yeah, I wouldn't want to accelerate e.g. black-box ML. I imagine the real utility of such a fund would be to experiment with ways to accelerate intellectual progress and gain understanding of the determinants, though the grant projects themselves would likely be more object-level than that. Ideally the grants would be in areas which are not themselves very risk-relevant, but complicated/poorly-understood enough to generate generalizable insights into progress.

I think it takes some pretty specific assumptions for such a thing to increase risk significantly on net. If we don't understand the determinants of intellectual progress, then we have very little ability to direct progress where we want it; it just follows whatever the local gradient is. With more understanding, at worst it follows the same gradient faster, and we end up in basically the same spot.

The one way it could net-increase risk is if the most likely path of intellectual progress leads to doom, and the best way to prevent doom is through some channel other than intellectual progress (like political action, for instance). Then accelerating the intellectual progress part potentially gives the other mechanisms (like political bodies) less time to react. Personally, though, I think a scenario in which e.g. political action successfully prevents intellectual progress from converging to doom (in a world where it otherwise would have) is vanishingly unlikely (like, less than one-in-a-hundred, maybe even less than one-in-a-thousand).

comment by Quinn (quinn-dougherty) · 2021-05-23T11:23:18.820Z · LW(p) · GW(p)

You might check out Donald Braben's view: he says "transformative research" (i.e. fundamental results that create new fields and industries) is critical for the survival of civilization. He does not worry that transformative results might end civilization.

comment by johnswentworth · 2022-10-24T20:51:36.059Z · LW(p) · GW(p)

Here's an interesting problem of embedded agency/True Names which I think would make a good practice problem: formulate what it means to "acquire" something (in the sense of "acquiring resources"), in an embedded/reductive sense. In other words, you should be able-in-principle to take some low-level world-model, and a pointer to some agenty subsystem in that world-model, and point to which things that subsystem "acquires" and when.

Some prototypical examples which an answer should be able to handle well:

  • Organisms (anything from bacteria to plant to animals) eating things, absorbing nutrients, etc.
  • Humans making money or gaining property.

Replies from: Gunnar_Zarncke
comment by Gunnar_Zarncke · 2022-10-28T12:57:08.421Z · LW(p) · GW(p)

...and how the brain figures this out and why it is motivated to do so. There are a lot of simple animals that apparently "try to control" resources or territory. How? 

Drives to control resources occur everywhere. And your control of resources is closely related to your dominance in a dominance hierarchy. Which seems to be regulated in many animals by serotonin. See e.g. https://www.nature.com/articles/s41386-022-01378-2 

comment by johnswentworth · 2020-02-27T19:04:55.439Z · LW(p) · GW(p)

What if physics equations were written like statically-typed programming languages?

Replies from: jimrandomh, steve2152
comment by jimrandomh · 2020-02-27T19:54:25.639Z · LW(p) · GW(p)

The math and physics worlds still use single-letter variable names for everything, decades after the software world realized that was extremely bad practice. This makes me pessimistic about the adoption of better notation practices.

Replies from: johnswentworth
comment by johnswentworth · 2020-02-27T23:10:11.722Z · LW(p) · GW(p)

Better? I doubt it. If physicists wrote equations the way programmers write code, a simple homework problem would easily fill ten pages.

Verboseness works for programmers because programmers rarely need to do anything more complicated with their code than run it - analogous to evaluating an expression, for a physicist or mathematician. Imagine if you needed to prove one program equivalent to another algebraically - i.e. a sequence of small transformations, with a record of intermediate programs derived along the way in order to show your work. I expect programmers subjected to such a use-case would quickly learn the virtues of brevity.

comment by Steven Byrnes (steve2152) · 2020-02-27T19:26:49.729Z · LW(p) · GW(p)

Yeah, I'm apparently not intelligent enough to do error-free physics/engineering calculations without relying on dimensional analysis as a debugging tool. I even came up with a weird, hack-y way to do that in computing environments like Excel and Cython, where flexible multiplicative types are not supported.
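
For illustration, here is one minimal way to get dimensional analysis as a debugging tool in code. It is not the hack mentioned above (which isn't described here), and it checks dimensions at runtime rather than statically; the `Quantity` class and the helper functions are made up for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quantity:
    value: float
    dims: tuple  # exponents of (metre, kilogram, second)

    def __add__(self, other):
        # Addition only makes sense between like dimensions.
        if self.dims != other.dims:
            raise TypeError(f"cannot add dimensions {self.dims} and {other.dims}")
        return Quantity(self.value + other.value, self.dims)

    def __mul__(self, other):
        # Multiplying quantities adds their dimension exponents.
        return Quantity(self.value * other.value,
                        tuple(a + b for a, b in zip(self.dims, other.dims)))

    def __truediv__(self, other):
        # Dividing quantities subtracts their dimension exponents.
        return Quantity(self.value / other.value,
                        tuple(a - b for a, b in zip(self.dims, other.dims)))


def metres(v):
    return Quantity(v, (1, 0, 0))


def kilograms(v):
    return Quantity(v, (0, 1, 0))


def seconds(v):
    return Quantity(v, (0, 0, 1))


speed = metres(10.0) / seconds(2.0)   # dims (1, 0, -1)
momentum = kilograms(3.0) * speed     # dims (1, 1, -1)
print(speed, momentum)

try:
    metres(1.0) + seconds(1.0)        # a length plus a time is a bug
except TypeError as err:
    print("caught:", err)
```

Multiplication and division carry the exponents along, so a mistake shows up as a `TypeError` at the first invalid addition instead of as a silently wrong number.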

comment by johnswentworth · 2023-04-18T20:31:44.947Z · LW(p) · GW(p)

An interesting conundrum: one of the main challenges of designing useful regulation for AI is that we don't have any cheap and robust way to distinguish a dangerous neural net from a non-dangerous net (or, more generally, a dangerous program from a non-dangerous program). This is an area where technical research could, in principle, help a lot.

The problem is, if there were some robust metric for how dangerous a net is, and that metric were widely known and recognized (as it would probably need to be in order to be used for regulatory purposes), then someone would probably train a net to maximize that metric directly.

Replies from: D0TheMath, Thane Ruthenis, Thane Ruthenis
comment by Garrett Baker (D0TheMath) · 2023-04-18T23:00:02.828Z · LW(p) · GW(p)

This seems to lead to the solution of trying to make your metric one-way, in the sense that your metric should

  1. Provide an upper-bound on the dangerousness of your network

  2. Compress the space of networks which map to approximately the same dangerousness level on the low end of dangerousness, and expand the space of networks which map to approximately the same dangerousness level on the upper end of dangerousness, so that you can train your network to minimize the metric, but when you train your network to maximize the metric you end up in a degenerate area with technically very high measured danger levels but in actuality very low levels of dangerousness.

We can hope (or possibly prove) that as you optimize upwards on the metric you become subject to Goodhart's curse, but the opposite occurs on the lower end.

comment by Thane Ruthenis · 2023-04-18T20:56:03.697Z · LW(p) · GW(p)

Sure, even seems a bit tautological: any such metric, to be robust, would need to contain in itself a definition of a dangerously-capable AI, so you probably wouldn't even need to train a model to maximize it. You'd be able to just lift the design from the metric directly.

comment by Thane Ruthenis · 2023-05-14T01:47:37.720Z · LW(p) · GW(p)

Do you have any thoughts on a softer version of this problem, where the metric can't be maximized directly, but gives a concrete idea of what sort of challenge your AI needs to beat to qualify as AGI? (And therefore in which direction in the architectural-design-space you should be moving.)

Some variation on this [LW(p) · GW(p)] seems like it might work as a "fire alarm" test set, but as you point out, inasmuch as it's recognized, it'll be misapplied for benchmarking instead.

(I suppose the ideal way to do it would be to hand it off to e.g. ARC, so they can use it if OpenAI invites them for safety-testing again. This way, SOTA models still get tested, but the actors who might misuse it aren't aware of the testing's particulars until they succeed anyway...)

comment by johnswentworth · 2021-09-28T04:46:02.895Z · LW(p) · GW(p)

I just went looking for a good reference for the Kelly criterion, and didn't find any on LessWrong. So, for anybody who's looking: chapter 6 of Cover & Thomas's textbook on information theory is the best source I currently know of.

Replies from: Yoav Ravid
comment by Yoav Ravid · 2021-09-28T04:54:05.401Z · LW(p) · GW(p)

Might be a good thing to add to the Kelly Criterion tag [? · GW]

comment by johnswentworth · 2020-10-26T23:33:28.844Z · LW(p) · GW(p)

Neat problem of the week: we have n discrete random variables, $X_1, \ldots, X_n$. Given any variable, all variables are independent:

$$\forall i: \quad P[X \mid X_i] = \prod_j P[X_j \mid X_i]$$

Characterize the distributions which satisfy this requirement.
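For anyone who wants to test candidate distributions numerically, here is a minimal sketch of a checker. It assumes the joint distribution of $X_1, \ldots, X_n$ is stored as an n-dimensional NumPy array, and it checks that, conditional on each value of each single variable, the remaining variables factorize into the product of their conditional marginals. The function name and the two example distributions are illustrative, not from the original problem.

```python
import itertools

import numpy as np


def satisfies_condition(joint, tol=1e-9):
    """joint[x_1, ..., x_n] = P[X_1 = x_1, ..., X_n = x_n]."""
    for i in range(joint.ndim):
        for v in range(joint.shape[i]):
            # Unnormalized distribution of the other variables with X_i = v.
            rest = np.take(joint, v, axis=i)
            p_v = rest.sum()
            if p_v <= tol:
                continue  # conditioning on a zero-probability value
            cond = rest / p_v
            # Build the product of the conditional marginals by broadcasting.
            product = np.ones_like(cond)
            for axis in range(cond.ndim):
                other = tuple(a for a in range(cond.ndim) if a != axis)
                marginal = cond.sum(axis=other) if other else cond
                shape = [1] * cond.ndim
                shape[axis] = cond.shape[axis]
                product = product * marginal.reshape(shape)
            if not np.allclose(cond, product, atol=tol):
                return False
    return True


# Three fully independent binary variables: satisfies the condition.
independent = np.einsum(
    "i,j,k->ijk",
    np.array([0.3, 0.7]),
    np.array([0.5, 0.5]),
    np.array([0.2, 0.8]),
)

# X_3 = X_1 XOR X_2 with fair independent X_1, X_2: given X_3,
# the other two are perfectly correlated, so the condition fails.
xor = np.zeros((2, 2, 2))
for a, b in itertools.product(range(2), repeat=2):
    xor[a, b, a ^ b] = 0.25
```

The checker only confirms or rejects a given distribution; it says nothing about what the full set of solutions looks like.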

This problem came up while working on the theorem in this post [LW · GW], and (separately) in the ideas behind this post [? · GW]. Note that those posts may contain some spoilers for the problem, though frankly my own proofs on this one just aren't very good.

comment by johnswentworth · 2020-03-30T20:47:16.835Z · LW(p) · GW(p)

For short-term, individual cost/benefit calculations around C19, it seems like uncertainty in the number of people currently infected should drop out of the calculation.

For instance: suppose I'm thinking about the risk associated with talking to a random stranger, e.g. a cashier. My estimated chance of catching C19 from this encounter will be roughly proportional to $N_{infected}$. But, assuming we already have reasonably good data on number hospitalized/died, my chances of hospitalization/death given infection will be roughly inversely proportional to $N_{infected}$. So, multiplying those two together, I'll get a number roughly independent of $N_{infected}$.
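Written out as a rough sketch, with $N_{infected}$ the number of people currently infected (the symbols $N_{pop}$ for the population size, $N_{hosp}$ for the observed number hospitalized, and $c$ for the per-encounter transmission chance given an infected stranger are introduced here for illustration, and time lags between infection and hospitalization are ignored):

$$P[\text{catch C19 from encounter}] \approx c \cdot \frac{N_{infected}}{N_{pop}}, \qquad P[\text{hospitalized} \mid \text{infected}] \approx \frac{N_{hosp}}{N_{infected}}$$

$$P[\text{hospitalized via encounter}] \approx c \cdot \frac{N_{infected}}{N_{pop}} \cdot \frac{N_{hosp}}{N_{infected}} = c \cdot \frac{N_{hosp}}{N_{pop}}$$

The uncertain quantity cancels, leaving only the comparatively well-measured $N_{hosp}$ and $N_{pop}$.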

How general is this? Does some version of it apply to long-term scenarios too (possibly accounting for herd immunity)? What short-term decisions do depend on $N_{infected}$?