Posts

How should I behave ≥14 days after my first mRNA vaccine dose but before my second dose? 2021-04-08T06:24:28.213Z

Comments

Comment by dsj on Book Review: 1948 by Benny Morris · 2023-12-03T22:14:46.759Z · LW · GW

Thank you, I appreciated this post quite a bit. There's a paucity of historical information about this conflict which isn't colored by partisan framing, and you seem to be coming from a place of skeptical, honest inquiry. I'd look forward to reading what you have to say about 1967.

Comment by dsj on Anthropic Fall 2023 Debate Progress Update · 2023-11-29T20:59:08.870Z · LW · GW

Thanks for doing this! I think a lot of people would be very interested in the debate transcripts if you posted them on GitHub or something.

Comment by dsj on Evaluating the historical value misspecification argument · 2023-10-08T16:45:32.801Z · LW · GW

Okay. I do agree that one way to frame Matthew’s main point is that MIRI thought it would be hard to specify the human value function, and an LM that understands human values and reliably tells us the truth about that understanding is such a specification, and hence falsifies that belief.

To your second question: MIRI thought we couldn’t specify the value function to do the bounded task of filling the cauldron, because any value function we could naively think of writing would, when given to an AGI (which was assumed to be a utility argmaxer), lead to all sorts of instrumentally convergent behavior, such as taking over the world to make damn sure the cauldron is really filled, since we forgot all the hidden complexity of our wish.
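As a toy illustration of that worry (my own sketch, with made-up numbers, not anything from MIRI's writing): a pure expected-utility argmaxer whose utility only scores "the cauldron is full" prefers whichever plan squeezes out a bit more probability of success, however extreme.

```python
# Toy sketch with hypothetical probabilities: utility = P(cauldron full),
# and nothing else we care about is scored at all.
plans = {
    "fill the cauldron, then stop": 0.990,
    "keep topping the cauldron up forever": 0.999,
    "seize the water supply and disable the off switch": 0.99999,
}

def expected_utility(plan):
    return plans[plan]  # only cauldron-fullness counts

print(max(plans, key=expected_utility))  # the extreme plan wins the argmax
```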

Comment by dsj on Evaluating the historical value misspecification argument · 2023-10-08T07:24:07.884Z · LW · GW

I think this reply is mostly talking past my comment.

I know that MIRI wasn't claiming we didn't know how to safely make deep learning systems, GOFAI systems, or what-have-you fill buckets of water, but my comment wasn't about those systems. I also know that MIRI wasn't issuing a water-bucket-filling challenge to capabilities researchers.

My comment was specifically about directing an AGI (which I think GPT-4 roughly is), not deep learning systems or other software generally. I *do* think MIRI was claiming we didn't know how to make AGI systems safely do mundane tasks.

I think some of Nate's qualifications are mainly about the distinction between AGI and other software, and others (such as "[i]f the system is trying to drive up the expectation of its scoring function and is smart enough to recognize that its being shut down will result in lower-scoring outcomes") mostly serve to illustrate the conceptual frame MIRI was (and largely still is) stuck in about how an AGI would work: an argmaxer over expected utility.

[Edited to add: I'm pretty sure GPT-4 is smart enough to know the consequences of its being shut down, and yet dumb enough that, if it really wanted to prevent that from one day happening, we'd know by now from various incompetent takeover attempts.]

Comment by dsj on Evaluating the historical value misspecification argument · 2023-10-08T02:21:57.443Z · LW · GW

Okay, that clears things up a bit, thanks. :) (And sorry for delayed reply. Was stuck in family functions for a couple days.)

This framing feels a bit wrong/confusing for several reasons.

  1. I guess by “lie to us” you mean act nice on the training distribution, waiting for a chance to take over the world while off distribution. I just … don’t believe GPT-4 is doing this; it seems highly implausible to me, in large part because I don’t think GPT-4 is clever enough that it could keep up the veneer until it’s ready to strike if that were the case.

  2. The term “lie to us” suggests all GPT-4 does is say things, and we don’t know how it’ll “behave” when we finally trust it and give it some ability to act. But it only “says things” in the same sense that our brain only “emits information”. GPT-4 is now hooked up to web searches, code writing, etc. But maybe I misunderstand the sense in which you think GPT-4 is lying to us?

  3. I think the old school MIRI cauldron-filling problem pertained to pretty mundane, everyday tasks. No one said at the time that they didn’t really mean that it would be hard to get an AGI to do those things, that it was just an allegory for other stuff like the strawberry problem. They really seemed to believe, and said over and over again, that we didn’t know how to direct a general-purpose AI to do bounded, simple, everyday tasks without it wanting to take over the world. So this should be a big update to people who held that view, even if there are still arguably risks about OOD behavior.

(If I’ve misunderstood your point, sorry! Please feel free to clarify and I’ll try to engage with what you actually meant.)

Comment by dsj on Evaluating the historical value misspecification argument · 2023-10-06T15:30:34.883Z · LW · GW

Hmm, you say “your claim, if I understand correctly, is that MIRI thought AI wouldn't understand human values”. I’m disagreeing with this. I think Matthew isn’t claiming that MIRI thought AI wouldn’t understand human values.

Comment by dsj on Evaluating the historical value misspecification argument · 2023-10-06T15:11:25.990Z · LW · GW

I think you’re misunderstanding the paragraph you’re quoting. I read Matthew, in that paragraph, as acknowledging the difference between the two problems, and saying that MIRI thought value specification (not value understanding) was much harder than it now looks to be.

Comment by dsj on A transcript of the TED talk by Eliezer Yudkowsky · 2023-07-16T05:04:31.808Z · LW · GW

I know this is from a bit ago now so maybe he’s changed his tune since, but I really wish he and others would stop repeating the falsehood that all international treaties are ultimately backed by force against the signatory countries. There are countless trade, emissions-reduction, and nuclear disarmament agreements which are not backed by force. I’d venture to say that the large majority of agreements are backed merely by the promise of continued good relations and tit-for-tat mutual benefit or defection.

Comment by dsj on Deep learning models might be secretly (almost) linear · 2023-04-24T21:32:04.566Z · LW · GW

A key distinction is between linearity in the weights vs. linearity in the input data.

For example, the function $f(a, b, x, y) = a \sin(x) + b \cos(y)$ is linear in the arguments $a$ and $b$ but nonlinear in the arguments $x$ and $y$, since $\sin$ and $\cos$ are nonlinear.

Similarly, we have evidence that wide neural networks $f(x; \theta)$ are (almost) linear in the parameters $\theta$, despite being nonlinear in the input data $x$ (due e.g. to nonlinear activation functions such as ReLU). So nonlinear activation functions are not a counterargument to the idea of linearity with respect to the parameters.

If this is so, then neural networks are almost a type of kernel machine, doing linear learning in a space of features which are themselves a fixed nonlinear function of the input data.
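A minimal numerical check of this distinction (my own sketch, reusing the toy function above):

```python
# f(a, b, x, y) = a*sin(x) + b*cos(y): linear in (a, b), nonlinear in (x, y).
import numpy as np

def f(a, b, x, y):
    return a * np.sin(x) + b * np.cos(y)

rng = np.random.default_rng(0)
a1, b1, a2, b2, x, y = rng.normal(size=6)

# Linearity in (a, b): additivity holds exactly.
print(np.isclose(f(a1 + a2, b1 + b2, x, y),
                 f(a1, b1, x, y) + f(a2, b2, x, y)))   # True

# Nonlinearity in (x, y): additivity fails in general.
x2, y2 = rng.normal(size=2)
print(np.isclose(f(a1, b1, x + x2, y + y2),
                 f(a1, b1, x, y) + f(a1, b1, x2, y2)))  # False (generically)
```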

Comment by dsj on Can we evaluate the "tool versus agent" AGI prediction? · 2023-04-09T00:45:01.886Z · LW · GW

The more I stare at this observation, the more it feels potentially more profound than I intended when writing it.

Consider the “cauldron-filling” task. Does anyone doubt that, with at most a few very incremental technological steps from today, one could train a multimodal, embodied large language model (“RobotGPT”), to which you could say, “please fill up the cauldron”, and it would just do it, using a reasonable amount of common sense in the process — not flooding the room, not killing anyone or going to any other extreme lengths, and stopping if asked? Isn’t this basically how ChatGPT behaves now when you ask it for most things, bringing to bear a great deal of common sense in its understanding of your request, and avoiding overly-literal interpretations which aren’t what you really want?

Compare that to Nate’s 2017 description of the fiendish difficulty of this problem:

Things go very badly for Mickey

Why would we expect a generally intelligent system executing the above program [sorta-argmaxing over probability that the cauldron is full] to start overflowing the cauldron, or otherwise to go to extreme lengths to ensure the cauldron is full?

The first difficulty is that the objective function that Mickey gave his broom left out a bunch of other terms Mickey cares about:

The second difficulty is that Mickey programmed the broom to make the expectation of its score as large as it could. “Just fill one cauldron with water” looks like a modest, limited-scope goal, but when we translate this goal into a probabilistic context, we find that optimizing it means driving up the probability of success to absurd heights.

Regarding off switches:

If the system is trying to drive up the expectation of its scoring function and is smart enough to recognize that its being shut down will result in lower-scoring outcomes, then the system's incentive is to subvert shutdown attempts.

… We need to figure out how to formally specify objective functions that don't automatically place the AI system into an adversarial context with the operators; or we need to figure out some way to have the system achieve goals without optimizing some objective function in the traditional sense.

… What we want is a way to combine two objective functions — a default function for normal operation, and a suspend function for when we want to suspend the system to disk.

… We want our method for combining the functions to satisfy three conditions: an operator should be able to switch between the functions (say, by pushing a button); the system shouldn't have any incentives to control which function is active; and if it's plausible that the system's normal operations could inadvertently compromise our ability to switch between the functions, then the system should be incentivized to keep that from happening.

So far, we haven't found any way to achieve all three goals at once.

These all seem like great arguments that we should not build and run a utility maximizer with some hand-crafted goal, and indeed RobotGPT isn’t any such thing. The contrast between this story and where we seem to be heading seems pretty stark to me. (Obviously it’s a fictional story, but Nate did say “as fictional depictions of AI go, this is pretty realistic”, and I think it does capture the spirit of much actual AI alignment research.)

Perhaps one could say that these sorts of problems only arise with superintelligent agents, not agents at ~GPT-4 level. I grant that the specific failure modes available to a system will depend on its capability level, but the story is about the difficulty of pointing a “generally intelligent system” to any common sense goal at all. If the story were basically right, GPT-4 should already have lots of “dumb-looking” failure modes today due to taking instructions too literally. But mostly it has pretty decent common sense.

Certainly, valid concerns remain about instrumental power-seeking, deceptive alignment, and so on, so I don’t say this means we should be complacent about alignment, but it should probably give us some pause that the situation is this different in practice from how it was envisioned only six years ago in the worldview represented in that story.

Comment by dsj on Can we evaluate the "tool versus agent" AGI prediction? · 2023-04-08T22:09:19.593Z · LW · GW

Though interestingly, aligning a langchainesque AI to the user’s intent seems to be (with some caveats) roughly as hard as stating that intent in plain English.
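As a rough sketch of what I mean (hypothetical code, not any particular framework's real API; `call_llm` and `run_tool` are stand-ins), the "alignment" of such an agent is largely the plain-English statement of intent placed in its prompt:

```python
# Hypothetical LangChain-style tool-using agent loop. The user's intent enters
# the system as plain English at the top of the transcript.
USER_INTENT = ("Book me the cheapest nonstop flight to Boston next Tuesday, "
               "and check with me before spending any money.")

def agent_loop(call_llm, run_tool, max_steps=10):
    transcript = [f"Your objective: {USER_INTENT}"]
    for _ in range(max_steps):
        action = call_llm("\n".join(transcript))      # model proposes the next step
        if action.startswith("FINAL:"):               # model declares it is done
            return action
        transcript.append(f"Observation: {run_tool(action)}")
    return "FINAL: step limit reached"
```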

Comment by dsj on Anthropic is further accelerating the Arms Race? · 2023-04-07T01:59:26.412Z · LW · GW

My guess is “today” was supposed to refer to some date when they were doing the investigation prior to the release of GPT-4, not the date the article was published.

Comment by dsj on Anthropic is further accelerating the Arms Race? · 2023-04-07T01:58:15.761Z · LW · GW

Got a source for this estimate?

Comment by dsj on No Summer Harvest: Why AI Development Won't Pause · 2023-04-06T05:53:59.215Z · LW · GW

Nitpick: the paper from Eloundou et al is called “GPTs are GPTs”, not “GPTs and GPTs”.

Comment by dsj on Reward is not the optimization target · 2023-04-03T18:35:38.989Z · LW · GW

Probably I should get around to reading CAIS, given that it made these points well before I did.

I found it's a pretty quick read, because the hierarchical/summary/bullet point layout allows one to skip a lot of the bits that are obvious or don't require further elaboration (which is how he endorsed reading it in this lecture).

Comment by dsj on Policy discussions follow strong contextualizing norms · 2023-04-03T15:48:36.834Z · LW · GW

We don’t know with confidence how hard alignment is, and whether something roughly like the current trajectory (even if reckless) leads to certain death if it reaches superintelligence.

There is a wide range of opinion on this subject from smart, well-informed people who have devoted themselves to studying it. We have a lot of blog posts and a small number of technical papers, all usually making important (and sometimes implicit and unexamined) theoretical assumptions which we don’t know are true, plus some empirical analysis of much weaker systems.

We do not have an established, well-tested scientific theory like we do with pathogens such as smallpox. We cannot say with confidence what is going to happen.

Comment by dsj on Policy discussions follow strong contextualizing norms · 2023-04-03T09:34:33.349Z · LW · GW

I agree that if you're absolutely certain AGI means the death of everything, then nuclear devastation is preferable.

I think the absolute certainty that AGI does mean the death of everything is extremely far from called for, and is itself a bit scandalous.

(As to whether Eliezer's policy proposal is likely to lead to nuclear devastation, my bottom line view is it's too vague to have an opinion. But I think he should have consulted with actual AI policy experts and developed a detailed proposal with them, which he could then point to, before writing up an emotional appeal, with vague references to air strikes and nuclear conflict, for millions of lay people to read in TIME Magazine.)

Comment by dsj on Policy discussions follow strong contextualizing norms · 2023-04-03T07:30:25.616Z · LW · GW

One's credibility would be less of course, but Eliezer is not the one who would be implementing the hypothetical policy (that would be various governments), so it's not his credibility that's relevant here.

I don't have much sense he's holding back his real views on the matter.

Comment by dsj on Policy discussions follow strong contextualizing norms · 2023-04-03T06:37:00.274Z · LW · GW

Did you intend to say risk off, or risk of?

If the former, then I don't understand your comment and maybe a rewording would help me.

If the latter, then I'll just reiterate that I'm referring to Eliezer's explicitly stated willingness to trade off the actuality of (not just some risk of) nuclear devastation to prevent the creation of AGI (though again, to be clear, I am not claiming he advocated a nuclear first strike). The only potential uncertainty in that tradeoff is the consequences of AGI (though I think Eliezer's been clear that he thinks it means certain doom), and I suppose what follows after nuclear devastation as well.

Comment by dsj on Policy discussions follow strong contextualizing norms · 2023-04-03T05:20:38.025Z · LW · GW

Right, but of course the absolute, certain implication from “AGI is created” to “all biological life on Earth is eaten by nanotechnology made by an unaligned AI that has worthless goals” requires some amount of justification, and that justification for this level of certainty is completely missing.

In general, such confidently made predictions about the technological future have a poor historical track record. There are multiple holes in the Eliezer/MIRI story, and there is no formal, canonical write-up of why they’re so confident in their apparently secret knowledge. There’s a lot of informal, non-canonical, nontechnical material (List of Lethalities, security mindset, etc.) gesturing at the ideas, but it leaves too many holes and potential objections to support their claimed level of confidence, and they haven’t published anything formal since 2021, and very little since 2017.

We need more than that if we’re going to confidently prefer nuclear devastation over AGI.

Comment by dsj on Policy discussions follow strong contextualizing norms · 2023-04-02T22:34:01.544Z · LW · GW

There’s a big difference between pre-committing to X so you have a credible threat against Y, vs. just outright preferring X over Y. In the quoted comment, Eliezer seems to have been doing the latter.

Comment by dsj on Policy discussions follow strong contextualizing norms · 2023-04-02T15:05:07.390Z · LW · GW

I don’t agree billions dead is the only realistic outcome of his proposal. Plausibly it could just result in actually stopping large training runs. But I think he’s too willing to risk billions dead to achieve that.

Comment by dsj on Policy discussions follow strong contextualizing norms · 2023-04-02T08:53:45.791Z · LW · GW

In response to the question,

“[Y]ou’ve gestured at nuclear risk. … How many people are allowed to die to prevent AGI?”,

he wrote:

“There should be enough survivors on Earth in close contact to form a viable reproductive population, with room to spare, and they should have a sustainable food supply. So long as that's true, there's still a chance of reaching the stars someday.”

He later deleted that tweet because he worried it would be interpreted by some as advocating a nuclear first strike.

I’ve seen no evidence that he is advocating a nuclear first strike, but it does seem to me to be a fair reading of that tweet that he would trade nuclear devastation for preventing AGI.

Comment by dsj on Pausing AI Developments Isn't Enough. We Need to Shut it All Down by Eliezer Yudkowsky · 2023-04-02T04:59:06.411Z · LW · GW

If there were a game-theoretically reliable way to get everyone to pause all together, I'd support it.

Comment by dsj on AI and Evolution · 2023-03-31T09:12:02.865Z · LW · GW

In §3.1–3.3, you look at the main known ways that altruism between humans has evolved — direct and indirect reciprocity, as well as kin and group selection[1] — and ask whether we expect such altruism from AI towards humans to be similarly adaptive.

However, as observed in R. Joyce (2007). The Evolution of Morality (p. 5),

Evolutionary psychology does not claim that observable human behavior is adaptive, but rather that it is produced by psychological mechanisms that are adaptations. The output of an adaptation need not be adaptive.

This is a subtle distinction which demands careful inspection.

In particular, are there circumstances under which AI training procedures and/or market or evolutionary incentives may produce psychological mechanisms which lead to altruistic behavior towards human beings, even when that altruistic behavior is not adaptive? For example, could altruism learned towards human beings early on, when humans have something to offer in return, be “sticky” later on (perhaps via durable, self-perpetuating power structures), when humans have nothing useful to offer? Or could learned altruism towards other AIs be broadly-scoped enough that it applies to humans as well, just as human altruistic tendencies sometimes apply to animal species which can offer us no plausible reciprocal gain? This latter case is analogous to the situation analyzed in your paper, and yet somehow a different result has (sometimes) occurred in reality than that predicted by your analysis.[2]

I don’t claim the conclusion is wrong, but I think a closer look at this subtlety would give the arguments for it more force.

  1. ^

    Although you don’t look at network reciprocity / spatial selection.

  2. ^

    Even factory farming, which might seem like a counterexample, is not really. For the very existence of humans altruistically motivated to eliminate it — and who have a real shot at success — demands explanation under your analysis.

Comment by dsj on Reward is not the optimization target · 2023-03-31T04:45:22.778Z · LW · GW

A similar point is (briefly) made in K. E. Drexler (2019). Reframing Superintelligence: Comprehensive AI Services as General Intelligence, §18 “Reinforcement learning systems are not equivalent to reward-seeking agents”:

Reward-seeking reinforcement-learning agents can in some instances serve as models of utility-maximizing, self-modifying agents, but in current practice, RL systems are typically distinct from the agents they produce … In multi-task RL systems, for example, RL “rewards” serve not as sources of value to agents, but as signals that guide training[.]

And an additional point which calls into question the view of RL-produced agents as the product of one big training run (whose reward specification we better get right on the first try), as opposed to the product of an R&D feedback loop with reward as one non-static component:

RL systems per se are not reward-seekers (instead, they provide rewards), but are instead running instances of algorithms that can be seen as evolving in competition with others, with implementations subject to variation and selection by developers. Thus, in current RL practice, developers, RL systems, and agents have distinct purposes and roles.

RL algorithms have improved over time, not in response to RL rewards, but through research and development. If we adopt an agent-like perspective, RL algorithms can be viewed as competing in an evolutionary process where success or failure (being retained, modified, discarded, or published) depends on developers’ approval (not “reward”), which will consider not only current performance, but also assessed novelty and promise.
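A minimal sketch of that distinction (my own toy example, not from Drexler's report): in a REINFORCE-style training loop, reward scales parameter updates during training, and the deployed policy contains no reward term at all.

```python
# Reward is consumed by the training process; the resulting policy is just
# parameters and never queries or "seeks" reward at deployment time.
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(2)                     # policy parameters for a 2-armed bandit
true_payoffs = np.array([0.2, 0.8])      # environment; hidden from the policy

def policy(logits):
    p = np.exp(logits - logits.max())
    return p / p.sum()

for _ in range(2000):                    # REINFORCE-style training loop
    p = policy(logits)
    a = rng.choice(2, p=p)
    r = float(rng.random() < true_payoffs[a])  # reward appears only here
    grad = -p
    grad[a] += 1.0                       # d log pi(a) / d logits
    logits += 0.1 * r * grad             # reward scales the update, nothing more

print(policy(logits))                    # deployed policy: just numbers, favoring arm 1
```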

Comment by dsj on "Dangers of AI and the End of Human Civilization" Yudkowsky on Lex Fridman · 2023-03-31T01:49:15.927Z · LW · GW

So the the way that you are like taking what is probably basically the same architecture in GPT-3 and throwing 20 times as much compute at it, probably, and getting out GPT-4.

Indeed, GPT-3 is almost exactly the same architecture as GPT-2, and only a little different from GPT.

Comment by dsj on Pausing AI Developments Isn't Enough. We Need to Shut it All Down by Eliezer Yudkowsky · 2023-03-30T11:23:19.231Z · LW · GW

X-risks tend to be more complicated beasts than lions in bushes, in that successfully avoiding them requires a lot more than reflexive action: we’re not going to navigate them by avoiding carefully understanding them.

Comment by dsj on Pausing AI Developments Isn't Enough. We Need to Shut it All Down by Eliezer Yudkowsky · 2023-03-30T03:51:27.589Z · LW · GW

Thanks, I appreciate the spirit with which you've approached the conversation. It's an emotional topic for people I guess.

Comment by dsj on Pausing AI Developments Isn't Enough. We Need to Shut it All Down by Eliezer Yudkowsky · 2023-03-30T03:04:28.837Z · LW · GW

The negation of the claim would not be "There is definitely nothing to worry about re AI x-risk." It would be something much more mundane-sounding, like "It's not the case that if we go ahead with building AGI soon, we all die."

I debated with myself whether to present the hypothetical that way. I chose not to, because of Eliezer's recent history of extremely confident statements on the subject. I grant that the statement I quoted in isolation could be interpreted more mundanely, like the example you give here.

When the stakes are this high and the policy proposals are such as in this article, I think clarity about how confident you are isn't optional. I would also take issue with the mundanely phrased version of the negation.

(For context, I'm working full-time on AI x-risk, so if I were going to apply a double-standard, it wouldn't be in favor of people with a tendency to dismiss it as a concern.)

Comment by dsj on Pausing AI Developments Isn't Enough. We Need to Shut it All Down by Eliezer Yudkowsky · 2023-03-30T02:54:00.129Z · LW · GW

Would you say the same thing about the negations of that claim? If you saw e.g. various tech companies and politicians talking about how they're going to build AGI and then [something that implies that people will still be alive afterwards] would you call them out and say they need to qualify their claim with uncertainty or else they are being unreasonable?

Yes, I do in fact say the same thing to professions of absolute certainty that there is nothing to worry about re: AI x-risk.

Comment by dsj on Pausing AI Developments Isn't Enough. We Need to Shut it All Down by Eliezer Yudkowsky · 2023-03-30T02:05:02.738Z · LW · GW

There simply don't exist arguments with the level of rigor needed to justify a claim such as this one without any accompanying uncertainty:

If we go ahead on this everyone will die, including children who did not choose this and did not do anything wrong.

I think this passage, meanwhile, rather misrepresents the situation to a typical reader:

When the insider conversation is about the grief of seeing your daughter lose her first tooth, and thinking she’s not going to get a chance to grow up, I believe we are past the point of playing political chess about a six-month moratorium.

This isn't "the insider conversation". It's (the partner of) one particular insider, who exists on the absolute extreme end of what insiders think, especially if we restrict ourselves to those actively engaged with research in the last several years. A typical reader could easily come away from that passage thinking otherwise.

Comment by dsj on FLI open letter: Pause giant AI experiments · 2023-03-29T21:23:43.093Z · LW · GW

A somewhat reliable source has told me that they don't have the compute infrastructure to support making a more advanced model available to users.

That might also reflect limited engineering efforts to optimize state-of-the-art models for real world usage (think of the performance gains from GPT-3.5 Turbo) as opposed to hitting benchmarks for a paper to be published.

Comment by dsj on FLI open letter: Pause giant AI experiments · 2023-03-29T21:18:12.382Z · LW · GW

I believe Anthropic is committed to not pushing at the state-of-the-art, so they may not be the most relevant player in discussions of race dynamics.

Comment by dsj on FLI open letter: Pause giant AI experiments · 2023-03-29T19:41:28.909Z · LW · GW

Yes, although the chat interface was necessary but insufficient. They also needed a capable language model behind it, which OpenAI already had, and Google still lacks months later.

Comment by dsj on FLI open letter: Pause giant AI experiments · 2023-03-29T18:08:19.003Z · LW · GW

I agree that those are possibilities.

On the other hand, why did news reports[1] suggest that Google was caught flat-footed by ChatGPT and re-oriented to rush Bard to market?

My sense is that Google/DeepMind's lethargy in the area of language models is due to a combination of a few factors:

  1. They've diversified their bets to include things like protein folding, fusion plasma control, etc. which are more application-driven and not on an AGI path.
  2. They've focused more on fundamental research and less on productizing and scaling.
  3. Their language model experts might have a somewhat high annual attrition rate.
    1. I just looked up the authors on Google Brain's Attention is All You Need, and all but one have left Google after 5.25 years, many for startups, and one for OpenAI. That works out to an annual attrition rate of 33% (annualized as in the sketch after this list, as are the rates below).
    2. For DeepMind's Chinchilla paper, 6 of 22 researchers have been lost in 1 year: 4 to OpenAI and 2 to startups. That's 27% annual attrition.
    3. By contrast, 16 or 17 of the 30 authors on the GPT-3 paper seem to still be at OpenAI, 2.75 years later, which works out to 20% annual attrition. Notably, of those who have left, not a one has left for Google or DeepMind, though interestingly, 8 have left for Anthropic. (Admittedly, this somewhat reflects the relative newness and growth rates of Google/DeepMind, OpenAI, and Anthropic, since a priori we expect more migration from slow-growing orgs to fast-growing orgs than vice versa.)
  4. It's broadly reported that Google as an organization struggles with stifling bureaucracy and a lack of urgency. (This was also my observation working there more than ten years ago, and I expect it's gotten worse since.)
  1. ^

    e.g. this from the New York Times
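Returning to the attrition numbers in point 3 above: a quick sketch of the arithmetic behind the annualized figures, assuming a constant per-year departure rate (my own back-of-envelope, not from the original sources):

```python
# Annualized attrition from "k of n authors remain after t years".
def annual_attrition(remaining, total, years):
    retention = remaining / total
    return 1 - retention ** (1 / years)

print(annual_attrition(1, 8, 5.25))        # "Attention is All You Need": ~0.33
print(annual_attrition(22 - 6, 22, 1.0))   # Chinchilla: ~0.27
print(annual_attrition(16.5, 30, 2.75))    # GPT-3 (16 or 17 of 30 remain): ~0.20
```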

Comment by dsj on FLI open letter: Pause giant AI experiments · 2023-03-29T07:38:53.062Z · LW · GW

If a major fraction of all resources at the top 5–10 labs were reallocated to "us[ing] this pause to jointly develop and implement a set of shared safety protocols", that seems like it would be a good thing to me.

However, the letter offers no guidance as to what fraction of resources to dedicate to this joint safety work. Thus, we can expect that DeepMind and others might each devote a couple teams to that effort, but probably not substantially halt progress at their capabilities frontier.

The only player who is effectively being asked to halt progress at its capabilities frontier is OpenAI, and that seems dangerous to me for the reasons I stated above.

Comment by dsj on FLI open letter: Pause giant AI experiments · 2023-03-29T07:16:56.008Z · LW · GW

Currently, OpenAI has a clear lead over its competitors.[1] This is arguably the safest arrangement as far as race dynamics go, because it gives OpenAI some breathing room in case they ever need to slow down later on for safety reasons, and also because their competitors don't necessarily have a strong reason to think they can easily sprint to catch up.

So far as I can tell, this petition would just be asking OpenAI to burn six months of that lead and let other players catch up. That might create a very dangerous race dynamic, where now you have multiple players neck-and-neck, each with a credible claim to have a chance to get into the lead.

(And I'll add: while OpenAI has certainly made decisions I disagree with, at least they actively acknowledge existential safety concerns and have a safety plan and research agenda. I'd much rather they be in the lead than Meta, Baidu, the Chinese government, etc., all of whom to my knowledge have almost no active safety research and in some cases are actively dismissive of the need for such.)

  1. ^

    One might have considered Google/DeepMind to be OpenAI's peer, but after the release of Bard — which is substantially behind GPT-3.5 capabilities, never mind GPT-4 — I think this is a hard view to hold.

Comment by dsj on Geoffrey Hinton - Full "not inconceivable" quote · 2023-03-28T07:59:07.858Z · LW · GW

Another interesting section:

Silva-Braga: Are we close to the computers coming up with their own ideas for improving themselves?

Hinton: Uhm, yes, we might be.

Silva-Braga: And then it could just go fast?

Hinton: That's an issue, right. We have to think hard about how to control that.

Silva-Braga: Yeah. Can we?

Hinton: We don't know. We haven't been there yet, but we can try.

Silva-Braga: Okay. That seems kind of concerning.

Hinton: Uhm, yes.

Comment by dsj on Metaculus Predicts Weak AGI in 2 Years and AGI in 10 · 2023-03-26T12:38:35.541Z · LW · GW

Of course the choice of what sort of model we fit to our data can sometimes preordain the conclusion.

Another way to interpret this is there was a very steep update made by the community in early 2022, and since then it’s been relatively flat, or perhaps trending down slowly with a lot of noise (whereas before the update it was trending up slowly).

Comment by dsj on Metaculus Predicts Weak AGI in 2 Years and AGI in 10 · 2023-03-26T02:56:15.386Z · LW · GW

Seems to me there's too much noise to pinpoint the break at a specific month. There are some predictions made in early 2022 with an even later date than those made in late 2021.

But one pivotal thing around that time might have been the chain-of-thought stuff which started to come to attention then (even though there was some stuff floating around Twitter earlier).

Comment by dsj on Microsoft Research Paper Claims Sparks of Artificial Intelligence in GPT-4 · 2023-03-24T20:34:16.669Z · LW · GW

It's a terribly organized and presented proof, but I think it's basically right (although it's skipping over some algebraic details, which is common in proofs). To spell it out:

Fix any  and . We then have,

.

Adding  to both sides,

.

Therefore, if (by assumption in that line of the proof)  and , we'd have,

,

which contradicts our assumption that .

Comment by dsj on Empirical risk minimization is fundamentally confused · 2023-03-23T19:17:26.407Z · LW · GW

Thanks! This is clearer. (To be pedantic, the $L^2$ distance should have a square root around the integral, but it's clear what you mean.)
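For reference, writing out the intended distance (the standard definition):

```latex
d_{L^2}(f, g) \;=\; \left( \int \lvert f(x) - g(x) \rvert^2 \, dx \right)^{1/2}
```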

Comment by dsj on Sparks of Artificial General Intelligence: Early experiments with GPT-4 | Microsoft Research · 2023-03-23T07:32:40.796Z · LW · GW

Perhaps of interest to this community is GPT-4 using a Linux terminal to iteratively problem-solve locating and infiltrating a poorly-secured machine on a local network:

Comment by dsj on Empirical risk minimization is fundamentally confused · 2023-03-23T00:51:58.492Z · LW · GW

That's because the naive inner product suggested by the risk is non-informative,

Hmm, how is this an inner product? I note that it lacks, among other properties, positive definiteness:

Edit: I guess you mean a distance metric induced by an inner product (similar to the examples later on, where you have distance metrics induced by a norm), not an actual inner product? I'm confused by the use of standard inner product notation if that's the intended meaning. Also, in this case, this doesn't seem to be a valid distance metric either, as it lacks symmetry and non-negativity. So I think I'm still confused as to what this is saying.
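(For reference, positive definiteness here is the standard requirement that)

```latex
\langle f, f \rangle \ge 0 \ \text{ for all } f, \qquad \text{and} \qquad \langle f, f \rangle = 0 \iff f = 0 .
```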

Comment by dsj on A tension between two prosaic alignment subgoals · 2023-03-20T02:36:01.274Z · LW · GW

I think this is overcomplicating things.

We don't have to solve any deep philosophical problems here finding the one true pointer to "society's values", or figuring out how to analogize society to an individual.

We just have to recognize that the vast majority of us really don't want a single rogue to be able to destroy everything we've all built, and we can act pragmatically to make that less likely.

Comment by dsj on A tension between two prosaic alignment subgoals · 2023-03-19T15:54:01.755Z · LW · GW

Well, admittedly “alignment to humanity as a whole” is open to interpretation. But would you rather everyone have their own personal superintelligence that they can brainwash to do whatever they want?

Comment by dsj on A tension between two prosaic alignment subgoals · 2023-03-19T15:20:24.520Z · LW · GW

I agree this is a problem, and furthermore it’s a subcase of a more general problem: alignment to the user’s intent may frequently entail misalignment with humanity as a whole.

Comment by dsj on A shot at the diamond-alignment problem · 2023-03-12T22:43:26.836Z · LW · GW

I think that was supposed to be answered by this line:

After each task completion, the agent gets to be near some diamond and receives reward.

Comment by dsj on The shard theory of human values · 2023-03-10T07:26:52.040Z · LW · GW

Ever since the discovery that the mammalian dopamine system implements temporal difference learning of reward prediction error, a longstanding question for those seeking a satisfying computational account of subjective experience has been: what is the relationship between happiness and reward (or reward prediction error)? Are they the same thing?

Or if not, is there some other natural correspondence between our intuitive notion of “being happy” and some identifiable computational entity in a reinforcement learning agent?

A simple reflection shows that happiness is not identical to reward prediction error: If I’m on a long, tiring journey of predictable duration, I still find relief at the moment I reach my destination. This is true even for journeys I’ve taken many times before, so that there can be little question that my unconscious has had opportunity to learn the predicted arrival time, and this isn’t just a matter of my conscious predictions getting ahead of my unconscious ones.

On the other hand, I also gain happiness from learning, well before I arrive, that traffic on my route has dissipated. So there does seem to be some amount of satisfaction gained just from learning new information, even prior to “cashing it in”. Hence, happiness is not identical to simple reward either.
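For concreteness, the reward prediction error I have in mind is the standard TD(0) error term; here is a generic sketch (mine, not a claim about the exact dopamine-system formulation):

```python
# TD(0): the reward prediction error (RPE) is the delta term, distinct from r itself.
import numpy as np

n_states, gamma, alpha = 5, 0.95, 0.1
V = np.zeros(n_states)                      # learned value estimates

def td_update(V, s, r, s_next):
    rpe = r + gamma * V[s_next] - V[s]      # reward prediction error
    V[s] += alpha * rpe                     # value moves toward the TD target
    return rpe

# A fully predicted reward yields zero RPE even though r > 0:
V[3], V[4] = 1.0, 0.0
print(td_update(V, s=3, r=1.0, s_next=4))   # 0.0: the reward was expected, no "surprise"
```

This is the sense in which relief at a fully predictable arrival is puzzling if happiness were just reward prediction error.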

Perhaps shard theory can offer a straightforward answer here: happiness (respectively suffering) is when a realized feature of the agent’s world model corresponds to something that a shard which is currently active values (respectively devalues).

If this is correct, then happiness, like value, is not a primitive concept like reward (or reward prediction error), but instead relies on at least having a proto-world model.

It also explains the experience some have had, achieved through the use of meditation or other deliberate effort, of bodily pain without attendant suffering. They are presumably finding ways to activate shards that simply do not place negative value on pain.

Finally: happiness is then not a unidimensional, inter-comparable thing, but instead each type is to an extent sui generis. This comports with my intuition: I have no real scale on which I can weigh the pleasure of an orgasm against the delight of mathematical discovery.