Posts

Public Opinion on AI Safety: AIMS 2023 and 2021 Summary 2023-09-25T18:55:41.532Z
AGI goal space is big, but narrowing might not be as hard as it seems. 2023-04-12T19:03:26.701Z
Natural language alignment 2023-04-12T19:02:23.162Z
Key Questions for Digital Minds 2023-03-22T17:13:49.270Z
Meta AI announces Cicero: Human-Level Diplomacy play (with dialogue) 2022-11-22T16:50:20.054Z
Unpacking "Shard Theory" as Hunch, Question, Theory, and Insight 2022-11-16T13:54:15.850Z

Comments

Comment by Jacy Reese Anthis (Jacy Reese) on Announcing Apollo Research · 2023-05-31T16:24:09.166Z · LW · GW

This is a very exciting project! I'm particularly glad to see two features: (i) the focus on "deception", which undergirds much existential risk but has arguably been less of a focal point than "agency", "optimization", "inner misalignment", and other related concepts, (ii) the ability to widen the bottleneck of upskilling novice AI safety researchers who have, say, 500 hours of experience through the AI Safety Fundamentals course but need mentorship and support to make their own meaningful research contributions.

Comment by Jacy Reese Anthis (Jacy Reese) on Sentience matters · 2023-05-31T15:06:28.298Z · LW · GW

Thanks for writing this, Nate. This topic is central to our research at Sentience Institute, e.g., "Properly including AIs in the moral circle could improve human-AI relations, reduce human-AI conflict, and reduce the likelihood of human extinction from rogue AI. Moral circle expansion to include the interests of digital minds could facilitate better relations between a nascent AGI and its creators, such that the AGI is more likely to follow instructions and the various optimizers involved in AGI-building are more likely to be aligned with each other. Empirically and theoretically, it seems very challenging to robustly align systems that have an exclusionary relationship such as oppression, abuse, cruelty, or slavery." From Key Questions for Digital Minds.

Comment by Jacy Reese Anthis (Jacy Reese) on "Dangers of AI and the End of Human Civilization" Yudkowsky on Lex Fridman · 2023-03-31T13:47:22.010Z · LW · GW

I disagree with Eliezer Yudkowsky on a lot, but one thing I can say for his credibility is that in possible futures where he's right, nobody will be around to laud his correctness, and in possible futures where he's wrong, it will arguably be very clear how wrong his views were. Even if he has a big ego (as Lex Fridman suggested), this is a good reason to view his position as sincere and—dare I say it—selfless.

Comment by Jacy Reese Anthis (Jacy Reese) on Shutting Down the Lightcone Offices · 2023-03-15T00:53:32.318Z · LW · GW

In particular, I wonder if many people who won't read through a post about offices and logistics would notice and find compelling a standalone post with Oliver's 2nd message and Ben's "broader ecosystem" list—analogous to AGI Ruin: A List of Lethalities. I know related points have been made elsewhere, but I think 95-Theses-style lists have a certain punch.

Comment by Jacy Reese Anthis (Jacy Reese) on You're not a simulation, 'cause you're hallucinating · 2023-02-22T09:36:00.824Z · LW · GW

I like these examples, but can't we still view ChatGPT as a simulator—just a simulator of "Spock in a world where 'The giant that came to tea' is a real movie" instead of "Spock in a world where 'The giant that came to tea' is not a real movie"? You're already posing that Spock, a fictional character, exists, so it's not clear to me that one of these worlds is the right one in any privileged sense.

On the other hand, maybe the world with only one fiction is more intuitive to researchers, so the simulators frame does mislead in practice even if it can be rescued. Personally, I think reframing is possible in essentially all cases, which supports the approach of drawing on frames (next-token predictors, simulators, agents, oracles, genies) selectively as inspirational and explanatory tools while unpacking them any time we get into substantive analysis.

Comment by Jacy Reese Anthis (Jacy Reese) on AGI in sight: our look at the game board · 2023-02-20T13:17:51.564Z · LW · GW

Yes.

Comment by Jacy Reese Anthis (Jacy Reese) on AGI in sight: our look at the game board · 2023-02-19T09:45:21.520Z · LW · GW

+1. While I will also respect the request to not state them in the comments, I would bet that you could sample 10 ICML/NeurIPS/ICLR/AISTATS authors and learn about >10 well-defined, not entirely overlapping obstacles of this sort.

We don’t have any obstacle left in mind that we don’t expect to get overcome in more than 6 months after efforts are invested to take it down.

I don't want people to skim this post and get the impression that this is a common view in ML.

Comment by Jacy Reese Anthis (Jacy Reese) on Unpacking "Shard Theory" as Hunch, Question, Theory, and Insight · 2022-12-16T18:30:31.868Z · LW · GW

Interesting! I'm not sure what you're saying here. Which of those two things (the two subscripted senses of shard theory you distinguish) is shard theory written without a subscript? If the former, then the OP seems accurate. If the latter, or if shard theory without a subscript includes both of those two things, then I misread your view and will edit the post to note that this comment supersedes (my reading of) your previous statement.

Comment by Jacy Reese Anthis (Jacy Reese) on Meta AI announces Cicero: Human-Level Diplomacy play (with dialogue) · 2022-11-22T18:18:52.084Z · LW · GW

Had you seen the researcher explanation for the March 2022 "AI suggested 40,000 new possible chemical weapons in just six hours" paper? I quote (paywall):

Our drug discovery company received an invitation to contribute a presentation on how AI technologies for drug discovery could potentially be misused.

Risk of misuse

The thought had never previously struck us. We were vaguely aware of security concerns around work with pathogens or toxic chemicals, but that did not relate to us; we primarily operate in a virtual setting. Our work is rooted in building machine learning models for therapeutic and toxic targets to better assist in the design of new molecules for drug discovery. We have spent decades using computers and AI to improve human health—not to degrade it. We were naive in thinking about the potential misuse of our trade, as our aim had always been to avoid molecular features that could interfere with the many different classes of proteins essential to human life. Even our projects on Ebola and neurotoxins, which could have sparked thoughts about the potential negative implications of our machine learning models, had not set our alarm bells ringing.

Our company—Collaborations Pharmaceuticals, Inc.—had recently published computational machine learning models for toxicity prediction in different areas, and, in developing our presentation to the Spiez meeting, we opted to explore how AI could be used to design toxic molecules. It was a thought exercise we had not considered before that ultimately evolved into a computational proof of concept for making biochemical weapons.

Comment by Jacy Reese Anthis (Jacy Reese) on Meta AI announces Cicero: Human-Level Diplomacy play (with dialogue) · 2022-11-22T16:55:11.100Z · LW · GW

Five days ago, AI safety YouTuber Rob Miles posted on Twitter, "Can we all agree to not train AI to superhuman levels at Full Press Diplomacy? Can we please just not?"

Comment by Jacy Reese Anthis (Jacy Reese) on Unpacking "Shard Theory" as Hunch, Question, Theory, and Insight · 2022-11-18T11:46:43.306Z · LW · GW

Hm, the meaning of "begging the question" is probably just a verbal dispute, but I don't think asking questions can in general beg the question, because questions don't have conclusions. There is no "assuming its conclusion is true" if there is no conclusion. Not a big deal though!

I wouldn't say values are independent (i.e., orthogonal) at best; they are often highly correlated, such as the values of "have enjoyable experiences" and "satisfy hunger" both leading to eating tasty meals. I agree they are often contradictory, and this is one valid model of catastrophic addiction or of milder problems. I think any rigorous theory of "values" (shard theory or otherwise) will need to make sense of those phenomena, but I don't see that as an issue for the claim "ensure alignment with its values," because I don't think alignment requires complete satisfaction of every value, which is almost always impossible.

Comment by Jacy Reese Anthis (Jacy Reese) on Unpacking "Shard Theory" as Hunch, Question, Theory, and Insight · 2022-11-17T14:13:47.208Z · LW · GW

Thanks for the comment. I take "beg the question" to mean "assumes its conclusion," but it seems like you just mean Point 2 assumes something you disagree with, which is fair. I can see reasonable definitions of aligned and misaligned in which brains would fall into either category. For example, insofar as our values are a certain sort of evolutionary goal (e.g., valuing reproduction), human brains exhibit misaligned mesa-optimization like craving sugar. If sugar craving itself is the value, then arguably we're well-aligned.

In terms of synthesizing an illusion, what exactly would make it illusory? If the synthesis (i.e., combination of the various shards and associated data) is leading to brains going about their business in a not-catastrophic way (e.g., not being constantly insane or paralyzed), then that seems to meet the bar for alignment that many, particularly agent foundations proponents, favor. See, for example, Nate's recent post:

Unfortunately, the current frontier for alignment research is “can we figure out how to point AGI at anything?”. By far the most likely outcome is that we screw up alignment and destroy ourselves.

The example I like is just getting an AI to fill a container of water, which human brains are able to do, but which the sorcerer's apprentice Mickey Mouse in Fantasia was not able to do! So that's a basic sense in which brains are aligned, but again I'm not sure how exactly you would differentiate alignment with its values from synthesis of an illusion.

Comment by Jacy Reese Anthis (Jacy Reese) on The Big Picture Of Alignment (Talk Part 1) · 2022-06-06T00:11:30.126Z · LW · GW

Very interesting! Zack's two questions were also the top two questions that came to mind for me. I'm not sure if you got around to writing this up in more detail, John, but I'll jot down the way I tentatively view this differently. Of course I've given this vastly less thought than you have, so take it with many grains of salt.

On "If this is so hard, how do humans and other agents arguably do it so easily all the time?", how meaningful is the notion of extra parameters if most agents are able to find uses for any parameters, even just through redundancy or error-correction (e.g., in case one base pair changes through exaptation or useless mutation)? In alignment, why assume that all aligned AIs "look like they work"? Why assume that these are binaries? Etc. In general, there seem to be many realistic additions to your model that mitigate this exponential-increase-in-possibilities challenge and seem to more closely fit real-world agents who are successful. I don't see as many such additions that would make the optimization even more challenging.

On generators, why should we carve such a clear and small circle around genes as the generators? Rob mentioned the common thought experiment of alien worlds in which genes produce babies who grow up in isolation from human civilization, and I would push on that further. Even on Earth, we have Stone Age values versus modern values, and if you draw the line more widely (either by calling more things generators or including non-generators), this notion of "generators of human values" starts to seem very narrow and much less meaningful for alignment or a general understanding of agency, which I think most people would say requires learning more values than what is in our genes. I don't think "feed an AI data" gets around this: AIs already have easy access to genes and to humans of all ages. There is an advantage to telling the AI "these are the genes that matter," but could it really just take those genes or their mapping onto some value space and raise virtual value-children in a useful way? How do they know they aren't leaving out the important differentiators between Stone Age and modern values, genetic or otherwise? How would they adjudicate between all the variation in values from all of these sources? How could we map them onto trade-offs suitable for coherence conditions? Etc.

Comment by Jacy Reese Anthis (Jacy Reese) on The Big Picture Of Alignment (Talk Part 2) · 2022-06-06T00:09:55.009Z · LW · GW

This is great. One question it raises for me is: Why is there a common assumption in AI safety that values are existent (i.e., they exist) and latent (i.e., not directly observable) phenomena? I don't think those are unreasonable partial definitions of "values," but they're far from the only ones, and it's not at all obvious that they pick out the values with which we want to align AI. Philosophers Iason Gabriel (2020) and Patrick Butlin (2021) have pointed out some of the many definitions of "values" that we could use for AI safety.

I understand that just picking an operationalization and sticking to it may be necessary for some technical research, but I worry that the gloss reifies these particular criteria and may even reify semantic issues (e.g., Which latent phenomena do we want to describe as "values"?; a sort of verbal dispute a la Chalmers) incorrectly as substantive issues (e.g., How do we align an AI with the true values?).

Comment by Jacy Reese Anthis (Jacy Reese) on Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc · 2022-06-04T20:27:01.270Z · LW · GW

I strongly agree. There are two claims here. The weak one is that, if you hold complexity constant, directed acyclic graphs (DAGs; Bayes nets or otherwise) are not necessarily any more interpretable than conventional NNs because NNs are DAGs at that level. I don't think anyone who understands this claim would disagree with it.

But that is not the argument being put forth by Pearl/Marcus/etc. and arguably contested by LeCun/etc.; they claim that in practice (i.e., not holding anything constant), DAG-inspired or symbolic/hybrid AI approaches like Neural Causal Models offer interpretability gains without much if any drop in performance, and arguably better performance on the tasks that matter most. For example, they point to the 2021 NetHack Challenge, based on a difficult roguelike video game in which non-NN performance still exceeds NN performance.

Of course there's not really a general answer here, only specific answers to specific questions like, "Will a NN or non-NN model win the 2024 NetHack challenge?"
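To make the weak claim concrete, here is a minimal sketch (layer sizes and node names are arbitrary, chosen only for illustration) showing that the computation graph of a small dense feedforward network is itself a DAG:

```python
import networkx as nx

layer_sizes = [3, 4, 2]  # input, hidden, output; arbitrary sizes for illustration
G = nx.DiGraph()

# One node per unit, one directed edge per weight, exactly as in a dense feedforward NN.
for layer, size in enumerate(layer_sizes):
    for i in range(size):
        G.add_node(f"L{layer}_n{i}", layer=layer)

for layer in range(len(layer_sizes) - 1):
    for i in range(layer_sizes[layer]):
        for j in range(layer_sizes[layer + 1]):
            G.add_edge(f"L{layer}_n{i}", f"L{layer + 1}_n{j}")

print(nx.is_directed_acyclic_graph(G))           # True: the NN's computation graph is a DAG
print(G.number_of_nodes(), G.number_of_edges())  # 9 nodes, 20 edges
```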

Comment by Jacy Reese Anthis (Jacy Reese) on For every choice of AGI difficulty, conditioning on gradual take-off implies shorter timelines. · 2022-04-21T13:18:17.572Z · LW · GW

I don't think the "actual claim" is necessarily true. You need more assumptions than a fixed difficulty of AGI, assumptions that I don't think everyone would agree with. I walk through two examples in my comment: one that implies "Gradual take-off implies shorter timelines" and one that implies "Gradual take-off implies longer timelines."

Comment by Jacy Reese Anthis (Jacy Reese) on For every choice of AGI difficulty, conditioning on gradual take-off implies shorter timelines. · 2022-04-21T12:52:18.565Z · LW · GW

I agree with this post that the accelerative forces of gradual take-off (e.g., "economic value... more funding... freeing up people to work on AI...") are important and not everyone considers them when thinking through timelines.

However, I think the specific argument that "Gradual take-off implies shorter timelines" requires a prior belief that not everyone shares, such as a prior that an AGI of difficulty D will occur in the same year in both timelines. I don't think such a prior is implied by "conditioned on a given level of “AGI difficulty”". Here are two example priors, one that leads to "Gradual take-off implies shorter timelines" and one that leads to the opposite. The first sentence of each is most important.

Gradual take-off implies shorter timelines
Step 1: (Prior) Set AGI of difficulty D to occur at the same year Y in the gradual and sudden take-off timelines.
Step 2: Notice that the gradual take-off timeline has AIs of difficulties like 0.5D sooner, which would make AGI occur sooner than Y because of the accelerative forces of "economic value... more funding... freeing up people to work on AI..." etc. Therefore, move AGI occurrence in gradual take-off from Y to some year before Y, such as 0.5Y.

=> AGI occurs at 0.5Y in the gradual timeline and Y in the sudden timeline.

Gradual take-off implies longer timelines
Step 1: (Prior) Set AI of difficulty 0.5D to occur at the same year Y in the gradual and sudden take-off timelines. To fill in AGI of difficulty D in each timeline, suppose that both are superlinear but sudden AGI arrives at exactly Y and gradual AGI arrives at 1.5Y.
Step 2: Notice that the gradual take-off timeline has AIs of difficulties like 0.25D sooner, which would make AGI occur sooner than Y because of the accelerative forces of "economic value... more funding... freeing up people to work on AI..." etc. Therefore, move 0.5D AI occurrence in gradual take-off from Y to some year before Y, such as Y/2, and move AGI occurrence in gradual take-off correspondingly from 1.5Y to 1.25Y.

=> AGI occurs at 1.25Y in the gradual timeline and Y in the sudden timeline.
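For concreteness, here is a toy numerical sketch of the two priors above; Y is an arbitrary placeholder horizon, and the shifts simply mirror the 0.5Y and 0.25Y adjustments described in the two scenarios:

```python
Y = 20.0  # years until the anchored capability level; purely illustrative

# Prior 1: anchor AGI (difficulty D) at year Y in both timelines, then pull
# gradual-take-off AGI earlier because 0.5D systems arrive sooner and accelerate progress.
sudden_agi_1 = Y
gradual_agi_1 = 0.5 * Y

# Prior 2: anchor 0.5D AI at year Y in both timelines; sudden AGI also lands at Y,
# while gradual AGI initially lands at 1.5Y and is then pulled in to 1.25Y.
sudden_agi_2 = Y
gradual_agi_2 = 1.25 * Y

print(f"Prior 1: gradual {gradual_agi_1:.1f}y vs sudden {sudden_agi_1:.1f}y -> gradual is shorter")
print(f"Prior 2: gradual {gradual_agi_2:.1f}y vs sudden {sudden_agi_2:.1f}y -> gradual is longer")
```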

By the way, this is separate from Stefan_Schubert's critique that very short timelines are possible with sudden take-off but not with gradual take-off. I personally think that can be considered a counterexample if we treat the impossibility of a very short gradual timeline as "long," but not really a counterexample if we instead consider the shortness comparison indeterminate because there are no very short gradual timelines to compare.

Comment by Jacy Reese Anthis (Jacy Reese) on 12 interesting things I learned studying the discovery of nature's laws · 2022-03-10T16:34:44.939Z · LW · GW

I think another important part of Pearl's journey was that during his transition from Bayesian networks to causal inference, he was very frustrated with the correlational turn in early-1900s statistics. Because causality is so philosophically fraught and often intractable, statisticians shifted to regressions and other acausal models. Pearl sees that as throwing out the baby (important causal questions and answers) with the bathwater (messy empirics and the lack of a mathematical language for causality, which is why he later introduced the do-operator).

Pearl discusses this at length in The Book of Why, particularly the Chapter 2 sections on "Galton and the Abandoned Quest" and "Pearson: The Wrath of the Zealot." My guess is that Pearl's frustration with statisticians' focus on correlation was immediate upon getting to know the field, but I don't think he's publicly said how his frustration began.

Comment by Jacy Reese Anthis (Jacy Reese) on Redwood Research’s current project · 2022-03-10T16:20:43.310Z · LW · GW

This model was produced by fine-tuning DeBERTa XL on a dataset produced by contractors labeling a bunch of LM-generated completions to snippets of fanfiction that were selected by various heuristics to have a high probability of being completed violently.

I think you might have better performance if you train your own DeBERTa XL-like model with classification of different snippets as a secondary objective alongside masked token prediction, rather than just fine-tuning with that classification after the initial model training. (You might use different snippets in each step to avoid double-dipping the information in that sample, analogous to splitting text data for causal inference, e.g., Egami et al 2018.) The Hugging Face DeBERTa XL might not contain the features that would be most useful for the follow-up task of nonviolence fine-tuning. However, that might be a less interesting exercise if you want to build tools for working with more naturalistic models.
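To illustrate what I mean by making the snippet classification a secondary objective during training rather than a separate fine-tuning stage, here is a rough sketch of a joint loss; the checkpoint name, loss weight, and head shapes are illustrative assumptions on my part, not Redwood's actual setup.

```python
import torch.nn as nn
from transformers import AutoModel

# Placeholder for whichever DeBERTa XL variant is actually used.
backbone_name = "microsoft/deberta-xlarge"
backbone = AutoModel.from_pretrained(backbone_name)
hidden = backbone.config.hidden_size
vocab = backbone.config.vocab_size

mlm_head = nn.Linear(hidden, vocab)  # masked-token prediction head
cls_head = nn.Linear(hidden, 2)      # violent vs. nonviolent completion head
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)
cls_weight = 0.5                     # arbitrary weighting between the two objectives

def joint_loss(batch):
    """batch: input_ids, attention_mask, mlm_labels (-100 where unmasked), violence_label."""
    hidden_states = backbone(
        input_ids=batch["input_ids"], attention_mask=batch["attention_mask"]
    ).last_hidden_state
    mlm_logits = mlm_head(hidden_states)        # (batch, seq, vocab)
    cls_logits = cls_head(hidden_states[:, 0])  # first-token pooling
    mlm_loss = loss_fn(mlm_logits.reshape(-1, vocab), batch["mlm_labels"].reshape(-1))
    cls_loss = loss_fn(cls_logits, batch["violence_label"])
    return mlm_loss + cls_weight * cls_loss
```

Training the two objectives together lets the backbone keep features relevant to the violence classification, rather than hoping they survive generic masked-token pretraining.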

Comment by Jacy Reese Anthis (Jacy Reese) on Value extrapolation, concept extrapolation, model splintering · 2022-03-10T16:12:02.092Z · LW · GW

I appreciate making these notions more precise. Model splintering seems closely related to other popular notions in ML, particularly underspecification ("many predictors f that a pipeline could return with similar predictive risk"), the Rashomon effect ("many different explanations exist for the same phenomenon"), and predictive multiplicity ("the ability of a prediction problem to admit competing models with conflicting predictions"), as well as more general notions of generalizability and out-of-sample or out-of-domain performance. I'd be curious what exactly makes model splintering different. Some example questions: Is the difference just the alignment context? Is it that "splintering" refers specifically to features and concepts within the model failing to generalize, rather than the model as a whole failing to generalize? If so, what does it even mean for the model as a whole to fail to generalize but not features failing to generalize? Is it that the aggregation of features is not a feature? And how are features and concepts different from each other, if they are?

Comment by Jacy Reese Anthis (Jacy Reese) on Preliminary thoughts on moral weight · 2018-08-15T12:41:02.413Z · LW · GW

I think most thinkers on this topic wouldn't think of those weights as arbitrary (I know you and I do, as hardcore moral anti-realists), and they wouldn't find it prohibitively difficult to introduce those weights into the calculations. Not sure if you agree with me there.

I do agree with you that you can't do moral weight calculations without those weights, assuming you are weighing moral theories and not just empirical likelihoods of mental capacities.

I should also note that I do think intertheoretic comparisons become an issue in other cases of moral uncertainty, such as with infinite values (e.g. a moral framework that absolutely prohibits lying). But those cases seem much harder than moral weights between sentient beings under utilitarianism.

Comment by Jacy Reese Anthis (Jacy Reese) on Preliminary thoughts on moral weight · 2018-08-15T11:48:28.228Z · LW · GW

I don't think the two-envelopes problem is as fatal to moral weight calculations as you suggest (e.g. "this doesn't actually work"). The two-envelopes problem isn't a mathematical impossibility; it's just an interesting example of mathematical sleight-of-hand.

Brian's discussion of two-envelopes is just to point out that moral weight calculations require a common scale across different utility functions (e.g. the decision to fix the moral weight of a human at 1 whether you're using brain size, all-animals-are-equal, unity-weighting, or any other weighting approach). It's not to say that there's a philosophical or mathematical impossibility in doing these calculations, as far as I understand.
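As a toy numerical illustration of the common-scale point (the theories, credences, and weights are entirely made up), consider how a human-insect comparison swings depending on which weight you fix at 1:

```python
# Theory A ("neuron count"): one insect is worth 1e-5 of a human.
# Theory B ("all sentient beings are equal"): one insect is worth 1.0 of a human.
# Give each theory 50% credence and compare two normalizations.
credence_a, credence_b = 0.5, 0.5

# Normalization 1: fix a human at 1 under both theories, average the insect's weight.
insect_if_human_fixed = credence_a * 1e-5 + credence_b * 1.0   # ~0.5 humans per insect

# Normalization 2: fix an insect at 1 under both theories, average the human's weight.
human_if_insect_fixed = credence_a * 1e5 + credence_b * 1.0    # ~50,000 insects per human
insect_if_insect_fixed = 1.0 / human_if_insect_fixed           # ~2e-5 humans per insect

print(insect_if_human_fixed)   # ~0.50
print(insect_if_insect_fixed)  # ~0.00002 -- a ~25,000x swing from the choice of scale
```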

FYI I discussed this a little with Brian before commenting, and he subsequently edited his post a little, though I'm not yet sure if we're in agreement on the topic.