How do CFAR's research interests/priorities compare with LW's Open Problems in Human Rationality? Based on Brienne and Anna's replies here, I suspect the answer is "they're pretty different", but I'd like to hear what accounts for this divergence.
Nitpick: "transfer learning" is the standard term, no? It has a Wiki page and seems to get a more coherent batch of search results than googling "robustness to data shift".
Whoops, mea culpa on that one! Deleted and changed to:
the main post there pointed out that seemingly anything can be trivially modeled as being a "utility maximizer" (further discussion here), whereas only some intelligent agents can be described as being "goal-directed" (as defined in this post), and the latter is a more useful concept for reasoning about AI safety.
In reasoning about AGI, we're all aware of the problems with anthropomorphizing, but it occurs to me that there's also a cluster of bad reasoning that comes from an (almost?) opposite direction, where you visualize an AGI to be a mechanical automaton and draw naive conclusions based on that.
For instance, every now and then I've heard someone from this community say something like:
What if the AGI runs on the ZFC axioms (among other things), and finds a contradiction, and by the principle of explosion it goes completely haywire?
Even if ZFC is inconsistent, this hardly seems like a legitimate concern. There's no reason to hard-code ZFC into an AI unless we want a narrow AI that's just a theorem prover (e.g. Logic Theorist). Anything close to AGI will necessarily build rich world models, and from the standpoint of these, ZFC wouldn't literally be everything. ZFC would just be a sometimes-useful tool it discovers for organizing its mathematical thinking, which in turn is just a means toward understanding physics etc. better, much as humans wouldn't go crazy if ZFC yields a contradiction.
The general fallacy I'm pointing to isn't just "AGI will be logic-based" but something more like "AGI will act like a machine, an automaton, or a giant look-up table". This is technically true, in the same way humans can be perfectly described as a giant look-up table, but it's just the wrong level of abstraction for thinking about agents (most of the time) and can lead one to silly conclusions if one isn't really careful.
For instance, my (2nd hand, half-baked, and lazy) understanding of Penrose's arguments is as follows: Godel's theorems say formal systems can't do X, humans can do X, therefore human brains can't be fully described as formal systems (or maybe he references Turing machines and the halting problem, but the point is still similar). Note that this makes sense as stated; the catch is that
"the human brain when broken down all the way to a Turing machine" is what the Godel/Turing stuff applies to, not "the human brain at the level of abstraction we use to think about it (in terms of 'thoughts', 'concepts', etc.)". It's not at all clear that the latter even resembles a formal system, at least not one rich enough that the Godel/Turing results apply. The fact that it's "built out of" the former means nothing on this point: the proofs of PA > 10 characters do not constitute a formal system, and fleshing out the "built out of" probably requires solving a large chunk of neuroscience.
Again, I'm just using straw-Penrose here as an example because, while we all agree it's an invalid argument, this is mostly because it concludes something LW overwhelmingly agrees is false. When taken at face value, it "looks right" and the actual error isn't completely obvious to find and spell out (hence I've left it in a black spoiler box). I claim that if the argument draws a conclusion that isn't obviously wrong or even reinforces your existing viewpoint, then it's relatively easy to think it makes sense. I think this is what's going on when people here make arguments for AGI dangers that appeal to its potential brittleness or automata-like nature (I'm not saying this is common, but I do see it occasionally).
But there's a subtlety here, because there are some ways in which AGI potentially will be more brittle due to its mathematical formulation. For instance, adversarial examples are a real concern, and those are pretty much only possible because of the way ML systems output numerical probabilities (from these the adversary can infer the gradient of the model's beliefs, and run along it).
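To make the parenthetical concrete, here is a minimal sketch of how an adversary with query access alone might recover a gradient from output probabilities and walk along it. Everything in it is an illustrative assumption rather than a description of any real system: the toy logistic model, the names `predict_proba`, `estimate_gradient` and `attack`, and the step sizes are all made up for the example.

```python
import math

# Hypothetical stand-in for a deployed classifier: a fixed logistic model.
# The attacker code below never reads WEIGHTS or BIAS; it only sees probabilities.
WEIGHTS = [1.5, -2.0, 0.5]
BIAS = 0.1


def predict_proba(x):
    """Return the model's probability that input x belongs to class 1."""
    z = sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS
    return 1.0 / (1.0 + math.exp(-z))


def estimate_gradient(query, x, eps=1e-4):
    """Estimate d(probability)/dx with central finite differences, using queries only."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        up[i] += eps
        down = list(x)
        down[i] -= eps
        grad.append((query(up) - query(down)) / (2 * eps))
    return grad


def attack(query, x, step=0.05, max_steps=200):
    """Step against the sign of the estimated gradient until the class-1 probability drops below 0.5."""
    x = list(x)
    for _ in range(max_steps):
        if query(x) < 0.5:
            break
        grad = estimate_gradient(query, x)
        x = [xi - step * math.copysign(1.0, g) for xi, g in zip(x, grad)]
    return x


original = [1.0, 0.0, 1.0]
adversarial = attack(predict_proba, original)
print(predict_proba(original), predict_proba(adversarial))
```

The attacker never touches the model's parameters; the numerical output is enough to tell which direction to push each coordinate. A model that returned only a hard label would give this particular procedure nothing to differentiate.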
And of course, as I said at the start, an opposing fallacy is thinking AGI will be more human-like by default. To be clear I think the fallacy I'm gesturing at here is the less dangerous one in the worst case, but more common on LW (i.e. > 0).
[copying from my comment on the EA Forum x-post]
For reference, some other lists of AI safety problems that can be tackled by non-AI people:
Luke Muehlhauser's big (but somewhat old) list: "How to study superintelligence strategy"
AI Impacts has made several lists of research problems
Wei Dai's "Problems in AI Alignment that philosophers could potentially contribute to"
Kaj Sotala's case for the relevance of psychology/cog sci to AI safety (I would add that Ought is currently testing the feasibility of IDA/Debate by doing psychological research)
*begins drafting longer proposal*
Yeah, this is definitely more high-risk, high-reward than the others, and the fact that there are potentially some very substantial spillover effects if successful makes me both excited and nervous about the concept. I'm thinking of Arbital as an example of "trying to solve way too many problems at once", so I want to manage expectations and just try to make some exercises that inspire people to think about the art of mathematizing certain fuzzy philosophical concepts. (Running title is "Formalization Exercises", but I'm not sure if there's a better pithy name that captures it.)
In any case, I appreciate the feedback, Mr. Entworth.
(8)
In light of the “Fixed Points” critique, a set of exercises that seem more useful/reflective of MIRI’s research than those exercises. What I have in mind is taking some of the classic success stories of formalized philosophy (e.g. Turing machines, Kolmogorov complexity, Shannon information, Pearlian causality, etc., but this could also be done for reflective oracles and logical induction), introducing the problems they were meant to solve, and giving some stepping stones that guide one to have the intuitions and thoughts that (presumably) had to be developed to make the finished product. I get that this will be hard, but I think this can be feasibly done for some of the (mostly easier) concepts, and if done really well, it could even be a better way for people to learn those concepts than actually reading about them.
(7)
A critique of MIRI’s “Fixed Points” paradigm, expanding on some points I made on MIRIxDiscord a while ago (which would take a full post to properly articulate). Main issue is, I'm unsure if it's still guiding anyone's research and/or who outside MIRI would care.
(6)
An analysis of what kinds of differential progress we can expect from stronger ML. Actually, I don’t feel like writing this post, but I just don’t understand why Dai and Christiano, respectively, are particularly concerned about differential progress on the polynomial hierarchy and what’s easy-to-measure vs. hard-to-measure. My gut reaction is “maybe, but why privilege that axis of differential progress of all things”, and I can’t resolve that in my mind without doing a comprehensive analysis of potential “differential progresses” that ML could precipitate. Which, argh, sounds like an exhausting task, but someone should do it?
(5)
A skeptical take on Part I of “What failure looks like” (3 objections, to summarize briefly: not much evidence so far, not much precedent historically, and “why this, of all the possible axes of differential progress?”) [Unsure if these objections will stand up if written out more fully]
(4)
A post discussing my confusions about Goodhart and Garrabrant’s taxonomy of it. I find myself not completely satisfied with it:
1) “adversarial” seems too broad to be that useful as a category
2) It doesn’t clarify what phenomenon is meant by “Goodhart”; in particular, “regressional” doesn’t feel like something the original law was talking about, and any natural definition of “Goodhart” that includes it seems really broad
3) Whereas “regressional” and “extremal” (and perhaps “causal”) are defined statistically, “adversarial” is defined in terms of agents, and this may have downsides (I’m less sure about this objection)
But I’m also not sure how I’d reclassify it and that task seems hard. Which partially updates me in favor of the Taxonomy being good, but at the very least I feel there’s more to say about it.
(3)
“When and why should we be worried about robustness to distributional shift?”: When reading that section of Concrete Problems, there’s a temptation to just say “this isn’t relevant long-term, since an AGI by definition would have solved that problem”. But adversarial examples and the human safety problems (to the extent we worry about them) both say that in some circumstances we don’t expect this to be solved by default. I’d like to think more about when the naïve “AGI will be smart” intuition applies and when it breaks.
(2)
[I probably need a better term for this] “Wide-open-source game theory”: Where other agents can not only simulate you, but also figure out "why" you made a given decision. There’s a Standard Objection to this: it’s unfair to compare algorithms in environments where they are judged not only by their actions, but on arbitrary features of their code; to which I say, this isn’t an arbitrary feature. I was thinking about this in the context of how, even if an AGI makes the right decision, we care “why” it did so (i.e. because it’s optimizing for what we want vs. optimizing for human approval for instrumental reasons). I doubt we’ll formalize this “why” anytime soon (see e.g. section 5 of this), but I think semi-formal things can be said about it upon some effort. [I thought of this independently from (1), but I think every level of the “transparency hierarchy” could have its own kind of game theory, much like the “open-source” level clearly does]
(1)
A classification of some of the vulnerabilities/issues we might expect AGIs to face because they are potentially open-source, and generally more “transparent” to potential adversaries. For instance, they could face adversarial examples, open-source game theory problems, Dutch books, or weird threats that humans don’t have to deal with. Also, there’s a spectrum from “extreme black box” to “extreme white box” with quite a few plausible milestones along the way, that makes for a certain transparency hierarchy, and it may be helpful to analyze this (or at least take a stab at formulating it).
Upvote this comment (and downvote the others as appropriate) if most of the other ideas don’t seem that fruitful.
By default, I’d mostly take this as a signal of “my time would be better spent working on someone else’s agenda or existing problems that people have posed”, but I suppose other alternatives exist; if so, comment below.
I have a bunch of half-baked ideas, most of which are mediocre in expectation and probably not worth investing my time and others’ attention writing up. Some of them probably are decent, but I’m not sure which ones, and the user base is probably as good as any for feedback.
So I’m just going to post them all as replies to this comment. Upvote if they seem promising, downvote if not. Comments encouraged. I reserve the “right” to maintain my inside view, but I wouldn’t make this poll if I didn’t put substantial weight on this community’s opinions.
This question also has a negative answer, as witnessed by the example of an ant colony --- agent-like behavior without agent-like architecture, produced by a "non-agenty" optimization process of evolution. Nonetheless, a general version of the question remains: If some X exhibits agent-like behavior, does it follow that there exists some interesting physical structure causally upstream of X?
Neat example! But for my part, I'm confused about this last sentence, even after reading the footnote:
An example of such "interesting physical structure" would be an implementation of an optimization architecture.
For one thing, I'm not sure I have much intuition about what is meant by "optimization architecture". For instance, I would not know how to begin answering the question:
Does optimization behavior imply optimization architecture?
And I have even less of a clue what is intended by "interesting physical structure" (perhaps facetiously, any process that causes agent-like behavior to arise sounds "interesting" for that reason alone).
In your ant colony example, is evolution the "interesting physical structure", and if so, how is it a physical structure?
For reference, LeCun discussed his atheoretic/experimentalist views in more depth in this FB debate with Ali Rahimi and also this lecture. But maybe we should distinguish some distinct axes of the experimentalist/theorist divide in DL:
1) Experimentalism/theorism is a more appropriate paradigm for thinking about AI safety
2) Experimentalism/theorism is a more appropriate paradigm for making progress in AI capabilities
Where the LeCun/Russell debate is about (1) and LeCun/Rahimi is about (2). And maybe this is oversimplifying things, since "theorism" may be an overly broad way of describing Russell/Rahimi's views on safety/capabilities, but I suspect LeCun is "seeing the same ghost", or in his words (to Rahimi), seeing the same:
kind of attitude that lead the ML community to abandon neural nets for over 10 years, *despite* ample empirical evidence that they worked very well in many situations.
And whether or not Rahimi should be lumped into that "kind of attitude", I think LeCun is right (from a certain perspective) to want to push back against that attitude.
I'd even go further: given that LeCun has been more successful than Rahimi/Russell in AI research this century, all else equal I would weight the former's intuitions on research progress more. (I think the best counterargument is that while experimentalism might be better in the short-term, theorism has better payoff in the long-term, but I'm not sure about this.)
In fact, one of my major fears is that LeCun is right about this, because even if he is right about (2), I don't think that's good evidence he's right about (1) since these seem pretty orthogonal. But they don't look orthogonal until you spend a lot of time reading/thinking about AI safety, which you're not inclined to do if you already know a lot about AI and assume that knowledge transfers to AI safety.
In other words, the "correct" intuitions (on experimentalism/theorism) for modern AI research might be the opposite of the "correct" intuitions for AI safety. (I would, for instance, predict that if Superintelligence were published during the era of GOFAI, all else equal it would've made a bigger splash because AI researchers then were more receptive to abstract theorizing.)
That all seems pretty fair.
If a system is trying to align with idealized reflectively-endorsed values (similar to CEV), then one might expect such values to be coherent.
That's why I distinguished between the hypotheses of "human utility" and CEV. It is my vague understanding (and I could be wrong) that some alignment researchers see it as their task to align AGI with current humans and their values, thinking the "extrapolation" less important or that it will take care of itself, while others consider extrapolation an important part of the alignment problem. For the former group, human utility is more salient, while the latter probably cares more about the CEV hypothesis (and the arguments you list in favor of it).
Arguably, you can't fully align with inconsistent preferences
My intuitions tend to agree, but I'm also inclined to ask "why not?" E.g. even if my preferences are absurdly cyclical, if we get AGI to imitate me perfectly (or me + faster thinking + more information), in what sense of the word is it "unaligned" with me? More generally, what is it about these other coherence conditions that prevents meaningful "alignment"? (Maybe this opens a big discursive can of worms, but I actually haven't seen this discussed on a serious level so I'm quite happy to just read references).
Essentially, I think one should either stick to a more-or-less utility-theoretic framework, or resort to taking a much more empirical approach where human preferences are learned in all their inconsistent detail (without a background assumption such as prospect theory).
That's still a false dichotomy, but I think it is an appropriate response to many critiques of utility theory.
Hadn't thought about it this way. Partially updated (but still unsure what I think).
To be clear I unendorsed the idea about a minute after posting because it felt like more of a low-effort shitpost than a constructive idea for understanding the world (and I don't want to make that a norm on shortform). That said I had in mind that you're describing the thing to someone who you can't communicate with beforehand, except there's common knowledge that you're forbidden any nouns besides "cake". In practice I feel like it degenerates to putting all the meaning on adjectives to construct the nouns you'd want to use. E.g. your own "speaking cake" to denote a person, "flat, vertical, compartmentalizing cakes" to denote walls. Of course you'd have to ban any "-like" and "-esque" constructions and similar things, but it's not clear to me if the boundaries there are too fuzzy to make a good rule set.
Actually, maybe this could be a board game similar to charades. You get a random word such as "elephant", and you write down a description of it with this constraint. Then the description is gradually read off, and your team tries to guess the word based on the description. It's inverse to charades in that the reading is monotonous and w/o body language (and could even be done by the other team).
K-complexity: The minimum description length of something (relative to some fixed description language)
Cake-complexity: The minimum description length of something, where the only noun you can use is "cake"
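K-complexity itself is uncomputable, but the length of any particular compressed encoding gives a computable upper bound on it (up to an additive constant for the decompressor). A rough sketch, where using zlib as the "fixed description language" is purely an assumption for illustration:

```python
import os
import zlib


def compressed_length(data: bytes) -> int:
    """Length in bytes of data after zlib compression at the highest level.

    This is only an upper-bound proxy for K-complexity, relative to zlib.
    """
    return len(zlib.compress(data, 9))


regular = b"cake" * 250          # 1000 bytes with an obvious short description
random_bytes = os.urandom(1000)  # 1000 bytes that almost surely have no short description

print(compressed_length(regular), compressed_length(random_bytes))
```

The repetitive string compresses to a small fraction of its length, while the random bytes do not compress at all. No analogous shortcut is offered here for cake-complexity.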
I often hear about deepfakes--pictures/videos that can be entirely synthesized by a deep learning model and made to look real--and how this could greatly amplify the "fake news" phenomenon and really undermine the ability of the public to actually evaluate evidence.
And this sounds like a well-founded worry, but then I was just thinking, what about Photoshop? That's existed for over a decade, and for all that time it's been possible to doctor images to look real. So why should deepfakes be any scarier?
Part of it could be that we can fake videos, not just images, but that can't be all of it.
I suspect the main reason is that in the future, deepfakes will also be able to fool experts. This does seem like an important threshold.
This raises another question: is it, in fact, impossible to fool experts with Photoshop? Are there fundamental limitations on it that prevent it from being this potent, and this was always understood so people weren't particularly fearful of it? (FWIW when I learned about Photoshop as a kid I freaked out with Orwellian visions even worse than people have with deepfakes now, and pretty much only relaxed out of conformity. I remain ignorant about the technical details of Photoshop and its capabilities)
But even if deepfakes are bound to cross this threshold (not that it's a fine line) in a way Photoshop never could, aren't there also plenty of things which experts have had and do have trouble classifying as real/fake? Wikipedia's list of hoaxes is extensive, albeit most of those fooled the public rather than experts. But I feel like there are plenty of hoaxes that lasted hundreds of years before being debunked (Shroud of Turin, or maybe fake fossils?).
I guess we're just used to seeing fewer hoaxes in modern times. Like, in the past hoaxes abounded, and there often weren't the proper experts around to debunk them, so probably those times warranted a greater degree of epistemic learned helplessness or something. But since the last century, our forgery-spotting techniques have gotten a lot better while the corresponding forgeries just haven't kept up, so we just happen to live in a time where the "offense" is relatively weaker than the "defense", but there's no particular reason it should stay that way.
I'm really not sure how worried I should be about deepfakes, but having just thought through all that, it does seem like the existence of "evidence" in political discourse is not an all-or-nothing phenomenon. Images/videos will likely come to be trusted less, maybe other things as well if deep learning contributes in other ways to the "offense" more than the "defense". And maybe things will reach a not-so-much-worse equilibrium. Or maybe not, but the deepfake phenomenon certainly does not seem completely new.
Is this open thread not going to be a monthly thing?
FWIW I liked reading the comment threads here, and would be inclined to participate in the future. But that's just my opinion. I'm curious if more senior people had reasons for not liking the idea?
I'm not asking about the Fermi paradox, and it's unclear to me how that's related. I'm wondering why we think general (i.e. human-level) intelligence is possible in our universe, if we're not allowed to invoke anthropic evidence. For instance, here are some possible ways one can answer my question [rot13'd to avoid spoiling people's answers]:
1. Nethr gung aba-cevzngr navzny vagryyvtrapr nyernql trgf hf "zbfg bs gur jnl gurer", naq tvira gur nccebcevngr raivebazrag, vg fubhyq or cbffvoyr va cevapvcyr sbe n fhpprffvba bs navzny fcrpvrf gb ribyir gur erznvavat pncnovyvgvrf.
2. [Rnfl Zbqr] Nethr gung qrrc yrneavat nyernql trgf hf "zbfg bs gur jnl gurer", naq vg fubhyq or cbffvoyr, jvgu rabhtu genvavat qngn naq gur nccebcevngr nytbevguzvp gjrnxf, gb trg n qrrc yrneavat nytbevguz gb qb trareny-checbfr ernfbavat engure rssvpvragyl.
3. [Uneq Zbqr] Nethr gung uvtu-yriry zngurzngvpny pbafgehpgf, fhpu nf havirefny Ghevat znpuvarf (be creuncf zber pbaivapvatyl, NVKV), fubj gung yrneavat va irel trareny raivebazragf vf cbffvoyr jvgu hayvzvgrq pbzchgr. Gura nethr gung zhpu bs guvf pna or nccebkvzngrq ol srnfvoyr (r.t. cbylabzvny gvzr) nytbevguzf gung pna eha va erny-gvzr ba n oenva/pbzchgre zhpu fznyyre guna n cynarg.
I'm unsure what information you need about what the "you" in this counterfactual is. Beyond "an alien from a different universe whose general-purpose reasoning algorithms are different enough from those of Earth-based animals that they can't infer anything about the potential of Earth-based biological intelligence", I'd be unable to give the details of those algorithms (and it shouldn't matter anyways?).
Huh, that's a good point. Whereas it seems probably inevitable that AI research would've eventually converged on something similar to the current D(R)L paradigm, we can imagine a lot of different ways AI safety could have looked instead right now. Which makes sense, since the latter is still young and in a kind of pre-paradigmatic philosophical stage, with little unambiguous feedback to dictate how things should unfold (and it's far from clear when substantially more of this feedback will show up).
I can imagine an alternate timeline where the initial core ideas/impetus for AI safety didn't come from Yudkowsky/LW, but from e.g. a) Bostrom/FHI b) Stuart Russell or c) some near-term ML safety researchers whose thinking gradually evolved as they thought about longer and longer timescales. And it's interesting to ask what the current field would consequently look like:
- Agent Foundations/Embedded Agency probably (?) wouldn't be a thing, or at least it might take some time for the underlying questions which motivate it to be asked in writing, let alone the actual questions within those agendas (or something close to them)
- For (c) primarily, it's unclear if the alignment problem would've been zeroed in on as the "central challenge", or how long this would take (note: I don't actually know that much about near-term concerns, but I can imagine things like verification, adversarial examples, and algorithmic fairness lingering around on center stage for a while).
- A lot of the focus on utility functions probably wouldn't be there
And none of that is to say anything about those alternate timelines is better, but is to say that a lot of the things I often associate with AI safety are only contingently related. This is probably obvious to a lot of people on here, and of course we have seen some of the Yudkowskian foundational framings of the problem be de-emphasized as non-LW people have joined the field.
On the other hand, as far as "lock-in" itself is concerned, it does seem like there's a certain amount of deference that EA has given MIRI/LW on some of the more abstruse matters where would-be critics don't want to sound stupid for lack of technical sophistication--UDT, Solomonoff, and similar stuff internal to agent foundations--and the longer any idea lingers around, and the farther it spreads, the harder it is to root out if we ever do find good reasons to overturn it. Although I'm not that worried about this, since those ideas are by definition only fully understood/debated by a small part of the community.
Also, it's my impression that most EAs believe in one-boxing, but not necessarily UDT. For instance, some apparently prefer EDT-like theories, which makes me think the relatively simple arguments for one-boxing have percolated pretty widely (and are probably locked in), but the more advanced details are still largely up for debate. I think similar things can be said for a lot of other things, e.g. "thinking probabilistically" is locked in but maybe not a lot of the more complicated aspects of Bayesian epistemology that have come out of LW.
Yes, perhaps I should've been more clear. Learning certain distance functions is a practical solution to some things, so maybe the phrase "distance functions are hard" is too simplistic. What I meant to say is more like
Fully-specified distance functions are hard, over and above the difficulty of formally specifying most things, and it's often hard to notice this difficulty
This is mostly applicable to Agent Foundations-like research, where we are trying to give a formal model of (some aspect of) how agents work. Sometimes, we can reduce our problem to defining the appropriate distance function, and it can feel like we've made some progress, but we haven't actually gotten anywhere (the first two examples in the post are like this).
The 3rd example, where we are trying to formally verify an ML model against adversarial examples, is a bit different now that I think of it. Here we apparently need a transparent, formally-specified distance function if we have any hope of absolutely proving the absence of adversarial examples. And in formal verification, the specification problem often is just philosophically hard like this. So I suppose this example is less insightful, except insofar as it lends extra intuitions for the other class of examples.
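As a toy illustration of what such a proof looks like once a distance function has been fixed, here is a sketch for the simplest possible case: a linear classifier with the L-infinity distance. Both choices are assumptions made for the example (as are the function names), and the choice of L-infinity is exactly the part that is hard to justify. For a score f(x) = w·x + b, any perturbation with every coordinate changed by less than eps can move the score by less than eps times the L1 norm of w, so the predicted sign is provably stable for eps up to |f(x)| / ||w||_1.

```python
def score(weights, bias, x):
    """Linear classifier score; the predicted class is the sign of this value."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def certified_radius(weights, bias, x):
    """Largest eps such that every x' with max_i |x'_i - x_i| < eps keeps the same predicted sign."""
    l1_norm = sum(abs(w) for w in weights)
    if l1_norm == 0:
        raise ValueError("constant classifier: every radius is trivially certified")
    return abs(score(weights, bias, x)) / l1_norm


def worst_case_point(weights, bias, x, eps):
    """The point at L-infinity distance eps from x that pushes the score hardest toward zero."""
    direction = 1.0 if score(weights, bias, x) > 0 else -1.0

    def sign(w):
        return (w > 0) - (w < 0)

    return [xi - direction * eps * sign(w) for xi, w in zip(x, weights)]


weights, bias = [2.0, -1.0, 1.0], 0.0
x = [1.0, 0.0, 1.0]

radius = certified_radius(weights, bias, x)
boundary = worst_case_point(weights, bias, x, radius)
print(radius, score(weights, bias, boundary))
```

The certificate is airtight, but it only rules out adversarial examples in the sense of "close in L-infinity". Whether that matches what we actually mean by "an input a human would consider the same" is the unformalized part.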
Yes, here: https://www.lesswrong.com/posts/QePFiEKZ4R2KnxMkW/posts-i-repent-of