Comment by wei_dai on Please give your links speaking names! · 2019-07-11T12:40:08.851Z · score: 13 (7 votes) · LW · GW

Other reasons to do it include accessibility and letting people know whether they've already read the linked article without having to hover over the link to view its URL. However, I sometimes still don't do it because of the costs, such as breaking up the flow of the text, making a comment seem more formal than I prefer, and just the effort of typing or copy/pasting the article title (especially on mobile).

Comment by wei_dai on IRL in General Environments · 2019-07-11T01:42:34.174Z · score: 14 (7 votes) · LW · GW

Regardless of whether it is intended or not, this sounds like a dig at CHAI’s work. I do not think that IRL is “nearly complete”. I expect that researchers who have been at CHAI for at least a year do not think that IRL is “nearly complete”. I wrote a sequence partly for the purpose of telling everyone “No, really, we don’t think that we just need to run IRL to get the one true utility function; we aren’t even investigating that plan”.

I think Stuart Russell still gives this impression in his (many) articles and interviews. I remember getting this impression listening to a recent interview, but will quote this Nov 2018 article instead since many of his interviews don't have transcripts:

Machines are beneficial to the extent that their actions can be expected to achieve our objectives [...]

It turns out, however, that it is possible to define a mathematical framework leading to machines that are provably beneficial in this sense. That is, we define a formal problem for machines to solve, and, if they solve it, they are guaranteed to be beneficial to us. In its simplest form, it goes like this:

  • The world contains a human and a machine.
  • The human has preferences about the future and acts (roughly) in accordance with them.
  • The machine’s objective is to optimise for those preferences.
  • The machine is explicitly uncertain as to what they are. [...]

There are two primary sources of difficulty that we are working on right now: satisfying the preferences of many humans and understanding the preferences of real humans. [...]

Machines will need to “invert” actual human behaviour to learn the underlying preferences that drive it.

Does this not sound like a plan of running (C)IRL to get the one true utility function?

Comment by wei_dai on LW authors: How many clusters of norms do you (personally) want? · 2019-07-08T01:27:59.868Z · score: 13 (6 votes) · LW · GW

I think I don't have any strong object-level preferences along these lines, and if norm clusters develop, I will probably end up copying/adopting whatever norm cluster seems to produce the most vibrant, highest-quality discussions.

One thing I would really like, though, is a chance to experiment with this idea to see what effects it has on discussions (hopefully positive ones), and I would definitely enable it for myself if it were an option that authors could choose for their comment sections.

Comment by wei_dai on Self-consciousness wants to make everything about itself · 2019-07-05T21:36:23.804Z · score: 12 (6 votes) · LW · GW

Fun tends to be highly personal: for example, some people find free soloing fun while others find it terrifying. Some people enjoy strategy games and others much prefer action games. So it seems surprising that you'd give an unconditional "should have" criticism/advice based on what you think is fun. I mean, you wouldn't say to someone, "you should not have used safety equipment during that climb." At most you'd say, "you should try not using safety equipment next time and see if that's more fun for you."

Comment by wei_dai on Self-consciousness wants to make everything about itself · 2019-07-05T06:10:29.130Z · score: 14 (6 votes) · LW · GW

If you find yourself in that situation again, try saying “no you” and improvise from there.

Why take the risk of this escalating into a seriously negative-sum outcome? I can imagine the risk being worth it for someone who needs to constantly show others that they can "handle themselves" and won't easily back down from perceived threats and slights. But presumably habryka is not in that kind of social circumstance, so I don't understand your reasoning here.

Comment by wei_dai on Problems with Counterfactual Oracles · 2019-07-05T05:18:14.854Z · score: 4 (2 votes) · LW · GW

But for 2., how do we get an automated system and containment setup that is secure against a superintelligence?

Well, that's what the current contest is about (in part). Have you been following it? But having said that, this conversation is making me realize that some of the ideas proposed there may not make as much sense as I thought.

I’m generally confused about what capabilities are assumed—is it just souped-up modern ML?

Yeah I'm confused about this too. I asked Stuart and he didn't really give a useful answer. I guess "under what assumed capabilities would Counterfactual Oracles be safe and useful" is also part of what needs to be worked out.

Even worse, it could (if sufficiently intelligent) subtly transfer or otherwise preserve itself before being shut down. Why are we assuming we can just shut it down, given that we have to give it at least a little time to think and train?

Are you thinking that the Oracle might have cross-episode preferences? I think to ensure safety we have to have some way to make sure that the Oracle only cares about doing well (i.e., getting a high reward) on the specific question that it's given, and nothing else, and this may be a hard problem.

Comment by wei_dai on Opting into Experimental LW Features · 2019-07-05T04:25:20.044Z · score: 3 (1 votes) · LW · GW

I typed that on mobile so couldn't explain more, but I think there's no need to automatically collapse comments, if you just make the comments take up the whole width of the browser window, and highlight the comments with high karma. This way the user can easily browse through lots of comments by just scrolling down, and choose which comments to direct their eyes at, without having to click on anything.

(The automatic collapse feature might still be useful on smaller screens.)

Comment by wei_dai on Opting into Experimental LW Features · 2019-07-05T03:13:00.618Z · score: 8 (3 votes) · LW · GW

Have you seen my code for doing this back in LW1? It's not working now for obvious reasons, but you can take a look at the screenshot to get an idea.

Comment by wei_dai on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-07-05T00:22:36.975Z · score: 3 (1 votes) · LW · GW

(I'm still confused and thinking about this, but figure I might as well write this down before someone else does. :)

While thinking more about my submission and counterfactual Oracles in general, this class of ideas for using CO is starting to look like trying to implement supervised learning on top of RL capabilities, because SL seems safer (less prone to manipulation) than RL. Would it ever make sense to do this in reality (instead of just doing SL directly)?

Comment by wei_dai on Problems with Counterfactual Oracles · 2019-07-04T17:57:58.463Z · score: 6 (3 votes) · LW · GW

It’s alright if the proposal isn’t perfect, but heuristically I’d want to see something like “here’s an analysis of why manipulation happens, and here are principled reasons to think that this proposal averts some or all of the causes”.

This seems fair, and I think one answer to this is (thanks to the framing provided by Michaël Trazzi):

  1. Manipulation happens because the Oracle can learn that it can get a high reward by producing a manipulative output.
  2. To avoid this, we can avoid letting any human or system that might be vulnerable to manipulation look at the Oracle's output, and use a secure automated system to compute the Oracle's reward.
  3. But that would make the Oracle useless.
  4. Ah, but we can do that during only some of the episodes ("training" episodes), and have separate "use" episodes where we make sure no learning takes place, where we do look at the Oracle's output and make use of it.
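To make the training/use split concrete, here is a minimal sketch of one episode under this scheme. The interfaces (`oracle.answer`, `oracle.update`, `automated_scorer`) are hypothetical placeholders, not part of anyone's actual proposal:

```python
import random

def run_episode(oracle, question, automated_scorer, erasure_prob=0.1):
    """One episode of the training/use split described above (toy sketch)."""
    answer = oracle.answer(question)
    if random.random() < erasure_prob:
        # "Training" episode: no human (or other manipulable system) ever sees
        # the answer. A sealed automated system computes the reward, and only
        # these episodes produce learning updates.
        reward = automated_scorer(question, answer)
        oracle.update(question, answer, reward)
        return None
    # "Use" episode: humans read and act on the answer, but no learning takes
    # place, so there is no gradient rewarding manipulative output.
    return answer
```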

Does this address your question/concern?

Comment by wei_dai on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-07-04T13:26:21.882Z · score: 3 (1 votes) · LW · GW

What is the advantage of making money this way, compared to making money by predicting the stock market?

Comment by wei_dai on Risks from Learned Optimization: Conclusion and Related Work · 2019-07-04T13:15:47.647Z · score: 5 (2 votes) · LW · GW

If mesa-optimizers are likely to occur in future AI systems by default, and there turns out to be some way of preventing mesa-optimizers from arising, then instead of solving the inner alignment problem, it may be better to design systems to not produce a mesa-optimizer at all.

I'm not sure I understand this proposal. If you prevent mesa-optimizers from arising, won't that drastically reduce the capability of the system that you're building (e.g., the resulting policy/model won't be able to do any kind of sophisticated problem solving to handle problems that don't appear in the training data)? Are you proposing to instead manually design an aligned optimizer that would be competitive with the mesa-optimizer that would have been created?

Comment by wei_dai on Problems with Counterfactual Oracles · 2019-07-04T12:47:51.961Z · score: 5 (3 votes) · LW · GW

they seem to rely on winning a game of engineering cleverness against a superintelligent mountain of otherwise-dangerous optimization pressure

I upvoted you, but this seems to describe AI safety as a whole. What isn't a game of engineering cleverness against a superintelligent mountain of otherwise-dangerous optimization pressure, in your view?

Comment by wei_dai on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-07-04T06:15:39.202Z · score: 4 (2 votes) · LW · GW

which sort of goes to show how gradient descent doesn’t distinguish between mesa-optimizers with single-episode and cross-episode objectives

Sorry I haven't followed the math here, but this seems like an important question to investigate independently of everything else in this thread. Maybe consider writing a post on it?

In the case of "actual" IDA, I guess the plan is for each overseer to look inside the model they're training, and penalize it for doing any unintended optimization (such as having cross-episode objectives). Although I'm not sure how that can happen at the lower levels where the overseers are not very smart.

Comment by wei_dai on Self-consciousness wants to make everything about itself · 2019-07-04T05:48:27.574Z · score: 23 (9 votes) · LW · GW

It's confusing that "tone argument" (in the OP) links to a Wikipedia article on "tone policing", if they're not supposed to be the same thing.

What is the actual relationship between tone arguments and tone policing? In the OP you wrote:

A tone argument criticizes an argument not for being incorrect, but for having the wrong tone.

From this it seems that tone arguments are the subset of tone policing that is aimed at arguments (as opposed to other forms of speech). But couldn't an argument constitute an implicit threat of violence, and therefore tone arguments could be good sometimes?

It seems like to address habryka's criticism, you're now redefining (or clarifying) "tone argument" to be a subset of the subset of tone policing that is aimed at arguments, namely where the aim of the policing is specifically claimed to be “helping you get more people to listen to you”. If that's the case, it seems good to be explicit about the redefinition/clarification to avoid confusing people even further.

Comment by wei_dai on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-07-04T05:26:03.659Z · score: 4 (2 votes) · LW · GW

Why is that? Doesn’t my behavior on question #1 affect both question #2 and its answer?

I was assuming each "question" actually includes as much relevant history as we can gather about the world, to make the Oracle's job easier, and in particular it would include all previous Oracle questions/answers, in which case if Oracle #1 does X to make question #2 easier, it was already that easy because the only world in which question #2 gets asked is one in which Oracle #1 did X. But now I realize that's not actually a safe assumption because Oracle #1 could break out of its box and feed Oracle #2 a false history that doesn't include X.

My point about "if we can make it so that each Oracle looks at the question they get and only cares about doing well on that question, that seems to remove the simulation warfare concern in the sequential case but not in the nested case" still stands though, right?

Also, this feels like a doomed game to me—I think we should be trying to reason from selection rather than relying on more speculative claims about incentives.

You may well be right about this, but I'm not sure what reason from selection means. Can you give an example or say what it implies about nested vs sequential queries?

Comment by wei_dai on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-07-03T23:03:09.451Z · score: 4 (2 votes) · LW · GW

Yes, but if we can make it so that each Oracle looks at the question they get and only cares about doing well on that question, that seems to remove the simulation warfare concern in the sequential case but not in the nested case.

Also, aside from simulation warfare, another way that the nested case can be manipulative and the sequential case not is if each Oracle cares about doing well on a fixed distribution of inputs (as opposed to doing well "on the current question" or "in the real world" or "on the actual questions that it gets"). That's because in the sequential case manipulation can only change the distribution of inputs that the Oracles receive, but it doesn't improve performance on any particular given input. In the nested case, performance on given inputs does increase.

Comment by wei_dai on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-07-03T18:01:12.755Z · score: 5 (3 votes) · LW · GW

I’m not sure I understand the concern.

Yeah, I'm not sure I understand the concern either, hence the tentative way in which I stated it. :) I think your objection to my concern is a reasonable one and I've been thinking about it myself. One thing I've come up with is that with the nested queries, the higher level Oracles could use simulation warfare to make the lower level Oracles answer the way that they "want", whereas the same thing doesn't seem to be true in the sequential case (if we make it so that in both cases each Oracle cares about just performance on the current question).

Comment by wei_dai on Aligning a toy model of optimization · 2019-07-03T07:46:11.463Z · score: 4 (2 votes) · LW · GW

If dropping competitiveness, what counts as a solution?

I'm not sure, but mainly because I'm not sure what counts as a solution to your problem. If we had a specification of that, couldn't we just remove the parts that deal with competitiveness?

Is “imitate a human, but run it fast” fair game?

I guess not, because a human imitation might have selfish goals and not be intent aligned to the user?

We could try to hash out the details in something along those lines, and I think that’s worthwhile, but I don’t think it’s a top priority and I don’t think the difficulties will end up being that similar.

What about my suggestion of hashing out the details of how to implement IDA/DEBATE using Opt and then seeing if we can decide whether or not it's aligned?

Comment by wei_dai on Raemon's Shortform · 2019-07-03T07:18:06.411Z · score: 6 (2 votes) · LW · GW

Note: These features do not seem to exist on GW. (Not that I miss them since I don't feel a need to use them myself.)

Questions: Is anyone using these features at all? Oh I see you said earlier "a couple people very briefly tried using them". Do you know why they stopped? Do you think you overestimated how many people would use it, in a way that could have been corrected (for example by surveying potential users or paying more attention to skeptical voices)? (To be fair, upon reviewing the comments on your Archipelago posts, there weren't that many skeptical voices, although I did upvote this one.) Given that you spent several months on Archipelago, it seems useful to do a quick postmortem on lessons learned?

Comment by wei_dai on steven0461's Shortform Feed · 2019-07-03T04:32:41.235Z · score: 5 (2 votes) · LW · GW

I haven't thought about it much, but it seems like the fraction of far future stuff isomorphic to the 21st century is probably fairly negligible from a purely utilitarian viewpoint, because the universe is so big that even using 1/1,000,000 of it for simulations would be a lot of simulations, and why would the far future want that many simulations of the 21st century? It doesn't seem like a good use of resources to do that many duplicate historical simulations in terms of either instrumental value or terminal value.

Comment by wei_dai on Problems with Counterfactual Oracles · 2019-07-02T17:32:32.253Z · score: 3 (1 votes) · LW · GW

Even worse, the (question, prediction, r=0) datapoints in the history could get so misleading that the oracle never learns anything.

Why do we have to give the oracle a zero reward for the non-erasure episodes? Why not just skip the learning/update step for those episodes?

In short, I think the CDT setup without machine learning assumes the problem already solved, that online learning won’t work and is unsafe, which leaves us with a “training then deployment” setup that isn’t really original.

Why not do a combination of pre-training and online learning, where you do enough during the training phase to get a useful predictor, and then use online learning to deal with subsequent distributional shifts?

Although I guess that probably isn't really original either. What seems original is the idea that, during any episode where learning will take place, we don't let humans (or any other system that might be insecure against the oracle) see the oracle's output until the episode is over.

Comment by wei_dai on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-07-02T14:38:39.827Z · score: 5 (2 votes) · LW · GW

Where (under which assumption) would you suggest that people focus their efforts?

Also, what level of capability should we assume the Oracle to have, or which assumption about level of capability would you suggest that people focus their efforts on?

Your examples all seem to assume oracles that are superhumanly intelligent. If that's the level of capability we should target with our questions, should we assume that we got this Oracle through a local or distributed takeoff? In other words, does the rest of the world look more or less like today's or are there lots of other almost-as-capable AIs around?

ETA: The reason for asking these questions is that you're only giving one prize for each type of Oracle, and would probably not give the prize to a submission that assumes something you think is very unlikely. It seems good to communicate your background views so that people aren't surprised later when you don't pick them as winners due to this kind of reason.

Comment by wei_dai on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-07-02T14:15:46.909Z · score: 8 (5 votes) · LW · GW

First, if you’re willing to make the (very) strong assumption that you can directly specify what objective you want your model to optimize for without requiring a bunch of training data for that objective, then you can only provide a reward in the situation where all subquestions also have erasures.

But if all subquestions have erasures, humans would have to manually execute the whole query tree, which is exponentially large (see the rough arithmetic below), so you'd run out of resources (in the counterfactual world) if you tried to do that, and the Oracle won't be able to give you a useful prediction. Wouldn't it make more sense to have the Oracle make a prediction about a counterfactual world where some humans just think normally for a while and write down their thoughts (similar to my "predict the best AF posts" idea)? I don't see what value the IDA idea is adding here.
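To put a rough number on "exponentially large" (the branching factor and depth below are purely illustrative):

```latex
\text{number of leaf subquestions} \approx b^{d}, \qquad
\text{e.g. } b = 10,\ d = 10 \;\Rightarrow\; 10^{10} \text{ subquestions,}
```

which is far beyond what the humans in the counterfactual world could execute manually.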

Second, I don’t think you’re entirely screwed even if you need training data, since you can do some relaxations that attempt to approximate the situation where you only provide rewards in the event of a complete erasure.

Given the above, "only provide rewards in the event of a complete erasure" doesn't seem to make sense as a target to approximate. Do you think your ideas in this paragraph still have value in light of that?

Comment by wei_dai on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-07-02T07:48:28.439Z · score: 6 (3 votes) · LW · GW

Question: are we assuming that mesa optimizer and distributional shift problems have been solved somehow? Or should we assume that some context shift might suddenly cause the Oracle to start giving answers that aren't optimized for the objective function that we have in mind, and plan our questions accordingly?

Comment by wei_dai on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-07-01T23:23:52.245Z · score: 8 (5 votes) · LW · GW

Is it safe to ask the Oracle a subquestion in the event of erasure? Aren't you risking having the Oracle produce an answer that is (in part) optimized to make it easier to predict the answer to the main question, instead of just the best prediction of how the human would answer that subquestion? (Sorry if this has already been addressed during a previous discussion of counterfactual oracles, because I haven't been following it closely.)

Comment by wei_dai on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-07-01T17:44:30.756Z · score: 8 (4 votes) · LW · GW

If that seems a realistic concern during the time period that the Oracle is being asked to predict, you could replace the AF with a more secure forum, such as a private forum internal to some AI safety research team.

Comment by wei_dai on Causal Reality vs Social Reality · 2019-07-01T16:31:02.610Z · score: 32 (11 votes) · LW · GW

I feel sympathy for both sides here. I think I personally am fine with both kinds of cultures, but sometimes kind of miss the more combative style of LW1, which I think can be fun and productive for a certain type of people (as evidenced by the fact that many people did enjoy participating on LW1 and it produced a lot of progress during its peak). I think in an ideal world there would be two vibrant LW2s, one for each conversational culture, because right now it's not clear where people who strongly prefer combat culture are supposed to go.

A nice signal that you cared about how I felt would have been that if after I’d said your bangs (!) felt condescending to me, you’d made an effort to reduce your usage rather than ramping them up to 11.

I think he might have been trying to signal that using lots of bangs is just his natural writing style, and therefore you needn't feel condescension as a result of them.

Comment by wei_dai on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-07-01T15:56:49.036Z · score: 12 (7 votes) · LW · GW

Submission. For the counterfactual Oracle, ask the Oracle to predict the n best posts on AF during some future time period (counterfactually if we didn't see the Oracle's answer). In that case, the reward function is computed as the similarity between the predicted posts and the actual top posts on AF as ranked by karma, with similarity computed using some ML model.

This seems like it could significantly accelerate AI safety research while being safe, since it's just showing us posts similar to what we would have written ourselves. If the ML model for measuring similarity isn't secure, the Oracle might produce outputs that attack the ML model, in which case we might need to fall back to some simpler way to measure similarity.
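As a minimal sketch of how the reward computation could look, assuming some off-the-shelf text-embedding model (the pairing-by-rank and cosine-similarity choices here are just one way to cash out "similarity computed using some ML model"):

```python
import numpy as np

def similarity_reward(predicted_posts, actual_top_posts, embed):
    """Toy reward: average embedding similarity between the Oracle's predicted
    posts and the actual top-karma AF posts. `embed` maps a string to a vector."""
    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    scores = [cosine(embed(p), embed(a))
              for p, a in zip(predicted_posts, actual_top_posts)]
    return sum(scores) / len(scores)
```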

Comment by wei_dai on Conceptual Problems with UDT and Policy Selection · 2019-07-01T13:50:30.789Z · score: 3 (1 votes) · LW · GW

Suppose you (as a human) are playing chicken against this version of UDT, which has vastly more computing power than you and could simulate your decisions in its proofs. Would you swerve?

I wouldn't, because I would reason that if I didn't swerve, UDT would simulate that and conclude that not swerving leads to the highest utility. You said "By deliberately crashing into the formerly smart madman, UDT can retroactively erase the situation." but I don't see how this version of UDT does that.

I don’t know what logical updatelessness means, and I don’t see where the article describes this

You're right, the post kind of just obliquely mentions it and assumes the reader already knows the concept, in this paragraph:

Both agents race to decide how to decide first. Each strives to understand the other agent’s behavior as a function of its own, to select the best policy for dealing with the other. Yet, such examination of the other needs to itself be done in an updateless way. It’s a race to make the most uninformed decision.

Not sure what's a good reference for logical updatelessness. Maybe try some of these posts? The basic idea is just that even if you manage to prove that your opponent doesn't swerve, you perhaps shouldn't "update" on that and then make your own decision while assuming that as a fixed fact that can't be changed.

Comment by wei_dai on Circle Games · 2019-07-01T11:45:53.979Z · score: 5 (2 votes) · LW · GW

You may be interested in my posts about WoW.

Thanks, they're interesting although the title "Everything I ever needed to know, I learned from World of Warcraft" promised a bit more than you've delivered so far. :) I'd be interested in other lessons you learned, especially ones that are more transferable to other situations (the Goodhart one was better in that regard than the loot system one).

My experience was similar. Leading raids, in particular, was excellent social-skills training.

Yeah, I imagine that must be the case for the guild/raid leaders, but don't see what the footsoldiers get out of it. (Aside from practicing to be footsoldiers, which most people don't really need more of?) I guess I'm hoping that MMGs can somehow deliver more learning opportunities for social/coordination skills than just giving a small number of people the chance to practice being low to mid-level managers.

Comment by wei_dai on Sam Harris and the Is–Ought Gap · 2019-07-01T07:39:35.000Z · score: 3 (1 votes) · LW · GW

This seems rather disappointing on Sam Harris's part, given that he indeed had training in philosophy (he has a B.A. in philosophy from Stanford, according to Wikipedia). If this post describes Harris's position correctly (I haven't read the source material), it seems to boil down to Harris saying that science can tell you what your instrumental goals/values should be, given your terminal goals/values. But it shouldn't be hard to see (or steelman) that when someone says "science can't bridge Hume's is–ought gap", they're saying that science can't tell you what your terminal goals/values should be. It seems like either Harris couldn't figure out the relatively simple nature of the disagreement/misunderstanding, or he could figure it out but deliberately chooses not to clarify/acknowledge it in order to keep marketing that he knows how "science can bridge Hume's is–ought gap".

Comment by wei_dai on Aligning a toy model of optimization · 2019-07-01T04:29:24.487Z · score: 5 (3 votes) · LW · GW

I suggest as a first step, we should just aim for an uncompetitive aligned AI, one that might use a lot more training data, or many more invocations of Opt than the benchmark. (If we can't solve that, that seems fairly strong evidence that a competitive aligned AI is impossible or beyond our abilities. Or if someone proposes a candidate and we can't decide whether it's actually aligned or not, that would also be very useful strategic information that doesn't require the candidate to be competitive.)

Do you already have a solution to the uncompetitive aligned AI problem that you can sketch out? It sounds like you think iterated amplification or debate can be implemented using Opt (in an uncompetitive way), so maybe you can give enough details about that to either show that it is aligned or provide people a chance to find flaws in it?

Comment by wei_dai on Conceptual Problems with UDT and Policy Selection · 2019-07-01T00:29:22.046Z · score: 5 (2 votes) · LW · GW

Doing what you describe requires something like logical updatelessness, which UDT doesn't do, and which we don't know how to do in general. I think this was described in the post. Also, even if thinking more doesn't allow someone to exploit you, it might cause you to miss a chance to exploit someone else, or to cooperate with someone else, because it makes you too hard to predict.

Comment by wei_dai on Aligning a toy model of optimization · 2019-06-30T20:51:38.797Z · score: 5 (2 votes) · LW · GW

I think that when a design problem is impossible, there is often an argument for why it’s impossible.

My knowledge/experience is limited but I know in cryptography we generally can't find arguments for why it's impossible to design an algorithm to break a cipher, and have to rely on the fact that lots of smart people have tried to attack a cipher and nobody has succeeded. (Sometimes we can have a "security proof" in the sense of a reduction from the security of a cipher to a problem like factoring, but then we're still relying on the fact that lots of smart people have tried to design faster factoring algorithms and nobody has succeeded.) Note that this is only an NP or co-NP problem, whereas determining that aligned AI is impossible is in Π₂P according to my analysis.

That said, it’s also not obvious that problems in NP are easier to solve than P, both contain problems you just can’t solve and so you are relying on extra structural facts in either case.

I guess that may be true in a raw compute sense, but there's also an issue of how do you convince people that you've solved a problem? Again in crypto,

  • NP problems (find a flaw in a cipher) get solved by individual researchers,
  • co-NP problems (determine a cipher is secure) get "solved" by people trying and failing to break a cipher,
  • Σ₂P problems (design a fast cipher free from security flaws) get solved by government-sponsored contests that the whole field participates in, and
  • Π₂P problems (determine that any cipher faster than X can't be secure) are left unsolved. (The fact that people tried and failed to design a secure cipher faster than X is really weak evidence that it's impossible, and of course nobody knows how to prove anything like this.)
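For reference, the quantifier structure behind those last two class assignments, writing $\mathrm{Sec}(C, A)$ for "cipher $C$ resists attack $A$" (a predicate introduced here purely for illustration):

```latex
\underbrace{\exists C\,\forall A:\ \mathrm{Sec}(C,A)}_{\text{design a flaw-free cipher: } \Sigma_2^P\text{-type}}
\qquad
\underbrace{\forall C\,\exists A:\ \neg\mathrm{Sec}(C,A)}_{\text{no cipher faster than }X\text{ is secure: } \Pi_2^P\text{-type}}
```
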
Comment by wei_dai on Aligning a toy model of optimization · 2019-06-30T18:07:10.532Z · score: 3 (1 votes) · LW · GW

You don’t even need to explicitly maintain separate levels of agent. You just always use the current model to compute the rewards, and use that reward function to compute a gradient and update.

You're using the current model to perform the subtasks of "compute the reward for the current task(s) being trained" and then updating, and local optimization ensures the update will make the model better (or at least no worse) at the task being trained, but how do you know the update won't also make the model worse at the subtasks of "compute the reward for the current task(s) being trained"?

Is the answer something like, the current tasks being trained includes all previously trained tasks? But even then, it's not clear that as you add more tasks to the training set, performance on previously trained tasks won't degrade.
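Here is a toy numpy sketch of the loop structure being discussed, where a frozen copy of the current model defines the reward used to update it. Everything in it (vector "parameters", tanh scoring, finite-difference gradients) is made up purely for illustration and reflects no actual IDA implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def model_output(params):
    """The model's behavior on the task (toy stand-in)."""
    return np.tanh(params)

def overseer_score(evaluator_params, output):
    """Toy stand-in for 'the current model computes the reward for the task'."""
    return float(np.tanh(evaluator_params) @ output)

def train(num_steps=100, dim=8, lr=0.05, eps=1e-3):
    # The same parameters both produce outputs and (as a frozen copy) score them.
    params = rng.normal(size=dim)
    for _ in range(num_steps):
        evaluator = params.copy()  # freeze the current model as the reward-definer
        grad = np.zeros(dim)
        for i in range(dim):
            bumped = params.copy()
            bumped[i] += eps
            # Finite-difference gradient of the frozen evaluator's score of the
            # perturbed model's output, i.e. the objective this update optimizes.
            delta = (overseer_score(evaluator, model_output(bumped))
                     - overseer_score(evaluator, model_output(params)))
            grad[i] = delta / eps
        params = params + lr * grad
        # Nothing in this loop guarantees the update preserves the model's
        # quality *as an evaluator*, which is the concern raised above.
    return params
```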

Comment by wei_dai on What's the best explanation of intellectual generativity? · 2019-06-30T14:16:21.116Z · score: 7 (3 votes) · LW · GW

We’ll need a different organizational form to permit the long view, I think.

I included "countries" in my original question and I think some countries (e.g., China) probably have the necessary long view, and probably wants to replicate Bell Labs (in, e.g., the Chinese Academy of Sciences), and must be missing some other element of what made it so successful.

Comment by wei_dai on Circle Games · 2019-06-30T14:04:06.991Z · score: 3 (1 votes) · LW · GW

Up to level 15 you play almost completely alone. Then you start being pushed into meeting some randos; maybe having joined one of their guilds by 30. By 50 you’re playing mostly with the same people, and by 60 you look forward to scheduled interactions with those people one to five times a week.

As I recall, at level 60, after you finish the 5-person content, you were forced into 40-person raids, where the amount of specialization/coordination/order-following required to make progress made it more like a tedious job than a game, at least for me. Curious if anyone has any insights into the design choice there, e.g., what was the thinking behind the end-game being 40-person raids, why wasn't there more of a ramp-up between the near-end-game and the actual end-game, did most WoW players not find it so tedious, etc.?

In theory it seems like massively multiplayer games would be a good way for people to develop/practice social/coordination skills, and I think WoW and MUDs before it did help me a lot in that regard. (Before, I was really anxious about talking to people.) But I'm not aware of any games that go beyond trying to coordinate 40-person raids, scaling into hundreds or thousands or more. (And as I mentioned, even the 40-person content was tedious to me.) I wonder if there is any way to make larger-scale coordination fun.

Comment by wei_dai on What's the best explanation of intellectual generativity? · 2019-06-30T01:20:59.453Z · score: 7 (3 votes) · LW · GW

I guess part of the reason must be that AT&T was supporting Bell Labs with its monopoly profits, and that's part of the "secret sauce" that none of the post-split organizations could inherit. What about other monopoly-supported research labs (such as Microsoft Research) though, whose leaders must have Bell Labs in mind as a model? Seems like there's still something we don't understand?

Comment by wei_dai on What's the best explanation of intellectual generativity? · 2019-06-29T21:26:58.846Z · score: 7 (3 votes) · LW · GW

Does anyone know why Bell Labs didn't take over the (research) world, either by absorbing more and more researchers or by other organizations or countries copying its model?

Comment by wei_dai on Conceptual Problems with UDT and Policy Selection · 2019-06-29T21:12:42.691Z · score: 9 (4 votes) · LW · GW

I agree with Two Ways UDT Hasn’t Generalized and What UDT Wants, and am still digesting the other parts. (I think a lot of this is really fuzzy and hard to talk/write about, and I feel like giving you some positive reinforcement for making the attempt and doing as well as you are. :)

The race for most-meta is only one possible intuition about what UDT is trying to be.

It seems like one meta level above what even UDT tries to be is decision theory (as a philosophical subject) and one level above that is metaphilosophy, and my current thinking is that it seems bad (potentially dangerous or regretful) to put any significant (i.e., superhuman) amount of computation into anything except doing philosophy, partly for reasons given in this post.

To put it another way, any decision theory that we come up with might have some kind of flaw that other agents can exploit, or just a flaw in general, such as in how well it cooperates or negotiates with or exploits other agents (which might include how quickly/cleverly it can make the necessary commitments). Wouldn't it be better to put computation into trying to find and fix such flaws (in other words, coming up with better decision theories) than into any particular object-level decision theory, at least until the superhuman philosophical computation itself decides to start doing the latter?

Comment by wei_dai on Aligning a toy model of optimization · 2019-06-29T20:25:02.982Z · score: 5 (2 votes) · LW · GW

Since you're one of the authors of the paper that introduced the idea of human analogues of computational complexity classes, which I've found to be really interesting and useful (see here for another place that I use it), I'm curious about your thoughts on the ways I've used it (e.g., am I misunderstanding the idea or misusing it) and whether you have any further thoughts about it yourself, such as what kinds of problems you expect to be outside the human equivalent of NP.

Comment by wei_dai on Aligning a toy model of optimization · 2019-06-29T06:30:48.754Z · score: 5 (2 votes) · LW · GW

I want this problem statement to stand relatively independently since I think it can be worked on relatively independently (especially if it ends up being an impossibility argument).

That makes sense. Are you describing it as a problem that you (or others you already have in mind such as people at OpenAI) will work on, or are you putting it out there for people looking for a problem to attack?

At each step of local search you have some current policy and you are going to produce a new one (e.g. by taking a gradient descent step, or by generating a bunch of perturbations). You can use the current policy to help define the objective for the next one, rather than needing to make a whole separate call to Opt.

So, something like, when training the next level agent in IDA, you initialize the model parameters with the current parameters rather than random parameters?

Comment by wei_dai on Aligning a toy model of optimization · 2019-06-29T02:43:32.743Z · score: 7 (3 votes) · LW · GW

(ETA: This comment may not make much sense unless the reader is familiar with section 2.2 of AI safety via debate.)

At the meta level, given Opt that we have to use as a black box, it seems like:

  1. Building an unaligned AI corresponds to P or at most NP
  2. Verifying that a flaw in a proposed aligned AI design actually is a flaw corresponds to P
  3. Checking whether an AI design is aligned (or has an alignment flaw) corresponds to NP or co-NP
  4. Designing an actually aligned AI corresponds to Σ₂P
  5. Determining that aligned AI is impossible corresponds to Π₂P
  6. Determining whether there is an aligned AI that comes with a clear argument for it being aligned corresponds to NP
  7. Determining whether there is a clear argument for aligned AI being impossible corresponds to NP

Does that seem right? For the impossibility part you're proposing to do 7 but since the actual problem is closer to 5, it could easily be the case that aligned AI is impossible but there is no clear argument for it. (I.e., there's no short argument that can convince you and you have to do the Π₂P computation instead.) So I would think that if 6 is false then that actually is (probably) bad news.
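One way to see where items 4 and 5 land, writing $\mathrm{Flaw}(D, f)$ for "$f$ is a genuine alignment flaw in design $D$" (which item 2 treats as efficiently checkable):

```latex
\underbrace{\exists D\,\forall f:\ \neg\mathrm{Flaw}(D,f)}_{\text{4. an actually aligned AI exists: } \Sigma_2^P\text{-type}}
\qquad
\underbrace{\forall D\,\exists f:\ \mathrm{Flaw}(D,f)}_{\text{5. aligned AI is impossible: } \Pi_2^P\text{-type}}
```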

Comment by wei_dai on Aligning a toy model of optimization · 2019-06-28T23:00:14.050Z · score: 21 (8 votes) · LW · GW

Most of this post seems to be simplified/streamlined versions of what you've written before. The following points seem to be new, and I have some questions:

Unfortunately, iterated amplification doesn't correspond to optimizing a single objective U; it requires either training a sequence of agents or exploiting properties of local search (using the previous iterate to provide oversight for the next).

"training a sequence of agents" is bad because it might require multiple invocations of Opt so it's not competitive with an unaligned AI that uses Opt a small constant number of times?

Can you explain more how iterated amplification exploits properties of local search?

If we just have Opt, it’s not clear if we can efficiently do anything like iterated amplification or debate.

Is this because (or one way to think about it is) Opt corresponds to NP and iterated amplification or debate correspond to something higher in the polynomial hierarchy?

I described Opt as requiring n times more compute than U. If we implemented it naively it would instead cost 2^n times more than U.

You described Opt as returning the argmax for U using only n times more compute than U, without any caveats. Surely this isn't actually possible because in the worst case it does require 2^n times more than U? So the only way to be competitive with the Opt-based benchmark is to make use of Opt as a black box?

It should be easier to compete with this really slow AI. But it’s still not trivial and I think it’s worth working on.

Why is it easier? (If you treat them both as black boxes, the difficulty should be the same?) Is it because we don't have to treat the slow naive version of Opt as a black box that we have to make use of, and therefore there are more things we can do to try to be competitive with it?

If we can’t compete with this benchmark, I’d feel relatively pessimistic about aligning ML.

Why wouldn't it just be impossible? Is it because ML occupies a different point on the speed/capability Pareto frontier and it might be easier to build an aligned AI near that point (compared to the point that the really slow AI occupies)?

Comment by wei_dai on Being the (Pareto) Best in the World · 2019-06-26T00:01:46.525Z · score: 16 (5 votes) · LW · GW

It seems like a natural next step here is to talk about comparative advantage (whereas "being the best in the world" seems more analogous to absolute advantage), but I'm not sure how to think about comparative advantage in the "dimensionality" setting. (To be fair, comparative advantage seems hard to think about in general.) So I'll just throw this out and see if anyone else has any ideas.

Comment by wei_dai on A case for strategy research: what it is and why we need more of it · 2019-06-21T15:34:33.097Z · score: 20 (7 votes) · LW · GW

We especially encourage researchers to share their strategic insights and considerations in write ups and blog posts, unless they pose information hazards.

I've been doing quite a bit of this recently, and I'd love to see other researchers do more of this.

However, I haven't gotten much engagement from people who work on strategy professionally. I'm not sure if they just aren't following LW/AF, or don't feel comfortable discussing strategically relevant issues in public. So this kind of ties into my other comment, and is part of what I'm thinking about as I try to puzzle out how to move forward, both for myself and for others who may be interested in writing up their strategic insights and considerations.

Allan Dafoe, director of the Centre for the Governance of AI, has a different take

I'm not sure I understand what Allan is suggesting, but it feels pretty similar to what you're saying. Can you perhaps explain your understanding of how his take differs from yours?

Comment by wei_dai on A case for strategy research: what it is and why we need more of it · 2019-06-21T15:02:53.369Z · score: 26 (9 votes) · LW · GW

I was recently told that there's a "fair bit" of AI strategy/policy/governance research and discussion happening non-publicly (e.g., via Google docs) by people at places like FHI and OpenAI. Looking at the acknowledgements section of this post, it appears that the current authors are not very "plugged in" to those non-public discussions. I am in a similar situation in that I'm interested in AI strategy but am not "plugged in" to the existing discussions. It seems like there's a few different ways to go from here and I'm not sure which is best:

  1. Try to get "plugged in" to the non-public discussions.
  2. Assuming there's not serious info hazard concerns, try to make the current discussions more public, e.g., by pushing for the creation of a public forum for discussing strategy and inviting strategy researchers to participate.
  3. Try to create a parallel public strategy discussion.

My guess is that assuming resources (and info hazards) aren't an issue, 3 is best because different kinds of research/discussion setups create different biases and it's good to have diversity to avoid blind spots. (For example Bitcoin and UDT both came out of informal online discussion forums instead of academia/industry/government research institutions.) But:

  1. Are there enough people and funding to sustain a parallel public strategy research effort and discussion?
  2. Are there serious info hazards, and if so can we avoid them while still having a public discussion about the non-hazardous parts of strategy?

I'd be interested in the authors' (or other people's) thoughts on these questions.

Comment by wei_dai on For the past, in some ways only, we are moral degenerates · 2019-06-18T08:39:36.027Z · score: 5 (2 votes) · LW · GW

I can see two possible ways to convince me that moral realism is true:

  1. I spend hundreds or more years in a safe environment with a bunch of other philosophically minded people and we try to come up with arguments for and against moral realism, counterarguments, counter-counterarguments and so on, and we eventually exhaust the space of such arguments and reach a consensus that moral realism is true.
  2. We solve metaphilosophy, program/teach an AI to "do philosophy", somehow reach high confidence that we did that correctly, and the AI solves metaethics and gives us a convincing argument that moral realism is true.

Do these seem like things that could be "put in as a strong conditional meta-preference" in your framework?

Comment by wei_dai on Research Agenda v0.9: Synthesising a human's preferences into a utility function · 2019-06-18T08:12:34.268Z · score: 36 (9 votes) · LW · GW

So the first thing to do is to group the partial preferences together according to similarity (for example, preferences for concepts closely related in terms of webs of connotations should generally be grouped together), and generalise them in some regularised way. Generalise means, here, that they are transformed into full preferences, comparing all possible universes. [...] It seems that standard machine learning techniques should already be up to this task (with all the usual current problems).

I don't understand how this is even close to being possible today. For example I have some partial preferences that could generally be described as valuing the existence of positive conscious experiences, but I have no idea how to generalize this to full preferences, since I do not have a way to determine, given an arbitrary physical system, whether it contains a mind that is having a positive conscious experience. This seems like a very hard philosophical problem to solve, and I don't see how "standard machine learning techniques" could possibly be to up to this task.

The way I would approach this problem is to say that humans seem to have a way of trying to generalize (e.g., figure out what we really mean by "positive conscious experience") by "doing philosophy" or "applying philosophical reasoning", and if we better understood what we're doing when we "do philosophy" then maybe we can program or teach an AI to do that. See Some Thoughts on Metaphilosophy where I wrote down some recent thoughts along these lines.

I'm curious to know what your thinking is here, in more detail.

AGI will drastically increase economies of scale

2019-06-07T23:17:38.694Z · score: 40 (14 votes)

How to find a lost phone with dead battery, using Google Location History Takeout

2019-05-30T04:56:28.666Z · score: 51 (22 votes)

Where are people thinking and talking about global coordination for AI safety?

2019-05-22T06:24:02.425Z · score: 93 (25 votes)

"UDT2" and "against UD+ASSA"

2019-05-12T04:18:37.158Z · score: 42 (13 votes)

Disincentives for participating on LW/AF

2019-05-10T19:46:36.010Z · score: 68 (28 votes)

Strategic implications of AIs' ability to coordinate at low cost, for example by merging

2019-04-25T05:08:21.736Z · score: 49 (19 votes)

Please use real names, especially for Alignment Forum?

2019-03-29T02:54:20.812Z · score: 30 (10 votes)

The Main Sources of AI Risk?

2019-03-21T18:28:33.068Z · score: 63 (26 votes)

What's wrong with these analogies for understanding Informed Oversight and IDA?

2019-03-20T09:11:33.613Z · score: 37 (8 votes)

Three ways that "Sufficiently optimized agents appear coherent" can be false

2019-03-05T21:52:35.462Z · score: 68 (17 votes)

Why didn't Agoric Computing become popular?

2019-02-16T06:19:56.121Z · score: 53 (16 votes)

Some disjunctive reasons for urgency on AI risk

2019-02-15T20:43:17.340Z · score: 37 (10 votes)

Some Thoughts on Metaphilosophy

2019-02-10T00:28:29.482Z · score: 55 (15 votes)

The Argument from Philosophical Difficulty

2019-02-10T00:28:07.472Z · score: 47 (13 votes)

Why is so much discussion happening in private Google Docs?

2019-01-12T02:19:19.332Z · score: 86 (25 votes)

Two More Decision Theory Problems for Humans

2019-01-04T09:00:33.436Z · score: 58 (19 votes)

Two Neglected Problems in Human-AI Safety

2018-12-16T22:13:29.196Z · score: 77 (25 votes)

Three AI Safety Related Ideas

2018-12-13T21:32:25.415Z · score: 73 (26 votes)

Counterintuitive Comparative Advantage

2018-11-28T20:33:30.023Z · score: 73 (27 votes)

A general model of safety-oriented AI development

2018-06-11T21:00:02.670Z · score: 70 (23 votes)

Beyond Astronomical Waste

2018-06-07T21:04:44.630Z · score: 92 (40 votes)

Can corrigibility be learned safely?

2018-04-01T23:07:46.625Z · score: 73 (25 votes)

Multiplicity of "enlightenment" states and contemplative practices

2018-03-12T08:15:48.709Z · score: 93 (23 votes)

Online discussion is better than pre-publication peer review

2017-09-05T13:25:15.331Z · score: 12 (12 votes)

Examples of Superintelligence Risk (by Jeff Kaufman)

2017-07-15T16:03:58.336Z · score: 5 (5 votes)

Combining Prediction Technologies to Help Moderate Discussions

2016-12-08T00:19:35.854Z · score: 13 (14 votes)

[link] Baidu cheats in an AI contest in order to gain a 0.24% advantage

2015-06-06T06:39:44.990Z · score: 14 (13 votes)

Is the potential astronomical waste in our universe too small to care about?

2014-10-21T08:44:12.897Z · score: 25 (27 votes)

What is the difference between rationality and intelligence?

2014-08-13T11:19:53.062Z · score: 13 (13 votes)

Six Plausible Meta-Ethical Alternatives

2014-08-06T00:04:14.485Z · score: 42 (43 votes)

Look for the Next Tech Gold Rush?

2014-07-19T10:08:53.127Z · score: 39 (37 votes)

Outside View(s) and MIRI's FAI Endgame

2013-08-28T23:27:23.372Z · score: 16 (19 votes)

Three Approaches to "Friendliness"

2013-07-17T07:46:07.504Z · score: 20 (23 votes)

Normativity and Meta-Philosophy

2013-04-23T20:35:16.319Z · score: 12 (14 votes)

Outline of Possible Sources of Values

2013-01-18T00:14:49.866Z · score: 14 (16 votes)

How to signal curiosity?

2013-01-11T22:47:23.698Z · score: 21 (22 votes)

Morality Isn't Logical

2012-12-26T23:08:09.419Z · score: 19 (35 votes)

Beware Selective Nihilism

2012-12-20T18:53:05.496Z · score: 40 (44 votes)

Ontological Crisis in Humans

2012-12-18T17:32:39.150Z · score: 44 (48 votes)

Reasons for someone to "ignore" you

2012-10-08T19:50:36.426Z · score: 23 (24 votes)

"Hide comments in downvoted threads" is now active

2012-10-05T07:23:56.318Z · score: 18 (30 votes)

Under-acknowledged Value Differences

2012-09-12T22:02:19.263Z · score: 47 (50 votes)

Kelly Criteria and Two Envelopes

2012-08-16T21:57:41.809Z · score: 11 (8 votes)

Cynical explanations of FAI critics (including myself)

2012-08-13T21:19:06.671Z · score: 21 (32 votes)

Work on Security Instead of Friendliness?

2012-07-21T18:28:44.692Z · score: 37 (40 votes)

Open Problems Related to Solomonoff Induction

2012-06-06T00:26:10.035Z · score: 27 (28 votes)

List of Problems That Motivated UDT

2012-06-06T00:26:00.625Z · score: 28 (29 votes)

How can we ensure that a Friendly AI team will be sane enough?

2012-05-16T21:24:58.681Z · score: 10 (15 votes)

Neuroimaging as alternative/supplement to cryonics?

2012-05-12T23:26:28.429Z · score: 17 (18 votes)

Strong intutions. Weak arguments. What to do?

2012-05-10T19:27:00.833Z · score: 17 (19 votes)