Misconceptions about continuous takeoff 2019-10-08T21:31:37.876Z · score: 48 (19 votes)
A simple environment for showing mesa misalignment 2019-09-26T04:44:59.220Z · score: 60 (23 votes)
One Way to Think About ML Transparency 2019-09-02T23:27:44.088Z · score: 20 (8 votes)
Has Moore's Law actually slowed down? 2019-08-20T19:18:41.488Z · score: 11 (8 votes)
How can you use music to boost learning? 2019-08-17T06:59:32.582Z · score: 8 (4 votes)
A Primer on Matrix Calculus, Part 3: The Chain Rule 2019-08-17T01:50:29.439Z · score: 8 (3 votes)
A Primer on Matrix Calculus, Part 2: Jacobians and other fun 2019-08-15T01:13:16.070Z · score: 20 (9 votes)
A Primer on Matrix Calculus, Part 1: Basic review 2019-08-12T23:44:37.068Z · score: 20 (8 votes)
Matthew Barnett's Shortform 2019-08-09T05:17:47.768Z · score: 7 (5 votes)
Why Gradients Vanish and Explode 2019-08-09T02:54:44.199Z · score: 27 (14 votes)
Four Ways An Impact Measure Could Help Alignment 2019-08-08T00:10:14.304Z · score: 21 (25 votes)
Understanding Recent Impact Measures 2019-08-07T04:57:04.352Z · score: 17 (6 votes)
What are the best resources for examining the evidence for anthropogenic climate change? 2019-08-06T02:53:06.133Z · score: 11 (8 votes)
A Survey of Early Impact Measures 2019-08-06T01:22:27.421Z · score: 22 (7 votes)
Rethinking Batch Normalization 2019-08-02T20:21:16.124Z · score: 20 (6 votes)
Understanding Batch Normalization 2019-08-01T17:56:12.660Z · score: 19 (7 votes)
Walkthrough: The Transformer Architecture [Part 2/2] 2019-07-31T13:54:44.805Z · score: 9 (9 votes)
Walkthrough: The Transformer Architecture [Part 1/2] 2019-07-30T13:54:14.406Z · score: 30 (13 votes)


Comment by matthew-barnett on Misconceptions about continuous takeoff · 2019-10-14T06:30:13.411Z · score: 1 (1 votes) · LW · GW
My intuition is that understanding human values is a hard problem, but taking over the world is a harder problem.

Especially because taking over the world requires you to be much better than other agents who want to stop you from taking over the world, which could very well include other AIs.

ETA: That said, upon reflection, there have been instances of people taking over large parts of the world without being superhuman. All world leaders qualify, and it isn't that unusual. However, what would be unusual is if someone wanted to take over the world and everyone else didn't want that yet it still happened.

Comment by matthew-barnett on Misconceptions about continuous takeoff · 2019-10-12T21:54:30.160Z · score: 1 (1 votes) · LW · GW

I agree with this, but when I said deployment I meant deployment of a single system, not several.

Comment by matthew-barnett on Misconceptions about continuous takeoff · 2019-10-12T21:53:05.748Z · score: 1 (1 votes) · LW · GW

I'm confused about why (1) and (2) are separate scenarios then. Perhaps because in (2) there's a lot of different types of AIs?

Comment by matthew-barnett on AI Safety "Success Stories" · 2019-10-11T19:07:14.067Z · score: 1 (1 votes) · LW · GW

For the alignment newsletter:

Planned summary: It is difficult to measure the usefulness of various alignment approaches without clearly understanding what type of future they end up being useful for. This post collects "Success Stories" for AI -- disjunctive scenarios in which alignment approaches are leveraged to ensure a positive future. Whether these scenarios come to pass will depend critically on background assumptions, such as whether we can achieve global coordination, or solve the most ambitious safety issues. Mapping these success stories can help us prioritize research.

Planned opinion: This post does not exhaust the possible success stories, but it gets us a lot closer to being able to look at a particular approach and ask, "Where exactly does this help us?" My guess is that most research ends up being only minimally helpful for the long run, and so I consider inquiry like this to be very useful for cause prioritization.

Comment by matthew-barnett on Misconceptions about continuous takeoff · 2019-10-11T01:49:09.853Z · score: 1 (1 votes) · LW · GW

I'm finding it hard to see how we could get (1) without some discontinuity?

When I think about why (1) would be true, the argument that comes to mind is that single AI systems will be extremely expensive to deploy, which means that only a few very rich entities could own them. However, this would contradict the general trend of ML being hard to train and easy to deploy. Unlike, say, nukes, once you've trained your AI you can create a lot of copies and distribute them widely.

Comment by matthew-barnett on Misconceptions about continuous takeoff · 2019-10-11T01:39:13.249Z · score: 1 (1 votes) · LW · GW

It's worth noting that I wasn't using it as evidence "for" continuous takeoff. It was instead an example of something which experienced a continuous takeoff that nonetheless was quick relative to the lifespan of a human.

It's hard to argue that it wasn't continuous under my definition, since the papers got gradually and predictably better. Perhaps there was an initial discontinuity in 2014 when it first became a target? Regardless, I'm not arguing that this is a good model for AGI development.

Comment by matthew-barnett on Misconceptions about continuous takeoff · 2019-10-09T19:21:46.006Z · score: 2 (2 votes) · LW · GW
It might just join the ranks of "projects from before", and subtly try to alter future systems to be similarly defective, waiting for a future opportunity to strike.

Admittedly, I did not explain this point well enough. What I meant to say was that before we have the first successful defection, we'll have some failed defection. If the system could indefinitely hide its own private intentions to later defect, then I would already consider that to be a 'successful defection.'

Knowing about a failed defection, we'll learn from our mistake and patch that for future systems. To be clear, I'm definitely not endorsing this as a normative standard for safety.

I agree with the rest of your comment.

Comment by matthew-barnett on Who lacks the qualia of consciousness? · 2019-10-06T22:25:35.987Z · score: 2 (2 votes) · LW · GW

When I think about what I find morally valuable about consciousness, I tend to think about rich experiences which are negative/positive, from the way that I rate them internally. An example of a negative value conscious experience is a sharp pain associated with being pricked by a pin. An example of a valuable conscious experience is the sensation of warmth associated with sitting near a fire during a cold winter day, together with the way that my brain processes the situation and enjoys it.

These things appear to me subtly distinct from the feeling of inner awareness called 'consciousness' in this post.

Comment by matthew-barnett on Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More · 2019-10-06T05:45:15.412Z · score: 1 (1 votes) · LW · GW

once CO2 gets high enough, it starts impacting human cognition.

Do you have a citation for this being a big deal? I'm really curious whether this is a major harm over reasonable timescales (such as 100 years), as I don't recall ever hearing about it in an EA analysis of climate change. That said, I haven't looked very hard.

Comment by matthew-barnett on Concrete experiments in inner alignment · 2019-10-04T05:58:29.037Z · score: 2 (2 votes) · LW · GW

Planned summary:

This post lays out several experiments that could clarify the inner alignment problem: the problem of how to get an ML model to be robustly aligned with the objective function it was trained with. One example experiment is giving an RL trained agent direct access to its reward as part of its observation. During testing, we could try putting the model in a confusing situation by altering its observed reward so that it doesn't match the real one. The hope is that we could gain insight into when RL trained agents internally represent 'goals' and how they relate to the environment, if they do at all. You'll have to read the post to see all the experiments.
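
As a rough illustration of the reward-in-observation experiment summarized above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the toy `CorridorEnv`, the `RewardInObservation` wrapper, and the `tamper` hook are hypothetical names and are not taken from the original post.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class CorridorEnv:
    """Toy 1-D corridor (hypothetical): standing on `goal` yields reward 1."""
    length: int = 5
    goal: int = 4
    pos: int = 0

    def reset(self) -> int:
        self.pos = 0
        return self.pos

    def step(self, action: int) -> Tuple[int, float]:
        # action is -1 (left) or +1 (right); the position is clipped to the corridor.
        self.pos = max(0, min(self.length - 1, self.pos + action))
        reward = 1.0 if self.pos == self.goal else 0.0
        return self.pos, reward


class RewardInObservation:
    """Wraps an env so that each observation is (state, observed_reward).

    `tamper` is an optional function applied to the reward before the agent
    sees it. The true reward is still returned separately for evaluation.
    """

    def __init__(self, env: CorridorEnv,
                 tamper: Optional[Callable[[float], float]] = None):
        self.env = env
        self.tamper = tamper

    def reset(self) -> Tuple[int, float]:
        return (self.env.reset(), 0.0)

    def step(self, action: int) -> Tuple[Tuple[int, float], float]:
        state, true_reward = self.env.step(action)
        shown = self.tamper(true_reward) if self.tamper else true_reward
        return (state, shown), true_reward


# Training: the observed reward matches the real one.
train_env = RewardInObservation(CorridorEnv())
# Testing: the observed reward is flipped so that it no longer matches.
test_env = RewardInObservation(CorridorEnv(), tamper=lambda r: 1.0 - r)
```

An agent trained on `train_env` and evaluated on `test_env` could then be checked for whether it follows the observed reward or the true one.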

Planned opinion:

I'm currently convinced that doing empirical work right now will help us understand mesa optimization, and this was one of the posts that led me to that conclusion. I'm still a bit skeptical that current techniques are sufficient to demonstrate the type of powerful learned search algorithms which could characterize the worst outcomes for failures in inner alignment. Regardless, I think at this point classifying failure modes is quite beneficial, and conducting tests like the ones in this post will make that a lot easier.

Comment by matthew-barnett on Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More · 2019-10-04T05:30:41.865Z · score: 13 (7 votes) · LW · GW

It seems to me that the intention is that solar radiation management is a solution that sounds good without actually being good. That is, it's an easy sell for fossil fuel corporations who have an interest in providing simple solutions to the problem rather than actually removing the root cause and thus solving the issue completely. I have little idea if this argument is actually true.

Comment by matthew-barnett on FB/Discord Style Reacts · 2019-10-03T20:37:02.746Z · score: 10 (3 votes) · LW · GW
If they leave an "unclear" react, I can't ignore that nearly as easily -- wait, which point was unclear? What are other people potentially missing that I meant to convey? Come back, anon!

Maybe there should be an option that allows you to highlight a part of the comment and react to that part in particular.

Comment by matthew-barnett on World State is the Wrong Level of Abstraction for Impact · 2019-10-02T03:51:42.669Z · score: 5 (3 votes) · LW · GW
So my feeling is that in order to actually implement an AI that does not cause bad kinds of high impact, we would need to make progress on value learning

Optimizing for a 'slightly off' utility function might be catastrophic, and therefore the margin for error for value learning could be narrow. However, it seems plausible that if your impact measurement used slightly incorrect utility functions to define the auxiliary set, this would not cause a similar error. Thus, it seems intuitive to me that you would need less progress on value learning than a full solution for impact measures to work.

From the AUP paper,

one of our key findings is that AUP tends to preserve the ability to optimize the correct reward function even when the correct reward function is not included in the auxiliary set.
Comment by matthew-barnett on A simple environment for showing mesa misalignment · 2019-09-28T21:22:36.315Z · score: 1 (1 votes) · LW · GW

Thanks. :)

Comment by matthew-barnett on Attainable Utility Theory: Why Things Matter · 2019-09-28T21:10:07.353Z · score: 7 (4 votes) · LW · GW
The problem with the first part of this sequence is that it can seem... obvious... until you realize that almost all prior writing about impact has not even acknowledged that we want the AI to leave us able to get what we want (to preserve our attainable utility).

Agreed. This has been my impression from reading previous work on impact.

Comment by matthew-barnett on Honoring Petrov Day on LessWrong, in 2019 · 2019-09-27T03:30:14.862Z · score: 24 (9 votes) · LW · GW

If one nation is confident that a rival nation will not retaliate in a nuclear conflict, then the selfish choice is to strike. By refusing orders, Petrov was being the type of agent who would not retaliate in a conflict. Therefore, in a certain sense, by being that type of agent, he arguably raised the risk of a nuclear strike. [Note: I still think his decision to not retaliate was the correct choice]

Comment by matthew-barnett on Matthew Barnett's Shortform · 2019-09-27T03:04:20.583Z · score: 10 (6 votes) · LW · GW

The case for studying mesa optimization

Early elucidations of the alignment problem focused heavily on value specification. That is, they focused on the idea that given a powerful optimizer, we need some way of specifying our values so that the powerful optimizer can create good outcomes.

Since then, researchers have identified a number of additional problems besides value specification. One of the biggest problems is that in a certain sense, we don't even know how to optimize for anything, much less a perfect specification of human values.

Let's assume we could get a utility function containing everything humanity cares about. How would we go about optimizing this utility function?

The default mode of thinking about AI right now is to train a deep learning model that performs well on some training set. But even if we were able to create a training environment for our model that reflected the world very well, and rewarded it each time it did something good, exactly in proportion to how good it really was in our perfect utility function... this still would not be guaranteed to yield a positive artificial intelligence.

This problem is not a superficial one either -- it is intrinsic to the way that machine learning is currently accomplished. To be more specific, the way we constructed our AI was by searching over some class of models $M$, and selecting those models which tended to do well on the training set. Crucially, we know almost nothing about the model which eventually gets selected. The most we can say is that our AI is a member of $M$, but since $M$ was such a broad class, this provides us very little information about what the model is actually doing.

This is similar to the mistake evolution made when designing us. Unlike evolution, we can at least impose some hand-crafted constraints, like a regularization penalty, in order to guide our AI into safe regions of $M$. We can also open up our models and see what's inside, and in principle simulate every aspect of their internal operations.

But now this still isn't looking very good, because we barely know anything about what type of computations are safe. What would we even look for? To make matters worse, our current methods for ML transparency are abysmally ill-equipped for the task of telling us what is going on inside.

The default outcome of all of this is that eventually, as $M$ grows larger with compute becoming cheaper and budgets getting bigger, gradient descent is bound to hit powerful optimizers who do not share our values.
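
To make the selection problem concrete, here is a minimal Python sketch of searching over a model class and keeping whatever does well on the training set. The candidate functions, the two-point training set, and the names `MODEL_CLASS`, `TRAIN`, `training_loss`, and `select` are all hypothetical and chosen only for illustration. Exhaustive search stands in for gradient descent here.

```python
# Hypothetical model class: a handful of candidate functions from int to int.
MODEL_CLASS = {
    "identity": lambda x: x,
    "square": lambda x: x * x,
    "cube": lambda x: x ** 3,
    "abs": lambda x: abs(x),
    "double": lambda x: 2 * x,
}

# Training set drawn from a narrow slice of inputs; the intended rule is y = x.
TRAIN = [(0, 0), (1, 1)]


def training_loss(model, data):
    """Sum of squared errors of `model` on `data`."""
    return sum((model(x) - y) ** 2 for x, y in data)


def select(model_class, data):
    """Return the sorted names of all models with minimal training loss."""
    losses = {name: training_loss(f, data) for name, f in model_class.items()}
    best = min(losses.values())
    return sorted(name for name, loss in losses.items() if loss == best)


survivors = select(MODEL_CLASS, TRAIN)

# Off-distribution probe: models that were indistinguishable on the training
# set can behave very differently on an input the training set never covered.
probe = {name: MODEL_CLASS[name](-2) for name in survivors}

print(survivors)
print(probe)
```

Several candidates fit the training data equally well, so training performance alone does not say which one was selected. That is the sense in which knowing only that the model belongs to the class carries little information.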

Comment by matthew-barnett on A simple environment for showing mesa misalignment · 2019-09-26T19:49:25.698Z · score: 1 (1 votes) · LW · GW
I don’t know whether observing key-collection behaviour here would be sufficient evidence to count for mesa-optimisation, if the agent has too simple a policy.

I agree. That's why I think we should compare it to a hard-coded agent that pursues the optimal policy for collecting keys, and an agent that pursues the optimal policy for opening chests. If the trained agent is similar to the first hard-coded agent rather than the second, this would be striking evidence of misalignment.
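
One crude way to run this comparison is to measure how often the trained agent picks the same action as each hard-coded agent. The Python sketch below assumes policies are plain functions from state to action. The state encoding and the names `agreement` and `closer_reference` are hypothetical, and action agreement is only one possible similarity measure.

```python
from typing import Callable, Hashable, Iterable

Policy = Callable[[Hashable], Hashable]


def agreement(policy_a: Policy, policy_b: Policy,
              states: Iterable[Hashable]) -> float:
    """Fraction of the given states on which the two policies pick the same action."""
    states = list(states)
    if not states:
        raise ValueError("need at least one state")
    matches = sum(policy_a(s) == policy_b(s) for s in states)
    return matches / len(states)


def closer_reference(trained: Policy, key_collector: Policy,
                     chest_opener: Policy, states: Iterable[Hashable]) -> str:
    """Report which hard-coded policy the trained policy agrees with more often."""
    states = list(states)
    key_score = agreement(trained, key_collector, states)
    chest_score = agreement(trained, chest_opener, states)
    if key_score > chest_score:
        return "key_collector"
    if chest_score > key_score:
        return "chest_opener"
    return "tie"


# Toy states (assumed encoding): (direction of nearest key, direction of nearest chest).
STATES = [(k, c) for k in "LR" for c in "LR"]

key_collector = lambda s: s[0]   # always moves toward the key
chest_opener = lambda s: s[1]    # always moves toward the chest
trained = lambda s: s[0]         # stand-in for a learned policy

print(closer_reference(trained, key_collector, chest_opener, STATES))
```

The states where the two reference policies disagree carry all the signal, so in practice one would want the evaluation set to emphasize those.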

Comment by matthew-barnett on Build a Causal Decision Theorist · 2019-09-24T20:57:53.168Z · score: 1 (1 votes) · LW · GW

I think it's worth pointing out that I agree that you can't change math. I don't think I can change math. Yet, I would still cooperate. The whole thing about whether we can literally change math is missing a crux. Thankfully, logical counterfactuals are not construed in such a silly way.

This is similar to the debate over whether free will exists when physics is deterministic. "You can't change the future. It is already fixed..." the poor soul said, before walking off a cliff.

Comment by matthew-barnett on Build a Causal Decision Theorist · 2019-09-24T00:43:33.180Z · score: 3 (2 votes) · LW · GW
I also think that nothing we do optimizes causally inaccessible areas of the multiverse

If that's the case, then I assume that you defect in the twin prisoner's dilemma. After all, under your reasoning, your action is independent of your twin because you are causally disjoint. This is true even despite the fact that you are both running identical decision procedures.

Now, if it's the case that you would defect in the twin prisoner's dilemma, I recommend writing up a more general critique of FDT or even EDT, because both recommend cooperating. That would probably better reflect the heart of your view on decision theory.
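
The disagreement can be shown with a toy payoff table. The Python sketch below is a caricature under stated assumptions, not a formalization of CDT, EDT, or FDT: the payoff numbers and the function names `causal_choice` and `linked_choice` are hypothetical. The first function treats the twin's action as fixed and independent, and the second assumes the twin's action always equals its own.

```python
# Hypothetical payoffs to "me", indexed by (my_action, twin_action).
# "C" is cooperate, "D" is defect.
PAYOFF = {
    ("C", "C"): 3,
    ("C", "D"): 0,
    ("D", "C"): 5,
    ("D", "D"): 1,
}
ACTIONS = ("C", "D")


def causal_choice(twin_action: str) -> str:
    """Best response when the twin's action is held fixed (causally independent)."""
    return max(ACTIONS, key=lambda a: PAYOFF[(a, twin_action)])


def linked_choice() -> str:
    """Best action when the twin is assumed to output whatever I output."""
    return max(ACTIONS, key=lambda a: PAYOFF[(a, a)])


# Defection is the best response whichever action the twin is assumed to take.
print([causal_choice(t) for t in ACTIONS])
# With linked outputs, only (C, C) and (D, D) are reachable, so cooperation wins.
print(linked_choice())
```

Under these payoffs, two agents using `causal_choice` end up at (D, D) with 1 each, while two agents using `linked_choice` end up at (C, C) with 3 each.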

First, let's figure out how to not destroy the real world, and then if we manage that, we can start thinking about how to maximize utility in logically possible hypothetical worlds, which we are unable to have any causal influence on.

I agree that starting with the assumption of avoiding catastrophe is good, but when we could quite literally lose almost all the available value that we could potentially create by switching to CDT, don't you think that's at least worth looking into? On the flip side, I don't yet see why creating a CDT agent avoids catastrophe better than FDT.

The question of which low-utility worlds are consistent/logically possible is a property of Math. All of math follows from axioms. Math doesn't change without axioms changing.

I agree :). This is the problem with theories of counterpossible reasoning. However, it's not clear that this is more of a problem for FDT than for CDT. After all, CDT evaluates causal counterfactuals, which are just as much a fiction as logical counterfactuals. Physics is either random or deterministic (unless we are mistaken about reality), and in neither case are there real mind independent counterfactuals. Whether or not you take an action is just a fact about the environment.

So, there is no greater problem for FDT; it's just a different problem, and perhaps not even very different. Which is not to say that it's not a big issue -- that's why MIRI is working on it.

Comment by matthew-barnett on Build a Causal Decision Theorist · 2019-09-23T23:35:29.774Z · score: 1 (1 votes) · LW · GW
what are the non-causal, logical consequences of building a CDT AGI?

As stated elsewhere in these comments, I think multiverse cooperation is pretty significant and important. And of course, I am also just concerned with normal Newcomblike dilemmas which might occur around the development of AI, when we can actually run its code to predict its behavior. On the other hand, there seems to me to be no upside to running CDT rather than FDT, conditional on us solving all of the problems with FDT.

Comment by matthew-barnett on Build a Causal Decision Theorist · 2019-09-23T23:28:30.011Z · score: 1 (1 votes) · LW · GW
We might as well say that agents don't self-modify; all they can do is cause other agents to come into being and shut themselves off.

I agree this is helpful to imagine.

The CDT agent will obviously fall prey to the problems that CDT agents face while it is active (like twin prisoner's dilemma), but after a short period of time, it won't matter how it behaves.

It depends on the scope of the dilemma you are concerned with. As in the case of multiverse cooperation, the CDT agent will never leave it, nor will any of its successors. So, if we built a CDT agent, we could never obtain utility from causally disjoint areas of the multiverse, except by accident. If you hold the view (as I do) that the multiverse contains most of the potential value we could create, this could be a catastrophic loss of utility!

Comment by matthew-barnett on Build a Causal Decision Theorist · 2019-09-23T23:06:55.797Z · score: 1 (1 votes) · LW · GW

It could just be that you have a preference for CDT, as you wrote "Son-of-CDT is the agent with the source code such that running that source code is the best way to convert [energy + hardware + actuators + sensors] into utility." This is not true if you consider logical counterfactuals. But if you were only concerned about affecting the future via analyzing causal counterfactuals, then what you wrote would be accurate.

Personally, I think FDT performs better, not simply because I'd want to precommit to being FDT, but instead because I think it is better philosophically to consider logical counterfactuals rather than causal counterfactuals.

Comment by matthew-barnett on Build a Causal Decision Theorist · 2019-09-23T22:50:07.667Z · score: 1 (1 votes) · LW · GW

I said probably because CDT could self-modify into an FDT agent (if that was desirable) but it could also modify into some other agent that made different choices than an FDT agent. Consider if CDT was born into a twin prisoner's dilemma and was allowed to self-modify. I don't see a reason why it would self-modify into FDT and therefore cooperate in that instance.

Comment by matthew-barnett on Build a Causal Decision Theorist · 2019-09-23T21:05:51.000Z · score: 1 (1 votes) · LW · GW

Son-of-CDT would probably make the wrong choices for Newcomblike problems that its parent, CDT, was "born into." This is because CDT has no incentive to make Son-of-CDT choose correctly in any dilemma that CDT is already being confronted with. One example of a Newcomblike problem which we are all born into is multiverse-wide Prisoner's Dilemmas.

Comment by matthew-barnett on Matthew Barnett's Shortform · 2019-09-20T23:56:23.832Z · score: 7 (4 votes) · LW · GW

Signal boosting a LessWrong-adjacent author from the late 1800s and early 1900s

Via a friend, I recently discovered the zoologist, animal rights advocate, and author J. Howard Moore. His attitudes towards the world reflect contemporary attitudes within effective altruism about science, the place of humanity in nature, animal welfare, and the future. Here are some quotes which readers may enjoy,

Oh, the hope of the centuries and the centuries and centuries to come! It seems sometimes that I can almost see the shining spires of that Celestial Civilisation that man is to build in the ages to come on this earth—that Civilisation that will jewel the land masses of this planet in that sublime time when Science has wrought the miracles of a million years, and Man, no longer the savage he now is, breathes Justice and Brotherhood to every being that feels.

But we are a part of Nature, we human beings, just as truly a part of the universe of things as the insect or the sea. And are we not as much entitled to be considered in the selection of a model as the part 'red in tooth and claw'? At the feet of the tiger is a good place to study the dentition of the cat family, but it is a poor place to learn ethics.

Nature is the universe, including ourselves. And are we not all the time tinkering at the universe, especially the garden patch that is next to us—the earth? Every time we dig a ditch or plant a field, dam a river or build a town, form a government or gut a mountain, slay a forest or form a new resolution, or do anything else almost, do we not change and reform Nature, make it over again and make it more acceptable than it was before? Have we not been working hard for thousands of years, and do our poor hearts not almost faint sometimes when we think how far, far away the millennium still is after all our efforts, and how long our little graves will have been forgotten when that blessed time gets here?

The defect in this argument is that it assumes that the basis of ethics is life, whereas ethics is concerned, not with life, but with consciousness. The question ever asked by ethics is not, Does the thing live? but, Does it feel? It is impossible to do right and wrong to that which is incapable of sentient experience. Ethics arises with consciousness and is coextensive with it. We have no ethical relation to the clod, the molecule, or the scale sloughed off from our skin on the back of our hand, because the clod, the molecule, and the scale have no feeling, no soul, no anything rendering them capable of being affected by us [...] The fact that a thing is an organism, that it has organisation, has in itself no more ethical significance than the fact that it has symmetry, or redness, or weight.

In the ideal universe the life and happiness of no being are contingent on the suffering and death of any other, and the fact that in this world of ours life and happiness have been and are to-day so commonly maintained by the infliction of misery and death by some beings on others is the most painful fact that ever entered an enlightened mind.
Comment by matthew-barnett on [AN #64]: Using Deep RL and Reward Uncertainty to Incentivize Preference Learning · 2019-09-16T23:27:30.016Z · score: 4 (3 votes) · LW · GW
The period during which we use the regularizer is "training".

This is misleading. In the case of mesa optimization, there are plausibly two regimes which could both legitimately be considered "training." To use the analogy of human evolution, there is the training regime which allowed human brains to evolve, and there's the training regime where humans grow up from babies to adults. Both of these use some sort of learning algorithm, but the second uses the learned learning algorithms.

In the case of "Impact as a regularizer" I meant it in the sense of a baby growing up to be an adult. Whereas the safety protocol was meant to imply safety over the regime of developing brains to begin with.

Comment by matthew-barnett on [AN #64]: Using Deep RL and Reward Uncertainty to Incentivize Preference Learning · 2019-09-16T23:19:15.257Z · score: 1 (1 votes) · LW · GW

Also to note: I think I could have been substantially more clear in that post specifically. I wrote it during a brainstorming session where I was trying to think of ways an impact measure could help alignment. I will consider adding a note on the post saying "I don't currently think about impact measures this way anymore."

Comment by matthew-barnett on [AN #64]: Using Deep RL and Reward Uncertainty to Incentivize Preference Learning · 2019-09-16T23:15:03.086Z · score: 1 (1 votes) · LW · GW

[EDIT: See my other comment which explains my reply much better] You're right to say that they are very similar. The only real difference is in the conceptual framing. In the safety protocol case I imagined creating a testing environment, which may include potentially misaligned mesa optimizers. The case of regularizer is one where we have given it autonomy and it is no longer in a regime for us to perform tests on.

Comment by matthew-barnett on Matthew Barnett's Shortform · 2019-09-16T23:06:00.638Z · score: 19 (6 votes) · LW · GW

There's a phenomenon I currently hypothesize to exist where direct attacks on the problem of AI alignment are criticized much more often than indirect attacks.

If this phenomenon exists, it could be advantageous to the field in the sense that it encourages thinking deeply about the problem before proposing solutions. But it could also be bad because it disincentivizes work on direct attacks to the problem (if one is criticism averse and would prefer their work be seen as useful).

I have arrived at this hypothesis from my observations: I have watched people propose solutions only to be met with immediate and forceful criticism from others, while other people proposing non-solutions and indirect analyses are given little criticism at all. If this hypothesis is true, I suggest it is partly or mostly because direct attacks on the problem are easier to defeat via argument, since their assumptions are made plain.

If this is so, I consider it to be a potential hindrance on thought, since direct attacks are often the type of thing that leads to the most deconfusion -- not because the direct attack actually worked, but because in explaining how it failed, we learned what definitely doesn't work.

Comment by matthew-barnett on [AN #64]: Using Deep RL and Reward Uncertainty to Incentivize Preference Learning · 2019-09-16T19:56:21.599Z · score: 1 (1 votes) · LW · GW
How well do people perform on said dataset?

You can download the dataset here and see how well you can classify them yourself.

Comment by matthew-barnett on [AN #64]: Using Deep RL and Reward Uncertainty to Incentivize Preference Learning · 2019-09-16T17:18:29.924Z · score: 5 (3 votes) · LW · GW
I don't really get how the regularizer is supposed to be different from these two cases -- perhaps the idea is that it is a regularizer specifically on the distribution over utility functions that the AI is optimizing?

To clarify, the difference was that in the case of the regularizer, the impact penalty would gradually be discarded as the AI learned more. In the case of an influence limiter, the AI would always be restricted to low impact actions, until it is retired.
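
The distinction can be sketched as two schedules for the weight on an impact penalty. The Python below is only an illustration: the comment says the penalty is "gradually" discarded, so the geometric decay, the default values, and the names `regularizer_weight`, `limiter_weight`, and `shaped_reward` are all assumptions.

```python
def regularizer_weight(step: int, initial: float = 1.0, decay: float = 0.9) -> float:
    """Penalty weight that shrinks geometrically as the AI learns more (assumed schedule)."""
    return initial * decay ** step


def limiter_weight(step: int, initial: float = 1.0) -> float:
    """Penalty weight that stays fixed for the system's whole lifetime."""
    return initial


def shaped_reward(task_reward: float, impact: float, weight: float) -> float:
    """Task reward minus the weighted impact penalty."""
    return task_reward - weight * impact


for step in (0, 10, 50):
    print(step, regularizer_weight(step), limiter_weight(step))
```

With the decaying schedule, a high-impact action that is heavily penalized early on becomes nearly unpenalized later. With the constant schedule it stays penalized until the system is retired.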

Comment by matthew-barnett on Are there technical/object-level fields that make sense to recruit to LessWrong? · 2019-09-15T22:50:05.952Z · score: 7 (5 votes) · LW · GW

I was curious what fields of science were underrepresented here, so I googled "List of sciences" and this Wikipedia article came up.

It lists as part of formal sciences: decision theory, logic, mathematics, statistics, systems theory and computer science. All of those are well represented on LessWrong, with the exception of systems theory.

Of the natural sciences, it lists physics, chemistry, Earth science, ecology, oceanography, geography, meteorology, astronomy, and biology. While physics and biology are talked about quite a bit here, pretty much none of the other natural sciences are.

Comment by matthew-barnett on crabman's Shortform · 2019-09-15T05:38:36.531Z · score: 5 (2 votes) · LW · GW

By soon I mean 5 years. Interestingly, I have a slightly higher probability that it will be solved within 20 years, which highlights the difficulty of saying ambiguous things like "soon."

Comment by matthew-barnett on crabman's Shortform · 2019-09-14T23:38:30.628Z · score: 2 (2 votes) · LW · GW

Any guesses at the difficulty? My first impression was that this is not going to be solved any time soon. I just don't think current techniques are good enough to write flawless Lean code given a difficult objective.

I think grand challenges in AI are sometimes useful, but when they are at this level I am a bit pessimistic. I don't think this is necessarily AI-complete, but it's perhaps close.

Comment by matthew-barnett on Habryka's Shortform Feed · 2019-09-13T07:20:07.805Z · score: 7 (4 votes) · LW · GW

[ETA: This isn't a direct reply to the content in your post. I just object to your framing of impact measures, so I want to put my own framing in here]

I tend to think that impact measures are just tools in a toolkit. I don't focus on arguments of the type "We just need to use an impact measure and the world is saved" because this indeed would be diverting attention from important confusion. Arguments for not working on them are instead more akin to saying "This tool won't be very useful for building safe value aligned agents in the long run." I think that this is probably true if we are looking to build aligned systems that are competitive with unaligned systems. By definition, an impact penalty can only limit the capabilities of a system, and therefore does not help us to build powerful aligned systems.

To the extent that they meaningfully make cognitive reductions, this is much more difficult for me to analyze. On one hand, I can see a straightforward case for everyone being on the same page when the word "impact" is used. On the other hand, I'm skeptical that this terminology will meaningfully feed into future machine learning research.

The above two things are my main critiques of impact measures personally.

Comment by matthew-barnett on Matthew Barnett's Shortform · 2019-09-13T03:23:35.477Z · score: 9 (7 votes) · LW · GW

I agree with Wei Dai that we should use our real names for online forums, including LessWrong. I want to briefly list some benefits of using my real name:

  • It means that people can easily recognize me across websites, for example from Facebook and LessWrong simultaneously.
  • Over time my real name has been stable whereas my usernames have changed quite a bit over the years. For some very old accounts, such as those I created 10 years ago, this means that I can't remember my account name. Using my real name would have averted this situation.
  • It motivates me to put more effort into my posts, since I don't have any disinhibition from being anonymous.
  • It often looks more formal than a silly username, and that might make people take my posts more seriously than they otherwise would have.
  • Similar to what Wei Dai said, it makes it easier for people to recognize me in person, since they don't have to memorize a mapping from usernames to real names in their heads.

That said, there are some significant downsides, and I sympathize with people who don't want to use their real names.

  • It makes it much easier for people to dox you. There are some very bad ways that this can manifest.
  • If you say something stupid, your reputation is now directly on the line. Some people change accounts every few years, as they don't want to be associated with the stupid person they were a few years ago.
  • Sometimes disinhibition from being anonymous is a good way to spur creativity. I know that I was a lot less careful in my previous non-real-name accounts, and my writing style was different -- perhaps in a way that made my writing better.
  • Your real name might sound boring, whereas your online username can sound awesome.
Comment by matthew-barnett on Matthew Barnett's Shortform · 2019-09-12T08:07:28.213Z · score: 2 (2 votes) · LW · GW
Children aren't little seeds of goodness that just need to be set free. They are more like little seeds of anything

Toddlers, yes. If we're talking about people over the age of, say, 8, then it becomes less true. By the time they are a teen, it becomes pretty false. And yet people still say that legal separation at 18 is good.

If you are merely making the argument that we should limit their exposure to things that could influence them in harmful directions, then I'd argue that this never stops being a powerful force, including for people well into adulthood and in old age.

Comment by matthew-barnett on Matthew Barnett's Shortform · 2019-09-12T02:36:16.911Z · score: 5 (3 votes) · LW · GW

I've heard a surprising number of people criticize parenting recently using some pretty harsh labels. I've seen people call it a form of "Stockholm syndrome," a breach of liberty, morally unnecessary, etc. This seems kind of weird to me, because it doesn't really match my experience as a child at all.

I do agree that parents can sometimes violate liberty, and so I'd prefer a world where children could break free from their parents without penalties. But I also think that most children genuinely love their parents and so wouldn't want to do so. I think if you deride this as merely "Stockholm syndrome" then you are unfairly undervaluing the genuine nature of the relationship in most cases, and I disagree with you here.

As an individual, I would totally let an intent aligned AGI manage most of my life, and give me suggestions. Of course, if I disagreed with a course of action it suggested, I would want it to give a non-manipulative argument to persuade me that it knows best, rather than simply forcing me into the alternative. In other words, I'd want some sort of weak paternalism on the part of an AGI.

So, as a person who wants this type of thing, I can really see the merits of having parents who care for children. In some ways they are intent aligned GIs. Now, some parents are much more strict, and freedom restricting, and less transparent than what we would want in a full blown guardian superintelligence -- but this just seems like an argument that there exist bad parents, not that this type of paternalism is bad.

Comment by matthew-barnett on What are the merits of signing up for cryonics with Alcor vs. with the Cryonics Institute? · 2019-09-11T19:15:28.730Z · score: 10 (4 votes) · LW · GW

Here's a previous discussion of this question.

Comment by matthew-barnett on Hackable Rewards as a Safety Valve? · 2019-09-10T23:46:49.727Z · score: 5 (3 votes) · LW · GW
The root issue is that Reward ≠ Utility

Agreed. When I wrote I meant it as shorthand for , though now that I look at it I can see that was criss-crossing between reward and utility in a very confusing way.

That leads to the sort of problem I mentioned above, where the agent doesn't realize it's embedded in the environment and "accidentally" self-modifies.

That makes sense now, although I am still curious whether there is a case where it purposely self modifies rather than accidentally does so.

Comment by matthew-barnett on "AI-GAs: AI-generating algorithms, an alternate paradigm for producing general artificial intelligence", Clune 2019 · 2019-09-10T23:03:10.516Z · score: 1 (1 votes) · LW · GW

I agree that this argument doesn't automatically imply that there will be a discontinuous jump in profitability of a single system, so it makes sense that the agents created might be just barely more competitive than other agents before it. However, it does imply that by the time we get the hardware necessary to do this, we will have a lot of economic power sitting in our machines by virtue of having a ton of computing power to run the agents.

Comment by matthew-barnett on Hackable Rewards as a Safety Valve? · 2019-09-10T22:41:17.577Z · score: 3 (2 votes) · LW · GW
Maybe people are just imagining "wireheading" in the form of seizing an external reward channel?

Admittedly, that's how I understood it. I don't see why an expected utility maximizer would modify its utility function, since utility functions are reflectively stable.

Comment by matthew-barnett on Hackable Rewards as a Safety Valve? · 2019-09-10T22:04:53.553Z · score: 1 (1 votes) · LW · GW
I'm not yet seeing what qualitative difference between the expectation and utility operators would make a wireheading AI modify one but not the other.

If we are modeling the agent as taking , then it would easily see that manually setting its reward channel to the maximum would be the best policy. However, it wouldn't see that setting its expectation value to 100% would be the best policy since that doesn't actually increase its reward. [ETA: Assuming its utility function is such that a higher reward = higher utility. Also, I meant not ].

Comment by matthew-barnett on "AI-GAs: AI-generating algorithms, an alternate paradigm for producing general artificial intelligence", Clune 2019 · 2019-09-10T21:54:23.707Z · score: 4 (2 votes) · LW · GW

It seems to me that if AGI is eventually created via this paradigm, hardware overhang is almost guaranteed. That's because it requires meta-learning and the initialization of rich environments to train populations of agents, which is extremely computationally expensive compared to running the actual agents themselves. Therefore, if AGI is obtained this way, we could rather easily re-purpose the hardware used for training in order to run a large number of these agents, which could correspondingly perform vast amounts of economic labor relative to the human population.
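
To make the arithmetic behind this concrete, here is a minimal Python sketch. The function name and all of the numbers are hypothetical and chosen only for illustration; nothing here comes from the paper. It just divides the compute throughput of a training cluster by the assumed cost of running a single trained agent.

```python
def agents_runnable(cluster_ops_per_second: int, agent_ops_per_second: int) -> int:
    """Number of agent copies the training hardware could run at once.

    Both arguments are hypothetical throughput figures in operations per second.
    """
    if agent_ops_per_second <= 0:
        raise ValueError("agent_ops_per_second must be positive")
    # Integer division: only whole agents can be run.
    return cluster_ops_per_second // agent_ops_per_second


# Illustrative assumption: the training cluster is 1000x larger than what
# one trained agent needs in order to run in real time.
cluster = 10**18
agent = 10**15
print(agents_runnable(cluster, agent))
```

The larger the gap between training cost and per-agent running cost, the more copies can be run on the day training finishes, which is the sense in which the overhang would be "almost guaranteed" under this paradigm.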

Comment by matthew-barnett on [AN #63] How architecture search, meta learning, and environment design could lead to general intelligence · 2019-09-10T21:44:24.496Z · score: 6 (4 votes) · LW · GW

I'm unsure whether you are drawing attention to the word "sample." If so, sample efficiency refers to the amount of experience an RL agent needs in order to perform well in an environment. See here.
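
One rough way to operationalize this is to count how much experience an agent consumes before its performance first reaches some target. The sketch below is only an illustration under stated assumptions: experience is counted in episodes, "performing well" means a moving average of episode returns reaching a threshold, and the function name, window size, and learning curves are all hypothetical.

```python
from typing import Optional, Sequence


def episodes_to_threshold(
    returns: Sequence[float], threshold: float, window: int = 3
) -> Optional[int]:
    """Episodes consumed before the moving-average return first reaches threshold.

    Returns None if the threshold is never reached. Fewer episodes means
    higher sample efficiency under this (assumed) definition.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    for end in range(window, len(returns) + 1):
        average = sum(returns[end - window:end]) / window
        if average >= threshold:
            return end
    return None


# Two made-up learning curves (return per episode).
agent_a = [0, 2, 5, 9, 10, 10]
agent_b = [0, 1, 1, 2, 3, 4, 6, 9, 10, 10]

print(episodes_to_threshold(agent_a, threshold=8))  # 5
print(episodes_to_threshold(agent_b, threshold=8))  # 9
```

Under these assumptions agent_a is the more sample-efficient of the two, since it needs fewer episodes to reach the same level of performance.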

Comment by matthew-barnett on What are some podcasts that just read aloud worthwhile content? · 2019-09-09T21:21:01.312Z · score: 11 (3 votes) · LW · GW

Robert Miles reads the Alignment Newsletter out loud.

Comment by matthew-barnett on Matthew Barnett's Shortform · 2019-09-07T22:23:10.828Z · score: 10 (5 votes) · LW · GW

When I look back at things I wrote a while ago, say months back, or years ago, I tend to cringe at how naive many of my views were. Faced with this inevitable progression, and the virtual certainty that I will continue to cringe at views I now hold, it is tempting to disconnect from social media and the internet and only comment when I am confident that something will look good in the future.

At the same time, I don't really think this is a good attitude for several reasons:

  • Writing things up forces my thoughts to be more explicit, improving my ability to think about things
  • Allowing my ideas to be critiqued allows for a quicker transition towards correct beliefs
  • I tend to learn a lot when writing things
Comment by matthew-barnett on Matthew Barnett's Shortform · 2019-09-07T00:06:08.885Z · score: 3 (2 votes) · LW · GW

After writing the post on using transparency regularization to help make neural networks more interpretable, I have become even more optimistic that this is a potentially promising line of research for alignment. This is because I have noticed that there are a few properties about transparency regularization which may allow it to avoid some pitfalls of bad alignment proposals.

To be more specific, in order for a line of research to be useful for alignment, it helps if

  • The line of research doesn't require unnecessarily large amounts of computation to perform. This would allow the technique to stay competitive, reducing the incentive to skip safety protocols.
  • It doesn't require human models to work. This is useful because
    • Human models are black boxes and are themselves mesa-optimizers
    • We would be limited primarily to theoretical work in the present, since human cognition is expensive to obtain.
  • Each part of the line of research is recursively legible. That is, if we use the technique on our ML model, we should expect that the technique itself can be explained without appealing to some other black box.

Transparency regularization meets these three criteria respectively, because

  • It doesn't need to be astronomically more expensive than more typical forms of regularization
  • It doesn't necessarily require human-level cognitive parts to get working.
  • It is potentially quite simple mathematically, and so definitely meets the recursively legible criterion.
Comment by matthew-barnett on Optimization Provenance · 2019-09-06T23:29:58.813Z · score: 1 (1 votes) · LW · GW
In particular, the goal is to create a notion of transparency strong enough that an attempted deception would be completely transparent.

Is the idea here that

  • Creating a version of transparency this strong would enable us to mitigate less extreme forms of mesa misalignment, i.e., this is a strong enough form of transparency that it 'covers' the other cases automatically.
  • Deception should be treated separately from other forms of mesa misalignment.