The "AI Debate" Debate 2020-07-02T10:16:23.553Z
Predicted Land Value Tax: a better tax than an unimproved land value tax 2020-05-27T13:40:04.092Z
How important are MDPs for AGI (Safety)? 2020-03-26T20:32:58.576Z
Curiosity Killed the Cat and the Asymptotically Optimal Agent 2020-02-20T17:28:41.955Z
Pessimism About Unknown Unknowns Inspires Conservatism 2020-02-03T14:48:14.824Z
Build a Causal Decision Theorist 2019-09-23T20:43:47.212Z
Utility uncertainty vs. expected information gain 2019-09-13T21:09:52.450Z
Just Imitate Humans? 2019-07-27T00:35:35.670Z
IRL in General Environments 2019-07-10T18:08:06.308Z
Not Deceiving the Evaluator 2019-05-08T05:37:59.674Z
Value Learning is only Asymptotically Safe 2019-04-08T09:45:50.990Z
Asymptotically Unambitious AGI 2019-03-06T01:15:21.621Z
Impact Measure Testing with Honey Pots and Myopia 2018-09-21T15:26:47.026Z


Comment by michaelcohen on Any work on honeypots (to detect treacherous turn attempts)? · 2020-11-12T19:41:09.353Z · LW · GW

I don't know of any serious work on it. I did have an idea regarding honeypots a little while ago here.

Comment by michaelcohen on Introduction To The Infra-Bayesianism Sequence · 2020-09-21T14:38:05.719Z · LW · GW

The results I prove assume realizability, and some of the results are about traps, but independent of the results, the algorithm for picking actions resembles infra-Bayesianism. So I think we're taking similar objects and proving very different sorts of things.

Comment by michaelcohen on Introduction To The Infra-Bayesianism Sequence · 2020-09-01T12:36:49.556Z · LW · GW

Looks like we've been thinking along very similar lines!

Comment by michaelcohen on Introduction To The Infra-Bayesianism Sequence · 2020-09-01T12:10:43.659Z · LW · GW
the maximin expected values of saying "heads" when the coin comes up heads, and saying "tails" when the coin comes up heads, are unequal

I don't follow. Isn't the maximin value 0 for both?

Comment by michaelcohen on The "AI Debate" Debate · 2020-07-09T20:28:29.788Z · LW · GW
Would it count if a malicious actor successfully finetuned GPT-3 to e.g. incite violence while maintaining plausible deniability?

Yes, that would count. I suspect that many "unskilled workers" would (alone) be better at inciting violence while maintaining plausible deniability than GPT-N at the point in time the leading group had AGI. Unless it's OpenAI, of course :P

Regarding intentionality, I suppose I didn't clarify the precise meaning of "better at", which I did take to imply some degree of intentionality, or else I think "ends up" would have been a better word choice. The impetus for this point was Paul's concern that someone would have used an AI to kill you to take your money. I think we can probably avoid the difficulty of a rigorous definition of intentionality, if we gesture vaguely at "the sort of intentionality required for that to be viable"? But let me know if more precision would be helpful, and I'll try to figure out exactly what I mean. I certainly don't think we need to make use of a version of intentionality that requires human-level reasoning.

Comment by michaelcohen on The "AI Debate" Debate · 2020-07-09T09:22:47.159Z · LW · GW
Are you predicting there won't be any lethal autonomous weapons before AGI?

No... thanks for pressing me on this.

Better at killing in a context where either the operator would punish the agent if they knew, or the state would punish the operator if they knew. So the agent has to conceal its actions at whichever level the punishment would occur.

Comment by michaelcohen on The "AI Debate" Debate · 2020-07-08T10:11:46.465Z · LW · GW

You're right--valuable is the wrong word. I guess I mean better at killing.

Comment by michaelcohen on The "AI Debate" Debate · 2020-07-07T17:38:10.863Z · LW · GW

Yep, I agree it is useless with a horizon length of 1. See this section:

For concreteness, let its action space be the words in the dictionary, and I guess 0-9 too. These get printed to a screen for an operator to see. Its observation space is the set of finite strings of text, which the operator enters.

So at longer horizons, the operator will presumably be pressing "enter" repeatedly (i.e. submitting the empty string as the observation) so that more words of the message come through.

This is why I think the relevant questions are: at what horizon-length does it become useful? And at what horizon-length does it become dangerous?
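The setup described above can be sketched as a simple interaction loop (all class and function names here are my own illustrative stand-ins, not anything from the original post): the agent emits one word per step, the operator replies with a finite string, and pressing enter submits the empty string so the next word of a longer message can come through.

```python
# Hypothetical sketch of the chatbot setup: action space = single words
# printed to a screen; observation space = finite strings from the operator.

def run_episode(agent, operator, horizon):
    """Run one episode lasting `horizon` interactions."""
    history = []
    for _ in range(horizon):
        word = agent.act(history)        # one word (or digit) shown on the screen
        reply = operator.respond(word)   # any string; "" = operator just presses enter
        history.append((word, reply))
    return history

class EchoAgent:
    """Trivial stand-in agent: emits a fixed message one word at a time."""
    def __init__(self, message):
        self.words = message.split()
    def act(self, history):
        t = len(history)
        return self.words[t] if t < len(self.words) else ""

class SilentOperator:
    """Operator who presses enter so more words of the message come through."""
    def respond(self, word):
        return ""

history = run_episode(EchoAgent("hello operator"), SilentOperator(), horizon=3)
print([w for w, _ in history])  # → ['hello', 'operator', '']
```

The horizon length is the only free knob here, which is why the questions above reduce to: at what horizon does this loop become useful, and at what horizon does it become dangerous?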

Comment by michaelcohen on The "AI Debate" Debate · 2020-07-07T10:23:04.169Z · LW · GW
At this point, the AI has strong incentive to manipulate its memory to produce cell phone signals, and create a super intelligence set to the task of controlling its future inputs.

Picking subroutines to run isn't in its action space, so it doesn't pick subroutines to maximize its utility. It runs subroutines according to its code. If the internals of the main agent involve an agent making choices about computation, then this problem could arise. But now we're not talking about a chatbot agent; we're talking about a totally different agent. I think you anticipate this objection when you say

(If this is outside its action space, then it can try to make a brainwashy message)

In one word??

Suppose you can't get the human to type the exact input you want now, but you can get the human to go away without inputting anything, while it slowly bootstraps an ASI which can type the desired string

Again, its action space is printing one word to a screen. It's not optimizing over a set of programs and then picking one in order to achieve its goals (perhaps by bootstrapping ASI).

Comment by michaelcohen on The "AI Debate" Debate · 2020-07-04T10:35:50.435Z · LW · GW

Okay. I'll lower my confidence in my position. I think these two possibilities are strategically different enough, and each sufficiently plausible, that we should come up with separate plans/research agendas for both of them. And then those research agendas can be critiqued on their own terms.

For the purposes of this discussion, I think this qualifies as a useful tangent, and this is the thread where a related disagreement comes to a head.

Edit: "valuable" was the wrong word. "Better at killing" is more to the point.

Comment by michaelcohen on The "AI Debate" Debate · 2020-07-04T10:19:12.326Z · LW · GW
I mean that we don't have any process that looks like debate that could produce an agent that wasn't trying to kill you without being competitive

It took me an embarrassingly long time to parse this. I think it says: any debate-trained agent that isn't competitive will try to kill you. But I think the next clause clarifies that any debate-trained agent whose competitor isn't competitive will try to kill you. This may be moot if I'm getting that wrong.

So I guess you're imagining running Debate with horizons that are long enough that, in the absence of a competitor, the remaining debater would try to kill you. It seems to me that you put more faith in the mechanism that I was saying didn't comfort me. I had just claimed that a single-agent chatbot system with a long enough horizon would try to take over the world:

The existence of an adversary may make it harder for a debater to trick the operator, but if they're both trying to push the operator in dangerous directions, I'm not very comforted by this effect. The probability that the operator ends up trusting one of them doesn't seem (to me) so much lower than the probability the operator ends up trusting the single agent in the single-agent setup.

Running a debate between two entities that would both kill me if they could get away with it seems critically dangerous.

Suppose two equally matched people are trying to shoot a basket from opposite ends of the 3-point line, before their opponent makes a basket. Each time they shoot, the two basketballs collide above the hoop and bounce off of each other, hopefully. Making the basket first = taking over the world and killing us on their terms. My view is that if they're both trying to make a basket, a basket being made is a more likely outcome than a basket not being made (if it's not too difficult for them to make the proverbial basket).

Side comment: so I think the existential risk is quite high in this setting, but I certainly don't think the existential risk is so low that there's little existential risk left to reduce with the boxing-the-moderator strategy. (I don't know if you'd have disputed that, but I've had conversations with others who did, so this seems like a good place to put this comment.)

Comment by michaelcohen on The "AI Debate" Debate · 2020-07-04T09:18:08.685Z · LW · GW
No, but what are the approaches to avoiding deceptive alignment that don't go through competitiveness?

We could talk for a while about this. But I'm not sure how much hangs on this point if I'm right, since you offered this as an extra reason to care about competitiveness, but there's still the obvious reason to value competitiveness. And idea space is big, so you would have your work cut out to turn this from an epistemic landscape where two people can reasonably have different intuitions to an epistemic landscape that would cast serious doubt on my side.

But here's one idea: have the AI show the operator messages that cause them to do better on randomly selected prediction tasks. The operator's prediction depends on the message, obviously, but the ground truth is the counterfactual ground truth if the message were never shown, so the AI's message can't affect the ground truth.

And then more broadly, impact measures, conservatism, or utility information about counterfactuals to complicate wireheading, seem at least somewhat viable to me, and then you could have an agent that does more than show us text that's only useful if it's true. In my view, this approach is way more difficult to get safe, but if I had the position that we needed parity in competitiveness with unsafe competitors in order to use a chatbot to save the world, then I'd start to find these other approaches more appealing.

Comment by michaelcohen on The "AI Debate" Debate · 2020-07-04T08:51:18.908Z · LW · GW

But your original comment was referring to a situation in which we didn't carefully control the AI in our lab (by letting it have an arbitrarily long horizon). If we have lead time on other projects, I think it's very plausible to have a situation where we couldn't protect ourselves from our own AI if we weren't carefully controlling the conditions, but we could protect ourselves from our own AI if we were carefully controlling the situation, and then given our lead time, we're not at a big risk from other projects yet.

Comment by michaelcohen on The "AI Debate" Debate · 2020-07-03T21:30:38.750Z · LW · GW
The purpose of research now is to understand the landscape of plausible alignment approaches, and from that perspective viability is as important as safety.

Point taken.

I think it is unlikely for a scheme like debate to be safe without being approximately competitive

The way I map these concepts, this feels like an elision to me. I understand what you're saying, but I would like to have a term for "this AI isn't trying to kill me", and I think "safe" is a good one. That's the relevant sense of "safe" when I say "if it's safe, we can try it out and tinker". So maybe we can recruit another word to describe an AI that is both safe itself and able to protect us from other agents.

use those answers [from Debate] to ensure ... that the overall system can be stable to malicious perturbations

Is "overall system" still referring to the malicious agent, or to Debate itself? If it's referring to Debate, I assume you're talking about malicious perturbations from within rather than malicious perturbations from the outside world?

If your honest answers aren't competitive, then you can't do that and your situation isn't qualitatively different from a human trying to directly supervise a much smarter AI.

You're saying that if we don't get useful answers out of Debate, we can't use the system to prevent malicious AI, and so we'd have to just try to supervise nascent malicious AI directly? I certainly don't dispute that if we don't get useful answers out of Debate, Debate won't help us solve X, including when X is "nip malicious AI in the bud".

It certainly wouldn't hurt to know in advance whether Debate is competitive enough, but if it really isn't dangerous itself, then I think we're unlikely to become so pessimistic about the prospects of Debate, through our arguments and our proxy experiments, that we don't even bother trying it out, so it doesn't seem especially decision-relevant to figure it out for sure in advance. But again, I take your earlier point that a better understanding of the landscape is always going to have some worth.

if your AI could easily kill you in order to win a debate, probably someone else's AI has already killed you

This argument seems to prove too much. Are you saying that if society has learned how to do artificial induction at a superhuman level, then by the time we give a safe planner that induction subroutine, someone will have already given that induction routine to an unsafe planner? If so, what hope is there as prediction algorithms relentlessly improve? In my view, the whole point of AGI Safety research is to try to come up with ways to use powerful-enough-to-kill-you artificial induction in a way that it doesn't kill you (and helps you achieve your other goals). But it seems you're saying that there is a certain level of ingenuity where malicious agents will probably act with that level of ingenuity before benign agents do.

That is, safety separate from competitiveness mostly matters in scenarios where you have very large leads / very rapid takeoffs

It seems fairly likely to me that the next best AGI project behind Deepmind, OpenAI, the USA, and China is way behind the best of those. I would think people in those projects would have months at least before some dark horse catches up.

So competitiveness still matters somewhat, but here's a potential disagreement we might have: I think we will probably have at least a few months, and maybe more than a year, where the top one or two teams have AGI (powerful enough to kill everyone if let loose), and nobody else has anything more valuable than an Amazon Mechanical Turk worker. [Edit: "valuable" is the wrong word. I guess I mean better at killing.]

For example, it seems to me you need competitiveness for any of the plausible approaches for avoiding deceptive alignment (since they require having an aligned overseer who can understand what a treacherous agent is doing)

Do you think something like IDA is the only plausible approach to alignment? If so, I hadn't realized that, and I'd be curious to hear more arguments, or just intuitions are fine. The aligned overseer you describe is supposed to make treachery impossible by recognizing it, so it seems your concern is equivalent to the concern: "any agent (we make) that learns to act will be treacherous if treachery is possible." Are all learning agents fundamentally out to get you? I suppose that's a live possibility to me, but it seems to me there is a possibility we could design an agent that is not inclined to treachery, even if the treachery wouldn't be recognized.

Edit: even so, having two internal components that are competitive with each other (e.g. overseer and overseee) does not require competitiveness with other projects.

More generally, trying to maintain a totally sanitized internal environment seems a lot harder than trying to maintain a competitive internal environment where misaligned agents won't be at a competitive advantage.

I don't understand the dichotomy here. Are you talking about the problem of how to make it hard for a debater to take over the world within the course a debate? Or are you talking about the problem of how to make it hard for a debater to mislead the moderator? The solutions to those problems might be different, so maybe we can separate the concept "misaligned" into "ambitious" and/or "deceitful", to make it easier to talk about the possibility of separate solutions.

Comment by michaelcohen on Predicted Land Value Tax: a better tax than an unimproved land value tax · 2020-05-31T13:27:11.148Z · LW · GW

So if taxes were 101% of the rental value, the price of the land (+ tax liability) would be negative, and all land would default to the government. This would be BAD. If taxes were 99% of the rental value, then I don't think this same problem happens. (Under a normal land tax, that would reduce the incentive to improve the land, but that's what all the machinery in this proposal is designed to avoid.) And even 99% is cutting it too close, because predicted land value will only be a noisy estimate of the true value. So I disagree with the aim being to collect 100% of the land's rental value. I'd say the aim is to collect as much of the land's rental value as possible, while keeping a sufficiently small fraction of land from having negative value (once the tax liability is included). I wouldn't be surprised if this ends up meaning that the government could only collect ~2/3 of the land's rental value.

Comment by michaelcohen on Predicted Land Value Tax: a better tax than an unimproved land value tax · 2020-05-29T10:18:49.655Z · LW · GW
I think this tax is fairly theoretical and un-implementable

I don't see why it's unimplementable. Do you mean politically difficult? That shouldn't detract from our ability to analyze the effects.

predicting second-order impact is not very helpful

This is a concrete way to answer the question "is it distortionary".

differential tax rates will and do shift some people and operations toward lower-tax jurisdictions

I'm imagining a federal tax that's the same everywhere.

My guess is that urbanization would slow a little bit

Can you explain why?

Comment by michaelcohen on Predicted Land Value Tax: a better tax than an unimproved land value tax · 2020-05-28T21:57:34.844Z · LW · GW

If the land is undevelopable, it doesn't really matter who does what with it. If the tax exceeds the value anyone can get out of it, it will default to the government (who will always buy land at $0). The government may not be a great land manager, but there's nothing to be done with this land anyway. If there's rural land nearby that is developable, maybe the land is actually a bit more valuable than the way it is currently being used, so it's not such a problem if the property tax is higher.

Comment by michaelcohen on Predicted Land Value Tax: a better tax than an unimproved land value tax · 2020-05-28T21:45:57.889Z · LW · GW
I will make sure to press the button before I leave.

This would be vindictive, and certainly illegal since it's their property now. I don't think the incentive to do this is any more than the incentive to burn down someone's house if they've wronged you, or at least graffiti their house.

For example, instead of planting flowers in the ground in my garden, I would cover the garden with large boxes containing some ground

Or you could just increase the value you set for your property?

(To clarify, we're talking about the bottom proposal in this comment? In the original proposal, bidders make bids on the property and the owner can choose whether or not to accept the highest one.)

And gradually everyone would learn to do so, unless they want to pay twice the land tax as their neighbors.

People aren't paying tax based on the price of their own property; they're paying based on a prediction derived from the values of their neighbors' properties.

Comment by michaelcohen on Predicted Land Value Tax: a better tax than an unimproved land value tax · 2020-05-28T21:41:48.384Z · LW · GW

Yes, sorry, if you improve your neighbors' properties, that increases your tax burden. But that's usually only a small fraction of the value of the improvement to your property.

Substitution to a lower-tax is as much distortion as the same substitution to no-tax.

Would you claim that this tax reduces urbanization? For some reason, I'm not totally sure one way or the other. I agree that would count as a distortion.

Comment by michaelcohen on Predicted Land Value Tax: a better tax than an unimproved land value tax · 2020-05-28T21:32:54.486Z · LW · GW

Well bidders bid for the property, so they'll "update" the prices by making higher or lower bids. And the predictions just use those bids as data.

Comment by michaelcohen on Predicted Land Value Tax: a better tax than an unimproved land value tax · 2020-05-28T19:57:50.875Z · LW · GW

I don't know all these words.

Comment by michaelcohen on Predicted Land Value Tax: a better tax than an unimproved land value tax · 2020-05-28T17:10:15.251Z · LW · GW

You don't need to focus on "non-taxable improvements" in this system. No improvements increase your tax burden.

You can live/work in a less valuable space, but this land gets taxed too, so it's not an *untaxed* substitute.

Comment by michaelcohen on Predicted Land Value Tax: a better tax than an unimproved land value tax · 2020-05-28T10:13:00.301Z · LW · GW

Yep, I think a far-off starting date would be required. And maybe a modest one-time redistribution of wealth toward people for whom a large fraction of their wealth is in real-estate.

Comment by michaelcohen on Predicted Land Value Tax: a better tax than an unimproved land value tax · 2020-05-28T10:09:32.233Z · LW · GW

I've only heard some things about it second-hand. But you're right I should probably read more of the literature :)

I'd say it's mostly about efficient taxing, and the fact that it's easier to exploit market forces to get an estimate of the actual value of a property than the "unimproved value" of the property.

Comment by michaelcohen on Predicted Land Value Tax: a better tax than an unimproved land value tax · 2020-05-28T10:05:55.629Z · LW · GW

I guess I'd trust competition between private insurers to set better prices than a government agency.

I don't know much about how insurers are regulated, but I think this system works alright?

Comment by michaelcohen on Predicted Land Value Tax: a better tax than an unimproved land value tax · 2020-05-28T09:58:19.824Z · LW · GW

Actually it's worse--you'd have to require people to file transactions that reduce the value of their house. Otherwise you could just sell the marble right after you've written it off to lower your tax burden. But these transactions are easy to hide, and even if they're not hidden, they might be hard to adjudicate in some circumstances, and clever people will look for the hardest-to-adjudicate cases.

Comment by michaelcohen on Predicted Land Value Tax: a better tax than an unimproved land value tax · 2020-05-28T09:54:48.595Z · LW · GW

I think I disagree with the principle that it's easier to offset tax incentives than to eliminate them. But I wouldn't take the other 100% of the time either.

For example, tax land at its improved value, and allow the cost of any improvements to be deducted against up to half the tax on the land for the entity that made the improvement.

If the tax is ever above twice the rate of returns on US Treasury bonds, no property-tax-payer will ever buy US treasury bonds. Instead, they'd buy a huge chunk of marble for a counter-top and hide it under wood. Then they deduct half the cost of the marble, and the rate of return on this risk-free investment is half the value of the property tax.

Plus, if you can only deduct half, then improvements are dis-incentivized.

Comment by michaelcohen on Predicted Land Value Tax: a better tax than an unimproved land value tax · 2020-05-28T09:45:58.679Z · LW · GW

You're definitely not misunderstanding anything. I guess I was imagining that the highest bid usually wouldn't be less than ~75% of the true property value, and this is fine. But you may be right that the highest bid could be much lower than that, if it's just not worth people's time to bid.

but then the owner would almost never want to sell through the bidding system

I guess I never clarified that I'm imagining this is the only legal way to sell property.

A few ideas:

For any given house, in any given year, there is a 1/1000 chance that the house is given to the highest bidder at the price they've bid, and the government pays the owner double that price as well (so they get triple the price).

Predictors try to predict what price the property will have when it is next sold. (And maybe the number of years of price history and the number of neighbors has to be increased.)

The government supplements long-standing bids (at say 1% a year), so if a property has had a recent inspection (which is tax-encouraged), and you can lower bound its value, you may as well put in a bid early at around that lower bound, in the hopes that it's not much more valuable than that, and by the time the owner wants to sell, your bid will be inflated for free.

Do you think any of these are workable?


Or I guess the property owner could set the price, and anyone could buy at the listed price. Absent game theory stuff, this price could be way too high, since they're not paying property tax according to this price, which is why I didn't go for this in the first place. But if people do this, they'd incur the wrath of their neighbors, whose property taxes would go up. If the social pressure isn't enough, the fear that their neighbors might retaliate by setting their own property prices way too high could encourage people to set more appropriate prices for their property. This might have costs to social cohesion... I guess you could also add a 0.1% tax on the listed value of the property.


I'm liking the last possibility more and more. I think the fairest way for a homeowner's association to make sure no one is overinflating the price of their property would be to recruit an independent consultant to estimate the value of everyone's properties, and then require that nobody set the price of their property more than, say, 30% higher than the consultant's estimate. That outcome would, of course, be excellent for the government.

Comment by michaelcohen on Predicted Land Value Tax: a better tax than an unimproved land value tax · 2020-05-27T19:16:26.181Z · LW · GW
Why land? This would seem to apply to any transferable asset.

This could work for other assets where

  • each asset has a natural peer group (in this case, other properties in the neighborhood) from which to predict the value; or the value can't change so you can just use the market price of the asset itself
  • it's hard to hide the asset
  • the asset can't be imported/exported, or you don't care if your country loses this asset. For diamonds, needlessly distortionary but not a disaster; for car manufacturing equipment, very bad.

ETA from Wei Dai:

  • there are no untaxed substitutes for the asset
Comment by michaelcohen on Predicted Land Value Tax: a better tax than an unimproved land value tax · 2020-05-27T19:05:16.655Z · LW · GW
But the broader point is that the predictions are based on the price history of the last 10 years

This is just extra information. If you don't think it will be of much additional use beyond the information about the current prices of neighboring properties, then you'd predict that the best predictors will ignore the historical data. To be sure, no one is forcing the predictors to just output the mean property value of neighbors' properties over the last 10 years.

the understanding that some places are tax bargains

If a whole neighborhood is understood to be a tax bargain, prices will go up, and so will the taxes. (Good predictors will probably focus mostly on these new prices of the neighbors' properties).

It looks to me like the same mechanism that preserves the incentive to improve your land works to exclude significant changes in land value more generally

If I build a microchip factory on empty land, the value of my property goes up by a much larger factor than the value of neighboring properties. And it is the increase in the value of neighboring properties that (roughly) determines the increase in property tax I pay. So I don't quite get to keep 100% of the value I created for myself, but I think it's close to 100%.
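A toy numeric version of the microchip-factory point (all numbers invented): your tax tracks your neighbors' values, so a big improvement to your own parcel raises your tax only through the small spillover it creates next door.

```python
# Toy numbers (invented) illustrating why the builder keeps close to 100%
# of self-created value when tax is predicted from neighbors' properties.
tax_rate = 0.5                 # tax as a fraction of predicted value (illustrative)

my_value_before, my_value_after = 100, 1_000             # factory built on my land
neighbor_before, neighbor_after = 100, 120               # modest spillover next door

extra_tax = tax_rate * (neighbor_after - neighbor_before)  # tax follows neighbors
value_created = my_value_after - my_value_before
kept_fraction = (value_created - extra_tax) / value_created
print(f"{kept_fraction:.0%}")  # → 99%
```

The exact fraction depends on how large the spillover is relative to the improvement, but as long as spillovers are a small fraction of self-created value, the owner's share stays near 100%.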

Comment by michaelcohen on Predicted Land Value Tax: a better tax than an unimproved land value tax · 2020-05-27T16:02:04.452Z · LW · GW
Would I be correct in inferring you are thinking about this largely through a computer science lens, rather than an economic one?

I don't really know what lenses are... but I am a computer scientist.

If a formerly rural area is developed, they *should* be paying taxes as if the land's only value was as rural land, at least until nearby land gets developed too. I don't understand what the industrial facility example is illustrating, unless it's the opposite of "local industry collapse". Time discontinuities like the local water supply being contaminated or a local industry collapsing aren't a problem at all, since the value of neighboring properties goes down, so the predicted value of the property probably will as well. If you mean the water supply only to that property becomes contaminated, then if it's not their fault, they can sue, and if it is their fault, their tax burden shouldn't go down.

The bids themselves are just a system by which the market value of every property can be discovered and aggregated.

Comment by michaelcohen on Predicted Land Value Tax: a better tax than an unimproved land value tax · 2020-05-27T15:50:06.566Z · LW · GW
You can fix the same problem with a simple insurance contract, though.

Yeah, see

Owners of tax liability would be required to take out insurance to limit their liability if they don’t own the property
Comment by michaelcohen on Curiosity Killed the Cat and the Asymptotically Optimal Agent · 2020-04-25T08:52:11.886Z · LW · GW

Thanks Rohin!

Comment by michaelcohen on How important are MDPs for AGI (Safety)? · 2020-03-27T20:56:28.364Z · LW · GW

Regarding regret bounds, I don't think they are realistic for an AGI, unless it queried an optimal teacher for every action (which would make it useless). In the real world, no actions are recoverable, and any time it picks an action on its own, we cannot be sure it is acting optimally.

Certainly many problems can be captured already within this simple setting.

Definitely. But I think many of the difficulties with general intelligence are not captured in the simple setting. I certainly don't want to say there's no place for MDPs.

continuous MDPs

I don't quite know what to think of continuous MDPs. I'll wildly and informally conjecture that if the state space is compact, and if the transitions are Lipschitz continuous with respect to the state, it's not a whole lot more powerful than the finite-state MDP formalism.

Second, we may be able to combine finite-state MDP techniques with an algorithm that learns the relevant features, where "features" in this case corresponds to a mapping from histories to states.

Yeah, I think there's been some good progress on this. But the upshot of those MDP techniques is mainly to not search through the same plans twice, and if we have an advanced agent that is managing to not evaluate many plans even once, I think there's a good chance that we'll get the don't-evaluate-plans-twice behavior for free.

Comment by michaelcohen on How important are MDPs for AGI (Safety)? · 2020-03-27T11:05:41.055Z · LW · GW

I don't really go into the potential costs of a finite-state-Markov assumption here. The point of this post is mostly to claim that it's not a hugely useful framework for thinking about RL.

The short answer for why I think there are costs to it is that the world is not finite-state Markov, certainly not fully observable finite state Markov. So yes, it could "remove information" by oversimplifying.

That section of the textbook seems to describe the alternative I mentioned: treating the whole interaction history as the state. It's not finite-state anymore, but you can still treat the environment as fully observable without losing any generality, so that's good. So if I were to take issue more strongly here, my issue would not be with the Markov property, but the finite state-ness.

Comment by michaelcohen on How to have a happy quarantine · 2020-03-19T13:00:24.654Z · LW · GW


Comment by michaelcohen on How to have a happy quarantine · 2020-03-18T14:18:17.326Z · LW · GW

I've purchased the expansions on Hit me up if you want to play.

Comment by michaelcohen on Curiosity Killed the Cat and the Asymptotically Optimal Agent · 2020-02-23T11:06:11.719Z · LW · GW

The simplest version of the parenting idea includes an agent which is Bayes-optimal. Parenting would just be designed to help out a Bayesian reasoner, since there's not much you can say about to what extent a Bayesian reasoner will explore, or how much it will learn; it all depends on its prior. (Almost all policies are Bayes-optimal with respect to some (universal) prior). There's still a fundamental trade-off between learning and staying safe, so while the Bayes-optimal agent does not do as bad a job in picking a point on that trade-off as the asymptotically optimal agent, that doesn't quite allow us to say that it will pick the right point on the trade-off. As long as we have access to "parents" that might be able to guide an agent toward world-states where this trade-off is less severe, we might as well make use of them.

And I'd say it's more a conclusion, not a main one.

Comment by michaelcohen on Curiosity Killed the Cat and the Asymptotically Optimal Agent · 2020-02-21T10:08:01.464Z · LW · GW

The last paragraph of the conclusion (maybe you read it?) is relevant to this.

Comment by michaelcohen on Curiosity Killed the Cat and the Asymptotically Optimal Agent · 2020-02-21T10:06:04.562Z · LW · GW

Certainly for the true environment, the optimal policy exists and you could follow it. The only thing I’d say differently is that you’re pretty sure the laws of physics won’t change tomorrow. But more realistic forms of uncertainty doom us to either forego knowledge (and potentially good policies) or destroy ourselves. If one slowed down science in certain areas for reasons along the lines of the vulnerable world hypothesis, that would be taking the “safe stance” in this trade off.

Comment by michaelcohen on Curiosity Killed the Cat and the Asymptotically Optimal Agent · 2020-02-20T21:06:19.827Z · LW · GW
How does one make even weaker guarantees of good behavior

I don't think there's really a good answer. Section 6 Theorem 4 is my only suggestion here.

Comment by michaelcohen on Curiosity Killed the Cat and the Asymptotically Optimal Agent · 2020-02-20T21:01:36.319Z · LW · GW

Well, nothing in the paper has to do with MDPs! The results are for general computable environments. Does that answer the question?

Comment by michaelcohen on What's the dream for giving natural language commands to AI? · 2020-01-04T05:38:07.995Z · LW · GW

In the scheme I described, the behavior can be described as "the agent tries to get the text 'you did what we wanted' to be sent to it." A great way to do this would be to intervene in the provision of text. So the scheme I described doesn't make any progress in avoiding the classic wireheading scenario. The second possibility I described, where there are some games played regarding how different parameters are trained (the RNN is only trained to predict observations, and then another neural network originates from a narrow hidden layer in the RNN and produces text predictions as output) has the exact same wireheading pathology too.

Changing the nature of the goal as a function of what text it sees also doesn't stop "take over world, and in particular, the provision of text" from being an optimal solution.

I still am uncertain if I'm missing some key detail in your proposal, but right now my impression is that it falls prey to the same sort of wireheading incentive that a standard reinforcement learner does.

Comment by michaelcohen on What's the dream for giving natural language commands to AI? · 2020-01-01T19:13:55.437Z · LW · GW

I don't have a complete picture of the scheme. Is it: "From a trajectory of actions and observations, an English text sample is presented with each observation, and the agent has to predict this text alongside the observations, and then it acts according to some reward function like (and this is simplified) 1 if it sees the text 'you did what we wanted' and 0 otherwise"? If the scheme you're proposing is different than that, my guess is that you're imagining a recurrent neural network architecture and most of the weights are only trained to predict the observations, and then other weights are trained to predict the text samples. Am I in the right ballpark here?
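The second guess above (recurrent weights trained only on observation prediction, with a separate text head branching off a narrow hidden layer) can be sketched as a forward pass in numpy. Everything here is illustrative: the dimensions, names, and even the architecture are my rendering of the guess, not a description of anyone's actual proposal.

```python
import numpy as np

rng = np.random.default_rng(0)

OBS, HID, NARROW, VOCAB = 8, 32, 4, 16

W_h = rng.normal(scale=0.1, size=(HID, HID))          # recurrent weights
W_x = rng.normal(scale=0.1, size=(HID, OBS))          # input weights
W_obs = rng.normal(scale=0.1, size=(OBS, HID))        # observation-prediction head
W_narrow = rng.normal(scale=0.1, size=(NARROW, HID))  # narrow bottleneck layer
W_text = rng.normal(scale=0.1, size=(VOCAB, NARROW))  # text-prediction head

def step(h, obs):
    h = np.tanh(W_h @ h + W_x @ obs)
    obs_pred = W_obs @ h                   # loss here would train the RNN itself
    text_logits = W_text @ (W_narrow @ h)  # loss here would train only the text head
    return h, obs_pred, text_logits

h = np.zeros(HID)
for _ in range(5):
    h, obs_pred, text_logits = step(h, rng.normal(size=OBS))
assert obs_pred.shape == (OBS,) and text_logits.shape == (VOCAB,)
```

The training split would be implemented by stopping gradients from `text_logits` into `W_h`, `W_x`, and `W_narrow`, so the hidden state is shaped only by observation prediction.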

Comment by michaelcohen on Build a Causal Decision Theorist · 2019-09-24T20:23:15.195Z · LW · GW

I jumped off a small cliff into a lake once, and when I was standing on the rock, I couldn't bring myself to jump. I stepped back to let another person go, and then I stepped onto the rock and jumped immediately. I might be able to do something similar.

But I wouldn't be able to endorse such behavior while reflecting on it if I were in that situation, given my conviction that I am unable to change math. Indeed, I don't think it would be wise of me to cooperate in that situation. What I really mean when I say that I would rather be someone who cooperated in a twin prisoner's dilemma is "conditioned on the (somewhat odd) hypothetical that I will at some point end up in a high-stakes twin prisoner's dilemma, I would rather it be the case that I am the sort of person who cooperates", which is really saying that I would rather play a twin prisoner's dilemma game against a cooperator than against a defector, which is just an obvious preference for a favorable event to befall me rather than an unfavorable one. In similar news, conditioned on my encountering a situation in the future where somebody checks to see if I am a good person, and if I am, they destroy the world, then I would like to become a bad person. Conditioned on my encountering a situation in which someone saves the world if I am devout, I would like to become a devout person.

If I could turn off the part of my brain that forms the question "but why should I cooperate, when I can't change math?" that would be a path to becoming a reliable cooperator, but I don't see a path to silencing a valid argument in my brain without a lobotomy (short of possibly just cooperating really fast without thinking, and of course without forming the doubt "wait, why am I trying to do this really fast without thinking?").

Comment by michaelcohen on Build a Causal Decision Theorist · 2019-09-24T04:40:20.985Z · LW · GW
If that's the case, then I assume that you defect in the twin prisoner's dilemma.

I do. I would rather be someone who didn't. But I don't see a path to becoming that person without lobotomizing myself. And it's not a huge concern of mine, since I don't expect to encounter such a dilemma. (Rarely am I the one pointing out that a philosophical thought experiment is unrealistic. It's not usually the point of thought experiments to be realistic--we usually only talk about them to evaluate the consequences of different positions. But it is worth noting here that I don't see this as a major issue for me.) I haven't written this up because I don't think it's particularly urgent to explain to people why I think CDT is correct over FDT. Indeed, in one view, it would be cruel of me to do so! And I don't think it matters much for AI alignment.
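The disagreement can be made concrete with the standard twin prisoner's dilemma payoffs (the usual textbook numbers, not from this thread): the causal counterfactual holds the twin's action fixed, under which defection dominates, while the logical counterfactual has a perfect twin mirror your choice, under which cooperation wins.

```python
# Row player's utility in the standard prisoner's dilemma payoff matrix.
C, D = "cooperate", "defect"
payoff = {(C, C): 3, (C, D): 0, (D, C): 5, (D, D): 1}

# Causal counterfactual: the twin's action is held fixed, so defection
# dominates whatever the twin does -- the CDT recommendation.
assert payoff[(D, C)] > payoff[(C, C)] and payoff[(D, D)] > payoff[(C, D)]

# Logical counterfactual: a perfect twin mirrors your choice, so you are
# choosing between the diagonal entries -- cooperation wins.
assert payoff[(C, C)] > payoff[(D, D)]
```

Both calculations are internally consistent; the dispute above is about which counterfactual it is coherent to evaluate, not about the arithmetic.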

Don't you think that's at least worth looking into?

This was partly why I decided to wade into the weeds, because absent a discussion of how plausible it is that we could affect things non-causally, yes, one's first instinct would be that we should at least look into it. And maybe, like, 0.1% of resources directed toward AI Safety should go toward whether we can change Math, but honestly, even that seems high. Because what we're talking about is changing logical facts. That might be number 1 on my list of intractable problems.

After all, CDT evaluates causal counterfactuals, which are just as much a fiction as logical counterfactuals.

This is getting subtle :) and it's hard to make sure our words mean things, but I submit that causal counterfactuals are much less fictitious than logical counterfactuals! I submit that it is less extravagant to claim we can affect this world than it is to claim that we can affect hypothetical worlds with which we are not in causal contact. No matter what action I pick, math stays the same. But it's not the case that no matter what action I pick, the world stays the same. (In the former case, which action I pick could in theory tell us something about what mathematical object the physical universe implements, but it doesn't change math.) In both cases, yes, there is only one action that I do take, but assuming we can reason both about causal and logical counterfactuals, we can still talk sensibly about the causal and logical consequences of picking actions I won't in fact end up picking. I don't have a complete answer to "how should we define causal/logical counterfactuals" but I don't think I need to for the sake of this conversation, as long as we both agree that we can use the terms in more or less the same way, which I think we are successfully doing.

I don't yet see why creating a CDT agent avoids catastrophe better than FDT.

I think running an aligned FDT agent would probably be fine. I'm just arguing that it wouldn't be any better than running a CDT agent (besides for the interim phase before Son-of-CDT has been created). And indeed, I don't think any new decision theories will perform any better than Son-of-CDT, so it doesn't seem to me to be a priority for AGI safety. Finally, the fact that no FDT agent has actually been fully defined certainly weighs in favor of just going with a CDT agent.

Comment by michaelcohen on Build a Causal Decision Theorist · 2019-09-24T00:21:09.272Z · LW · GW

Ah. I agree that this proposal would not optimize causally inaccessible areas of the multiverse, except by accident. I also think that nothing we do optimizes causally inaccessible areas of the multiverse, and we could probably have a long discussion about that, but putting a pin in that,

Let's take things one at a time. First, let's figure out how to not destroy the real world, and then if we manage that, we can start thinking about how to maximize utility in logically possible hypothetical worlds, which we are unable to have any causal influence on.

Regarding the longer discussion, and sorry if this is below my usual level of clarity: what do we have at our disposal to make counterfactual worlds with low utility inconsistent? Well, all that we humans have at our disposal is choices about actions. One can play with words, and say that we can choose not just what to do, but also who to be, and choosing who to be (i.e. editing our decision procedure) is supposed by some to have logical consequences, but I think that's a mistake. 1) Changing who we are is an action like any other. Actions don't have logical consequences, just causal consequences. 2) We might be changing which algorithm our brain executes, but we are not changing the output of any algorithm itself, the latter possibility being the thing with supposedly far-reaching (logical) consequences on hypothetical worlds outside of causal contact. In general, I'm pretty bearish on the ability of humans to change math.

Consider the CDT person who adopts FDT. They are probably interested in the logical consequences of the fact that their brain in this world outputs certain actions. But no mathematical axioms have changed along the way, so no propositions have changed truth value. The fact that their brain now runs a new algorithm implies that (the math behind) physics ended up implementing that new algorithm. I don't see how it implies much else, logically. And I think the fact that no mathematical axioms have changed supports that intuition quite well!

The question of which low-utility worlds are consistent/logically possible is a property of Math. All of math follows from axioms. Math doesn't change without axioms changing. So if you have ambitions of rendering low-utility worlds inconsistent, I guess my question is this: which axioms of Math would you like to change and how? I understand you don't hope to causally affect this, but how could you even hope to affect this logically? (I'm struggling to even put words to that; the most charitable phrasing I can come up with, in case you don't like "affect this logically", is "manifest different logic", but I worry that phrasing is Confused.) Also, I'm capitalizing Math there because this whole conversation involves being Platonists about math, where Math is something that really exists, so you can't just invent a new axiomatization of math and say the world is different now.

Comment by michaelcohen on Build a Causal Decision Theorist · 2019-09-23T23:32:10.538Z · LW · GW

You're taking issue with my evaluating the causal consequences of our choice of what program to run in the agent rather than the logical consequences? These should be the same in practice when we make an AGI, since we're not in some weird decision problem at the moment, so far as I can tell. Or if you think I'm missing something, what are the non-causal, logical consequences of building a CDT AGI?

Comment by michaelcohen on Build a Causal Decision Theorist · 2019-09-23T23:23:16.889Z · LW · GW

Side note: I think the term "self-modify" confuses us. We might as well say that agents don't self-modify; all they can do is cause other agents to come into being and shut themselves off.

The CDT agent will obviously fall prey to the problems that CDT agents face while it is active (like twin prisoner's dilemma), but after a short period of time, it won't matter how it behaves. Some better agent will be created and take over from there.

Finally, if you think an FDT agent will perform very well in this world, then you should also expect Son-of-CDT to look a lot like an FDT agent.

Comment by michaelcohen on Build a Causal Decision Theorist · 2019-09-23T22:36:27.910Z · LW · GW

Why do you say "probably"? If there exists an agent that doesn't make those wrong choices you're describing, and if the CDT agent is capable of making such an agent, why wouldn't the CDT agent make an agent that makes the right choices?