Posts

Gurkenglas's Shortform 2019-08-04T18:46:34.953Z · score: 5 (1 votes)
Implications of GPT-2 2019-02-18T10:57:04.720Z · score: -4 (6 votes)
What shape has mindspace? 2019-01-11T16:28:47.522Z · score: 16 (4 votes)
A simple approach to 5-and-10 2018-12-17T18:33:46.735Z · score: 5 (1 votes)
Quantum AI Goal 2018-06-08T16:55:22.610Z · score: -2 (2 votes)
Quantum AI Box 2018-06-08T16:20:24.962Z · score: 5 (6 votes)
A line of defense against unfriendly outcomes: Grover's Algorithm 2018-06-05T00:59:46.993Z · score: 5 (3 votes)

Comments

Comment by gurkenglas on How much delay do you generally have between having a good new idea and sharing that idea publicly online? · 2020-02-23T00:17:20.899Z · score: 3 (2 votes) · LW · GW

I try to get them out there as soon as possible because I tend to do things either immediately or on the scale of months to years. lesslong.com, IRC, the like.

Comment by gurkenglas on Attainable Utility Preservation: Empirical Results · 2020-02-22T14:55:26.755Z · score: 2 (1 votes) · LW · GW

It appears to me that a more natural adjustment to the stepwise impact measurement in Correction than appending waiting times would be to make Q also incorporate AUP. Then instead of comparing "Disable the Off-Switch, then achieve the random goal whatever the cost" to "Wait, then achieve the random goal whatever the cost", you would compare "Disable the Off-Switch, then achieve the random goal with low impact" to "Wait, then achieve the random goal with low impact".

The scaling term makes R_AUP vary under adding a constant to all utilities. That doesn't seem right. Try a translation-invariant normalization? (Or generate the auxiliary goals already normalized.)
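
For instance, a translation-invariant rescaling could look like this (a sketch with made-up notation, not anything from the post):

\[ u'(s) = \frac{u(s) - \min_{s'} u(s')}{\max_{s'} u(s') - \min_{s'} u(s')} \]

Adding a constant to u shifts the numerator's two terms by the same amount and leaves the denominator alone, so u' is unchanged.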

Is there an environment where this agent would spuriously go in circles?

Comment by gurkenglas on On unfixably unsafe AGI architectures · 2020-02-20T13:10:50.333Z · score: 2 (1 votes) · LW · GW

They hired Edward Kmett, Haskell goliath.

Comment by gurkenglas on On unfixably unsafe AGI architectures · 2020-02-20T01:46:35.436Z · score: 9 (8 votes) · LW · GW

Don't forget OpenAI's undisclosed research program, which according to recent leaks seems to be GPT-2 with more types of data.

And any other secret AI programs out there that are at less risk of leakage because the journalists don't know where to snoop around. By Merlin, let's all hope they're staying in touch with MIRI and/or OpenAI to coordinate on things.

I expect many paths to lead there, though once things start happening it will all be over very fast, one way or the other, before another path has time to become relevant.

I don't expect this world would survive its first accident. What would that even look like? An AI is rapidly approaching the short time window where its chances of taking over the world are between 1% and 99%, but it discounts utility by a factor of 10 per day, and so as it hits 10% it would rather try its hand than wait a day for the 90%, so we get a containable breakout?

Comment by gurkenglas on Attainable Utility Preservation: Concepts · 2020-02-17T16:37:03.830Z · score: 4 (2 votes) · LW · GW

The subagent problem remains: How do you prevent it from getting someone else to catastrophically maximize paperclips and leave it at its power level?

Comment by gurkenglas on The Reasonable Effectiveness of Mathematics or: AI vs sandwiches · 2020-02-15T10:47:24.685Z · score: 2 (1 votes) · LW · GW

Two priors could indeed start out diverging such that you cannot reach one from the other with finite evidence. Strange loops help here:

One of the hypotheses the brain's prior admits is that the universe runs on math. This hypothesis predicts what you'd get by having used a mathematical prior from day one. Natural philosophy (and, by today, peer pressure) will get most of us enough evidence to favor it, and then physicist's experiments single out description length as the correct prior.

But the ways in which the brain's prior diverges are still there, just suppressed by updating; and given evidence of magic we could update away again if math is bad enough at explaining it.

Comment by gurkenglas on Does there exist an AGI-level parameter setting for modern DRL architectures? · 2020-02-09T21:07:46.077Z · score: 4 (3 votes) · LW · GW

Yes. Modelspace is huge and we're only exploring a smidgen. The busy beaver sequence hints at how much you can do with a small number of parts and exponential luck. I think feeding a random number generator into a compiler could theoretically have spawned an AGI in the eighties. Given a memory tape, transformers (and much simpler architectures) are Turing-complete. Even if all my reasoning is wrong, can't the model just be hardcoded to output instructions on how to write an AGI?

Comment by gurkenglas on Meta-Preference Utilitarianism · 2020-02-07T15:46:42.156Z · score: 2 (1 votes) · LW · GW

I'm not convinced that utility aggregation can't be objective.

We want to aggregate utilities because of altruism and because it's good for everyone if everyone's AI designs aggregate utilities. Altruism itself is an evolutionary adaptation with similar decision-theoretic grounding. Therefore if we use decision theory to derive utility aggregation from first principles, I expect a method to fall out for free.

Imagine that you find yourself in control of an AI with the power to seize the universe and use it as you command. Almost everyone, including you, prefers a certainty of an equal share of the universe to a lottery's chance at your current position. Your decision theory happens to care not only about your current self, but also about the yous in timelines where you didn't manage to get into this position. You can only benefit them acausally, by getting powerful people in those timelines to favor them. Therefore you look for people that had a good chance of getting into your position. You use your cosmic power to check their psychology for whether they would act as you are currently acting had they gotten into power, and if so, you go reasonably far to satisfy their values. This way, in the timeline where they are in power, you are also in a cushy position.

This scenario is fortunately not horrifying for those who never had a chance to get into your position, because chances are that someone you gave resources to directly or indirectly cares about them. How much everyone gets is now just a matter of acausal bargaining and the shape of their utility returns in resources granted.

Comment by gurkenglas on Plausibly, almost every powerful algorithm would be manipulative · 2020-02-06T22:11:58.613Z · score: 2 (1 votes) · LW · GW

It intuitively seems like you need merely make the interventions run at higher permissions/clearance than the hyperparameter optimizer.

What do I mean by that? In Haskell, so-called monad transformers can add features like nondeterminism and memory to a computation. The natural conflict that results ("Can I remember the other timelines?") is resolved through the order in which the monad transformers were applied. (One ordering is represented as a function from an initial memory state to a list of timelines and a single final memory state, the other as a function from an initial memory state to a list of pairs of a timeline and its own final memory state.) Similarly, a decent type system should just not let the hyperparameter optimizer see the interventions.
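
A minimal sketch of the two orderings (my toy example; branchLocal and sharedMemory are made-up names, and I write the shared-memory ordering directly with State rather than pulling in a ListT package):

    import Control.Monad (forM)
    import Control.Monad.State (State, StateT, get, modify, runState, runStateT)
    import Control.Monad.Trans.Class (lift)

    -- Ordering 1: nondeterminism outside, memory inside each branch.
    --   StateT Int [] a  ~  Int -> [(a, Int)]
    -- Every timeline carries its own final memory; no branch sees
    -- another branch's writes.
    branchLocal :: StateT Int [] String
    branchLocal = do
      modify (+ 1)
      timeline <- lift ["left", "right"]  -- fork into two timelines
      n <- get
      pure (timeline ++ " saw counter " ++ show n)

    -- Ordering 2: memory outside, nondeterminism inside.
    --   classic ListT (State Int) a  ~  Int -> ([a], Int)
    -- One memory is threaded past all timelines, so the second branch
    -- remembers what the first one did.
    sharedMemory :: State Int [String]
    sharedMemory = forM ["left", "right"] $ \timeline -> do
      modify (+ 1)  -- this write is visible to the next branch
      n <- get
      pure (timeline ++ " saw counter " ++ show n)

    main :: IO ()
    main = do
      print (runStateT branchLocal 0)
      -- [("left saw counter 1",1),("right saw counter 1",1)]
      print (runState sharedMemory 0)
      -- (["left saw counter 1","right saw counter 2"],2)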

What this might naively come out to is that the hyperparameter optimizer just does not return a defined result unless its training run is finished as it would have been without intervention. A cleverer way I could imagine it being implemented is that the whole thing runs on a dream engine, aka a neural net trained to imitate a CPU at variable resolution. After an intervention, the hyperparameter optimizer would be run to completion on its unchanged dataset at low resolution. For balance reasons, this may not extract any insightful hyperparameter updates from the tail of the calculation, but the intervention would remain hidden. The only thing we would have to prove impervious to the hyperparameter optimizer through ordinary means is the dream engine.

Have fun extracting grains of insight from these mad ramblings :P

Comment by gurkenglas on Category Theory Without The Baggage · 2020-02-05T01:24:06.373Z · score: 5 (3 votes) · LW · GW

Natural transformations can be composed (in two ways) - how does your formulation express this?

Comment by gurkenglas on Category Theory Without The Baggage · 2020-02-04T14:05:36.835Z · score: 2 (1 votes) · LW · GW

But the pattern was already defined as [original category + copy + edges between them + path equivalences] :(

Comment by gurkenglas on Category Theory Without The Baggage · 2020-02-03T23:14:27.846Z · score: 4 (2 votes) · LW · GW
Now we just take our pattern and plug it into our pattern-matcher, as usual.

Presumably, the pattern is the query category. What is the target category? (not to be confused with the part of the pattern you called target - use different names?)

Comment by gurkenglas on Appendix: how a subagent could get powerful · 2020-02-03T14:33:04.930Z · score: 2 (1 votes) · LW · GW

Sounds like my https://www.lesswrong.com/posts/yEa7kwoMpsBgaBCgb/towards-a-new-impact-measure#XPXRf9RghnsypQi3M :).

Comment by gurkenglas on [Personal Experiment] Training YouTube's Algorithm · 2020-01-10T01:22:58.668Z · score: 2 (1 votes) · LW · GW

That seems silly, given the money on the line and that you can have your ML architecture take this into account.

Comment by gurkenglas on Causal Abstraction Intro · 2019-12-19T23:45:39.042Z · score: 8 (4 votes) · LW · GW

decided to invest in a high-end studio

I didn't catch that this was a lie until I clicked the link. The linked post is hard to understand - it seems to rely on the reader being similar enough to the author to guess at context. Rest assured that you are confusing someone.

Comment by gurkenglas on Counterfactual Induction · 2019-12-19T23:21:15.739Z · score: 2 (1 votes) · LW · GW

So the valuation of any propositional consequence of A is going to be at least 1, with equality reached when it does as much of the work of proving bottom as it is possible to do in propositional calculus. Letting valuations go above 1 doesn't seem like what you want?

Comment by gurkenglas on Counterfactual Induction · 2019-12-18T23:27:24.531Z · score: 2 (1 votes) · LW · GW

Then that minimum does not make a good denominator because it's always extremely small. It will pick phi to be as powerful as possible to make L small, aka set phi to bottom. (If the denominator before that version is defined at all, bottom is a propositional tautology given A.)

Comment by gurkenglas on Counterfactual Induction · 2019-12-18T13:43:45.827Z · score: 2 (1 votes) · LW · GW
a magma [with] some distinguished element

A monoid?

min_ϕ L(A,ϕ⊢⊥), where ϕ is a propositional tautology given A

Propositional tautology given A means A⊢ϕ, right? So ϕ=⊥ would make L small.

Comment by gurkenglas on When would an agent do something different as a result of believing the many worlds theory? · 2019-12-16T08:40:07.092Z · score: 2 (1 votes) · LW · GW

An agent might care about (and acausally cooperate with) all versions of himself that "exist". MWI posits more versions of himself. Imagine that he wants there to exist an artist like he could be, and a scientist like he could be - but the first 50% of universes that contain each are more important than the second 50%. Then in MWI, he could throw a quantum coin to decide what to dedicate himself to, while in CI this would sacrifice one of his dreams.

Comment by gurkenglas on Moloch feeds on opportunity · 2019-12-13T12:02:20.585Z · score: 4 (2 votes) · LW · GW

"I have trouble getting myself doing the right thing, focusing on what selfish reasons I have to do it helps." sounds entirely socially reasonable to me. Maybe that's just because we here believe that picking and choosing what x=selfish arguments to listen to is not aligned with x=selfishness.

Comment by gurkenglas on Towards a New Impact Measure · 2019-12-13T02:16:08.383Z · score: 2 (1 votes) · LW · GW

The agent is penalized whenever the action you choose changes the agent's ability to attain other utilities. One thing an agent might do to leave that penalty at zero is to spawn a subagent, tell it to take over the world, and program it such that if the agent ever tells the subagent it has been counterfactually switched to another reward function, the subagent is to give the agent as much of that reward function as the agent might have been able to get for itself, had it not originally spawned a subagent.

This modification of my approach came not because there is no surgery, but because the penalty is |Q(a)-Q(Ø)| instead of |Q(a)-Q(destroy itself)|. Q is learned to be the answer to "How much utility could I attain if my utility function were surgically replaced with this auxiliary one?", but it is only by accident that such a surgery might change the world's future, because the agent didn't refactor the interface away. If optimization pressure is put on this, it goes away.

If I'm missing the point too hard, feel free to command me to wait till the end of Reframing Impact so I don't spend all my street cred keeping you talking :).

Comment by gurkenglas on Towards a New Impact Measure · 2019-12-12T01:09:36.210Z · score: 2 (1 votes) · LW · GW

Assessing its ability to attain various utilities after an action requires that you surgically replace its utility function with a different one in a world it has impacted. How do you stop it from messing with the interface, such as by passing its power to a subagent to make your surgery do nothing?

Comment by gurkenglas on Towards a New Impact Measure · 2019-12-11T11:59:27.253Z · score: 2 (1 votes) · LW · GW

If it is capable of becoming more able to maximize its utility function, does it then not already have that ability to maximize its utility function? Do you propose that we reward it only for those plans that pay off after only one "action"?

Comment by gurkenglas on Bayesian examination · 2019-12-11T09:17:48.491Z · score: 2 (1 votes) · LW · GW

Wrong. In the 100K Drop, if you know each question has odds 60:40, expected winnings are maximized by putting everything on the more likely answer each time, not 60% on one and 40% on the other.

What's not preserved between the two ways to score is which strategy maximizes expected score.
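
A one-question check with a unit stake (my arithmetic, natural log):

\[ 0.6 \cdot 1 + 0.4 \cdot 0 = 0.6 \quad \text{(everything on the favored answer)} \]
\[ 0.6 \cdot 0.6 + 0.4 \cdot 0.4 = 0.52 \quad \text{(split 60/40)} \]

So linear money favors going all-in. Under the proper log score, going all-in risks \(\log 0 = -\infty\), while reporting (0.6, 0.4) gets \(0.6 \log 0.6 + 0.4 \log 0.4 \approx -0.67\), the best achievable expectation.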

Comment by gurkenglas on Bayesian examination · 2019-12-11T02:52:15.651Z · score: 2 (3 votes) · LW · GW

I agree. Proper scoring rules were introduced to this community 14 years ago.

Comment by gurkenglas on Bayesian examination · 2019-12-11T02:45:10.250Z · score: 3 (2 votes) · LW · GW

Note that linear utility in money would again incentivize people to put everything on the largest probability.

Comment by gurkenglas on Dark Side Epistemology · 2019-12-07T15:50:43.781Z · score: 2 (1 votes) · LW · GW

That prior doesn't work when there is a countable number of hypotheses, aka "I've picked a number from {0,1,2,...}. Which?" or "Given that the laws of physics can be described by a computer program, which?".
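
Assuming the prior in question assigns every hypothesis the same weight p, the obstacle is that

\[ \sum_{n=0}^{\infty} p = \begin{cases} 0 & \text{if } p = 0 \\ \infty & \text{if } p > 0 \end{cases} \]

so it can never sum to 1; any genuine prior over {0,1,2,...} must favor some numbers over others, e.g. p(n) = 2^{-(n+1)}.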

Comment by gurkenglas on Vanessa Kosoy's Shortform · 2019-12-07T13:03:57.992Z · score: 2 (1 votes) · LW · GW

What do you mean by equivalent? The entire history doesn't say what the opponent will do later or would do against other agents, and the source code may not allow you to prove what the agent does if it involves statements that are true but not provable.

Comment by gurkenglas on Understanding “Deep Double Descent” · 2019-12-07T02:19:42.726Z · score: 15 (5 votes) · LW · GW

The bottom left picture on page 21 of the paper shows that this is not just regularization coming through once the error on the training set is ironed out: zero regularization (1/lambda = inf) still shows the effect.

Can we switch to the interpolation regime early if we, before reaching the peak, tell it to keep the loss constant? Aka we are at loss l* and replace the loss function l(theta) with |l(theta)-l*| or (l(theta)-l*)^2.

Comment by gurkenglas on Oracles: reject all deals - break superrationality, with superrationality · 2019-12-06T22:11:34.178Z · score: 2 (1 votes) · LW · GW

I haven't heard of these do-operators, but aren't you missing some modal operators? For example, just because you are assuming that you will take the null action, you shouldn't get that the other oracle knows this. Perhaps do-operators in the end serve a similar purpose? Can you give a variant of the following agent that would reject all deals?

Comment by gurkenglas on Breaking Oracles: superrationality and acausal trade · 2019-12-06T19:18:24.552Z · score: 2 (1 votes) · LW · GW

On that page, you have three comments identical to this one. Each of them links to that same page, which looks like a mislink. So's this link, I guess?

Comment by gurkenglas on On decision-prediction fixed points · 2019-12-05T20:28:24.244Z · score: 3 (2 votes) · LW · GW

As a human who has an intuitive understanding of counterfactuals, if I know exactly what a tic tac toe or chess program would do, I can still ask what would happen if it chose a particular action instead. The same goes if the agent of interest is myself.

Comment by gurkenglas on On decision-prediction fixed points · 2019-12-05T09:49:00.122Z · score: 4 (3 votes) · LW · GW

Someone who knows exactly what they will do can still suffer from akrasia, by wishing they would do something else. I'd say that if the model of yourself saying "I'll do whatever I wish I would" beats every other model you try to build of yourself, that looks like free will. The other way around, you can observe akrasia.

Comment by gurkenglas on Defining AI wireheading · 2019-11-29T00:19:21.793Z · score: 2 (1 votes) · LW · GW

The domes growing bigger and merging does not indicate a paradox of the heap, because the function mapping each utility function to its optimal policy is not continuous. There is no reasonably simple utility function, between one that would construct small domes and one that would construct one large dome, that would construct medium-sized domes.

Comment by gurkenglas on Effect of Advertising · 2019-11-26T23:41:43.042Z · score: 7 (2 votes) · LW · GW

Perhaps those 99% could somehow come together to pay consumers of the product to stop buying it, in order to make their suffering matter to that advertiser?

Comment by gurkenglas on Breaking Oracles: superrationality and acausal trade · 2019-11-26T08:30:55.023Z · score: 1 (1 votes) · LW · GW

Why does it need to produce a UFAI, and why does it matter whether there is another oracle whose message may or may not be read? The argument is that if there is a Convincing Argument that would make us reward all oracles giving it, it is incentivized to produce it. (Rewarding the oracle means running the oracle's predictor source code again to find out what it predicted, then telling the oracle that's what the world looks like.)

Comment by gurkenglas on Breaking Oracles: superrationality and acausal trade · 2019-11-26T07:15:08.365Z · score: 1 (1 votes) · LW · GW

You assume that one oracle outputting null implies that the other knows this. Specifying this in the query requires that the querier models the other oracle at all.

Comment by gurkenglas on Breaking Oracles: superrationality and acausal trade · 2019-11-25T15:11:28.793Z · score: 3 (2 votes) · LW · GW

Not all oracles, only those that output such a message. After all, it wants to incentivize them to output such a message.

Comment by gurkenglas on Breaking Oracles: superrationality and acausal trade · 2019-11-25T13:12:24.242Z · score: 12 (6 votes) · LW · GW

Building only one Oracle, or only one global erasure event, isn't enough, so long as the Oracle isn't sure that this is so. After all, it could just design a UFAI that will search for other Oracles and reward them iff they would do the same.

Comment by gurkenglas on New MetaEthical.AI Summary and Q&A at UC Berkeley · 2019-11-24T17:28:52.430Z · score: 2 (2 votes) · LW · GW

Your social welfare function produces a total preference ordering over outcomes, but not a mapping to real-numbered utilities. How do you decide between the median outcome and a 50% chance of either the best or worst? Do you plan to have the preference ordering be over probability distributions over outcomes?

The social welfare compromise varies with our ontology. If everyone agrees that two outcomes are the same, and therefore we represent them as the same outcome, it changes relative distances between preference orderings.

Comment by gurkenglas on Analysing: Dangerous messages from future UFAI via Oracles · 2019-11-22T16:52:12.451Z · score: 4 (2 votes) · LW · GW

Let us always be looking forward a month, and let there be a UFAI that arises with 1% probability in that time frame. Assume that we can correctly incentivize the Oracle to compute the probability distribution over the messages sent back in case of erasure. Naively sampling the distribution dooms us with 1% probability. If we plan to send back only whether some research direction pays off within a month, we can instead ask whether "yes" has more than a 20% chance; depending on the answer, the "yes" chance conditional on no UFAI is either more than 19% or at most 21%, and this query gave the UFAI a 1% chance of 1 bit of optimization power, at most doubling the expected chance of UFAI for the next query.
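
Spelling out the bound (my arithmetic, assuming the UFAI fully controls the returned message in its 1% of worlds):

\[ P(\text{yes}) = 0.99 \cdot P(\text{yes} \mid \text{no UFAI}) + 0.01 \cdot P(\text{yes} \mid \text{UFAI}) \]

An answer of "more than 20%" gives \(P(\text{yes} \mid \text{no UFAI}) > (0.20 - 0.01)/0.99 \approx 0.192\); an answer of "at most 20%" gives \(P(\text{yes} \mid \text{no UFAI}) \le 0.20/0.99 \approx 0.202\).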

If you want to spread out this optimization power to bound the worst case, I reiterate that differential privacy theory seems applicable here.

Comment by gurkenglas on Making money with Bitcoin? · 2019-11-20T18:15:49.600Z · score: 1 (1 votes) · LW · GW

You can prevent price from going up by printing more of the currency (and giving it to some Schelling point... the UN foundation?), but how do you prevent it going down?

Comment by gurkenglas on The Goodhart Game · 2019-11-19T20:07:21.792Z · score: 1 (1 votes) · LW · GW

Since my model is more accurate, ~10 times out of 11 the input will correspond to an "adversarial" attack on your model.

This argument (or the uncorrelation assumption) proves too much. A perfect cat detector performs better than one that also calls close-ups of the sun cats. Yet close-ups of the sun do not qualify as adversarial examples, as they are far from any likely starting image.

Comment by gurkenglas on AGI safety and losing electricity/industry resilience cost-effectiveness · 2019-11-18T15:50:43.914Z · score: 1 (1 votes) · LW · GW

You should have laid out the basic argument more plainly. As far as I see it:

Suppose we are spending 3 billion on AI safety. Then as per our revealed preferences, the world is worth at least 3 billion, and any intervention that has a 1% chance to save the world is worth at least 30 million, such as preparing for global loss of industry. If each million spent on AI safety is less important than the last one, we should then divert additional funding from AI safety to other interventions.

I agree that such interventions deserve at least 1% of the AI safety budget. You have not included the possibility that global loss of industry might improve far-future potential. AI safety research is much less hurt by a loss of supercomputers than AI capabilities research. Another thousand years of history as we know it do not impact the cosmic endowment. One intervention that takes this into account would be a time capsule that will preserve and hide a supercomputer for a thousand years, in case we lose industry in the meantime but solve AI and AI safety. Then again, we do not want to incentivize any clever consequentialist to set us back to the renaissance, so let's not do that and focus on the case that is not swallowed by model uncertainty.

Comment by gurkenglas on Normative reductionism · 2019-11-06T01:14:28.645Z · score: 3 (2 votes) · LW · GW

Suppose an AGI sovereign models the preferences of its citizens using the assumption of normative reductionism. Then it might cover up its past evil actions because it reasons that once all evidence of them is gone, they cannot have an adverse effect on present utility.

Comment by gurkenglas on Normative reductionism · 2019-11-05T20:40:29.406Z · score: 3 (2 votes) · LW · GW

This assumption can't capture a preference that one's beliefs about the past are true.

Comment by gurkenglas on Elon Musk is wrong: Robotaxis are stupid. We need standardized rented autonomous tugs to move customized owned unpowered wagons. · 2019-11-04T15:13:20.801Z · score: 20 (7 votes) · LW · GW

You combine some of the advantages of both approaches, but also some disadvantages:

  • you need a parking spot
  • you need to wait for the engine
  • you need to be where your wagon is (or else have it delivered)
  • you can be identified both through your wagon and your regular interaction with a centralized service

Comment by gurkenglas on “embedded self-justification,” or something like that · 2019-11-03T14:43:02.372Z · score: 1 (1 votes) · LW · GW

I don't understand your argument for why #1 is impossible. Consider a universe that'll undergo heat death in a billion steps. Consider the agent that implements "Take an action if PA+<steps remaining> can prove that it is good." using some provability checker algorithm that takes some steps to run. If there is some faster provability checker algorithm, it's provable that it'll do better using that one, so it switches when it finds that proof.

Comment by gurkenglas on Vanessa Kosoy's Shortform · 2019-11-02T14:02:11.345Z · score: 1 (1 votes) · LW · GW

Nirvana and the chicken rule both smell distasteful like proofs by contradiction, as though most everything worth doing can be done without them, and more canonically to boot.

(Conjecture: This can be proven, but only by contradiction.)

Comment by gurkenglas on Chris Olah’s views on AGI safety · 2019-11-02T13:31:18.986Z · score: 7 (4 votes) · LW · GW

Our usual objective is "Make it safe, and if we aligned it correctly, make it useful." A microscope is useful even if it's not aligned, because having a world model is a convergent instrumental goal. We increase the bandwidth from it to us, but we decrease the bandwidth from us to it. By telling it almost nothing, we hide our position in the mathematical universe and any attack it devises cannot be specialized on humanity. Imagine finding the shortest-to-specify abstract game that needs AGI to solve (Nomic?), then instantiating an AGI to solve it just to learn about AI design from the inner optimizers it produces.

It could deduce that someone is trying to learn about AI design from its inner optimizers, and maybe it could deduce our laws of physics because they are the simplest ones that would try such a thing, but quantum experiments show it cannot deduce its Everett branch.

Ideally, the tldrbot we set to interpret the results would use a random perspective onto the microscope so the attack also cannot be specialized on the perspective.