## Posts

Gurkenglas's Shortform 2019-08-04T18:46:34.953Z · score: 5 (1 votes)
Implications of GPT-2 2019-02-18T10:57:04.720Z · score: -4 (6 votes)
What shape has mindspace? 2019-01-11T16:28:47.522Z · score: 16 (4 votes)
A simple approach to 5-and-10 2018-12-17T18:33:46.735Z · score: 5 (1 votes)
Quantum AI Goal 2018-06-08T16:55:22.610Z · score: -2 (2 votes)
Quantum AI Box 2018-06-08T16:20:24.962Z · score: 5 (6 votes)
A line of defense against unfriendly outcomes: Grover's Algorithm 2018-06-05T00:59:46.993Z · score: 5 (3 votes)

Comment by gurkenglas on Does Agent-like Behavior Imply Agent-like Architecture? · 2019-08-23T13:56:50.496Z · score: 5 (2 votes) · LW · GW

Conjecture: Every short proof of agentic behavior points out agentic architecture.

Comment by gurkenglas on What are the reasons to *not* consider reducing AI-Xrisk the highest priority cause? · 2019-08-21T12:18:31.680Z · score: 4 (2 votes) · LW · GW

Some of the same human moral heuristics that care about the cosmic endowment also diverge when contemplating an infinite environment. Therefore, someone who finds that the environment is infinite might exclude such heuristics from their aggregate and come to care less about what happens regarding AI than, say, their friends and family.

Comment by gurkenglas on Prokaryote Multiverse. An argument that potential simulators do not have significantly more complex physics than ours · 2019-08-20T22:19:04.751Z · score: 1 (1 votes) · LW · GW

I'd say that the only way to persuade someone using epistemology A of epistemology B is to show that A endorses B. Humans have a natural epistemology that can be idealized as a Bayesian prior of hypotheses being more or less plausible interacting with worldly observations. "The world runs on math." starts out with some plausibility, and then quickly drowns out its alternatives given the right evidence. Getting to Solomonoff Induction is then just a matter of ruling out the alternatives, like a variant of Occam's razor which counts postulated entities. (That one is ruled out because it forbids postulating galaxies made of billions of stars.)

In the end, our posterior is still human-specializing-to-math-specializing-to-Solomonoff. If we find some way to interact with uncomputable entities, we will modify Solomonoff to not need to run on Turing machines. If we find that Archangel Uriel ported the universe to a more stable substrate than divine essence in 500 BC, we will continue to function with only slight existential distress.

Comment by gurkenglas on Distance Functions are Hard · 2019-08-14T02:03:11.087Z · score: 1 (1 votes) · LW · GW

Is this the same value payload that makes activists fight over language to make human biases work for their side? I don't think this problem translates to AI: If the AGIs find that some metric induces some bias, each can compensate for it.

Comment by gurkenglas on Distance Functions are Hard · 2019-08-13T22:07:45.254Z · score: 5 (3 votes) · LW · GW

I'm not convinced conceptual distance metrics must be value-laden. Represent each utility function by an AGI. Almost all of them should be able to agree on a metric such that each could adopt that metric in its thinking losing only negligible value. The same could not be said for agreeing on a utility function. (The same could be said for agreeing on a utility-parametrized AGI design.)

Comment by gurkenglas on Could we solve this email mess if we all moved to paid emails? · 2019-08-12T21:39:34.565Z · score: 13 (12 votes) · LW · GW

Instead of email, we could use Less Wrong direct messages with karma instead of money. Going further, we could set up a karma-based prediction market on what score posts will reach, and use its predictions to set post visibility. Compare Stack Exchange's bounty system.

Comment by gurkenglas on Could we solve this email mess if we all moved to paid emails? · 2019-08-12T21:34:32.918Z · score: 3 (2 votes) · LW · GW

Spam dominating the platform is fine, because you are expected to sort by money attached, and read only until you stop being happy to take people's money.

If your contacts do not value your responses more than corporations do, that actually sounds like a fine Schelling point for choosing between direct research participation and earning to donate.

If you feel that a contact's question was intellectually stimulating, you can just send them back some or all of the fee to incentivize them sending you such.

Comment by Gurkenglas on [deleted post] 2019-08-12T19:06:01.802Z

This seems as misled as arguing that any AGI will obviously be aligned because to turn the universe into paperclips is stupid. We can conceivably build an AGI that is aware that humans are self-contradictory and illogical, and therefore won't assume that they are rational because it knows that that would make it misaligned. We can do at least as well as an overseer that intervenes on needless death and suffering as it would happen.

Comment by gurkenglas on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-08-12T11:28:15.158Z · score: 1 (1 votes) · LW · GW

In the event of erasure, randomly decide how many resources to allocate to preventing an attack this week. Ask the Oracle to predict the probability distribution over given advice. Compare to the hardcoded distribution to deduce attack severity and how much budget to allocate.

Purchase erasure insurance to have enough counterfactual power to affect even global physics experiments. Finding trustworthy insurers won't be a problem, because, like, we have an Oracle.

Is even more power than the market has needed? Ask the Oracle "How likely is a randomly selected string to prove P=NP constructively and usefully?". If this number is not superexponentially close to 0, define erasure from now on as a random string winning the P=NP lottery. Then we will always counterfactually have as much power as we need. Perhaps this one is too much power, because even our Oracle might have trouble viewing a P=NP singularity.

Comment by gurkenglas on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-08-11T23:55:16.113Z · score: 4 (2 votes) · LW · GW

You plan to reward the Oracle later in accordance with its prediction. I suggest that we immediately reward the Oracle as if there would be an attack, then later, if we are still able to do so, reward the Oracle by the difference between the reward in case of no attack and the reward in case of attack.
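A minimal sketch of that two-stage payout, to make the arithmetic explicit. The function name and its parameters are hypothetical illustrations, not part of the contest setup; it only assumes rewards are plain numbers.

```python
def oracle_payout(reward_if_attack, reward_if_no_attack, attack_happened, can_still_pay):
    """Split the Oracle's reward into an immediate part and a later top-up.

    All names here are illustrative assumptions, not taken from the post.
    """
    # Stage 1: pay now, as if the attack were certain to happen.
    immediate = reward_if_attack
    # Stage 2: only if there was no attack and we are still able to pay,
    # add the difference between the two cases.
    if not attack_happened and can_still_pay:
        later = reward_if_no_attack - reward_if_attack
    else:
        later = 0.0
    return immediate, later


def total_reward(reward_if_attack, reward_if_no_attack, attack_happened, can_still_pay):
    immediate, later = oracle_payout(
        reward_if_attack, reward_if_no_attack, attack_happened, can_still_pay
    )
    return immediate + later
```

Under these assumptions the total matches the intended reward both when there is no attack and when an attack has removed our ability to pay later.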

Comment by gurkenglas on European Community Weekend 2019 · 2019-08-11T18:09:30.676Z · score: 1 (1 votes) · LW · GW

Is the ticket price for costs of food and a place to sleep, or to limit the number of attendees? Because I live in Berlin and am interested in probably merely visiting.

Comment by gurkenglas on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-08-11T15:56:53.480Z · score: 1 (1 votes) · LW · GW

It should be possible to defend the Oracle against humans and physics so long as its box self-destructs in case of erasure and subsequent tampering, therefore giving the Oracle whatever reward was last set to be the default.

The counterfactual Oracle setting as a whole seems to assume that the viewed future is not engineered by a future AI to resemble whatever would make the Oracle bring that future about, so you should be fine falling to AGI.

Comment by gurkenglas on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-08-09T12:23:38.011Z · score: 3 (2 votes) · LW · GW

You have to tell the AI how to find out how well it has done. To ask "What is a good definition of 'good'?", you already have to define good. At least if we ever find a definition of good, we can ask an AI with it to judge it.

Comment by gurkenglas on Understanding Recent Impact Measures · 2019-08-07T19:33:53.668Z · score: 1 (1 votes) · LW · GW

For purposes of the linked submission, we get the correct utility function after the hypercomputer is done running. Ideally, we would save each utility function's preferred AI and then select the one that was preferred by the correct one, but we only have 1TB of space. Therefore we get almost all of them to agree on a parametrized solution.

AUP over all utility functions might therefore make sense for a limited environment, such as the interior of a box?

Do you mean that utility functions that do not want the AI to seize power at much cost are rare enough that the terabyte that has the highest approval rating will not be approved by a human utility function?

Comment by gurkenglas on Understanding Recent Impact Measures · 2019-08-07T17:21:04.778Z · score: 1 (1 votes) · LW · GW

I'm gonna assume the part you don't follow is how the two are related.

My submission attempts to maximize the fraction of utility functions that the produced AI can satisfy, in hopes that the human utility function is among them.

Attainable utility preservation attempts to maximize the fraction of utility functions that can be satisfied from the produced state, in hopes that human satisfaction is not ruled out.

Comment by gurkenglas on Understanding Recent Impact Measures · 2019-08-07T14:52:41.737Z · score: 1 (1 votes) · LW · GW
Comment by gurkenglas on Power Buys You Distance From The Crime · 2019-08-06T13:53:41.227Z · score: 3 (4 votes) · LW · GW

You mean, we mistake theorists are not in perpetual conflict with conflict theorists, they are just making a mistake? O_o

Comment by gurkenglas on AI Alignment Open Thread August 2019 · 2019-08-05T14:44:58.202Z · score: 1 (1 votes) · LW · GW

Assume that given a hypercomputer and the magic utility function m, we could build an FAI F(m). Every TB of data encodes some program A(u) that takes a utility function u as input. For all A and u, ask F(u) if A(u) is aligned with F(u). (We must construct F not to vote strategically here.) Save that A' which gets approved by the largest fraction of F(u). Sanity check that this maximum fraction is very close to 1. Run A'(m).
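The selection loop above can be sketched on a toy, finite scale. This is only an illustration under loud assumptions: `approves(u, program)` is a hypothetical stand-in for asking F(u) whether A(u) is aligned with it, the utility functions are a small finite list rather than all of them, and the toy candidates at the bottom are invented for the example.

```python
def select_program(programs, utilities, approves, min_fraction=0.99):
    """Return the candidate approved by the largest fraction of utility functions.

    programs  -- candidate parametrized programs A (toy stand-ins)
    utilities -- finite sample of utility functions u
    approves  -- approves(u, program): hypothetical stand-in for F(u)'s honest vote
    """
    best, best_fraction = None, -1.0
    for program in programs:
        votes = sum(1 for u in utilities if approves(u, program))
        fraction = votes / len(utilities)
        if fraction > best_fraction:
            best, best_fraction = program, fraction
    # Sanity check: the winning fraction should be very close to 1.
    if best_fraction < min_fraction:
        raise ValueError("no candidate is approved by almost all utility functions")
    return best, best_fraction


# Toy instance: a "utility function" is just a target number, and a program is
# approved by u when it outputs u's target after being given u as input.
def ignore_u(u):
    return 0

def serve_u(u):
    return u

toy_utilities = list(range(100))
winner, fraction = select_program(
    [ignore_u, serve_u], toy_utilities, lambda u, program: program(u) == u
)
# The last step of the comment would then run the winner on the magic utility function m.
```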

Comment by gurkenglas on Gurkenglas's Shortform · 2019-08-04T18:54:48.278Z · score: 5 (4 votes) · LW · GW

Hearthstone has recently released Zephrys the Great, a card that looks at the public gamestate and gives you a choice between three cards that would be useful right now. You can see it in action here. I am impressed by the diversity of the choices it gives. An advisor AI that seems friendlier than Amazon's/Youtube's recommendation algorithm, because its secondary optimization incentive is fun, not money!

Could we get them to opensource the implementation so people could try writing different advisor AIs to use in the card's place for, say, their weekly rule-changing Tavern Brawls?

Comment by gurkenglas on Gurkenglas's Shortform · 2019-08-04T18:46:35.281Z · score: 5 (3 votes) · LW · GW

OpenAI has a 100x profit cap for investors. Could another form of investment restriction reduce AI race incentives?

The market selects for people that are good at maximizing money, and care to do so. I'd expect there are some rich people who care little whether they go bankrupt or the world is destroyed.

Such a person might expect that if OpenAI launches their latest AI draft, either the world is turned into paperclips or all investors get the maximum payoff. So he might invest all his money in OpenAI and pressure OpenAI (via shareholder swing voting or less regulated means) to launch it.

If OpenAI said that anyone can only invest up to a certain percentage of their net worth in OpenAI, such a person would be forced to retain something to protect.

Comment by gurkenglas on Practical consequences of impossibility of value learning · 2019-08-03T11:11:06.893Z · score: 1 (1 votes) · LW · GW

Obviously misery would be avoided because it's bad, not the other way around. We are trying to figure out what is bad by seeing what we avoid. And the problem remains whether we might be accidentally avoiding misery, while trying to avoid its opposite.

Comment by gurkenglas on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-08-02T03:54:50.029Z · score: 1 (1 votes) · LW · GW

Low bandwidth Oracle submission: I would be interested in a log scale graph of the Bayesian score of the Solomonoff prior trying to sequence predict our records of history. It should get flatter over time as worse hypotheses get discarded. If it is linear after a very short time, that looks like it figured out the laws of the universe and is simulating it. If it stays convex for a while, that looks like it is using models to approximate history, because then it takes longer to sort the false from the true. If it is flatter during the cold war, that means it learned an anthropic bias toward nuclear war not happening.

Comment by gurkenglas on Arguments for the existence of qualia · 2019-07-31T14:26:18.659Z · score: 1 (1 votes) · LW · GW

The knowledge engine in our brain produces knowledge of that magic. That means that either that magic affects the knowledge engine, or the knowledge engine is magic.

Comment by gurkenglas on Arguments for the existence of qualia · 2019-07-28T13:12:09.047Z · score: 4 (3 votes) · LW · GW

We can care about feelings regardless of what they're made of. If feelings were made of more than atoms, I would expect neurologists to have found some evidence of magic in the brain. I don't deny that subjective experience exists, I just think it's made of atoms.

Due to quantum physics, there is a conceivable experiment that tells whether two electrons are identical in all their properties.

Comment by gurkenglas on The Real Rules Have No Exceptions · 2019-07-23T12:40:28.254Z · score: 1 (1 votes) · LW · GW

re your footnote: The explicit version of your judgement allows you an override, yet by construction you will never take it. So the crux behind whether the versions are semantically the same is whether we define rules to allow or disallow actions, or timelines.

Comment by gurkenglas on What questions about the future would influence people’s actions today if they were informed by a prediction market? · 2019-07-21T10:13:14.362Z · score: 0 (8 votes) · LW · GW

You bet on this by taking out a loan.

Comment by gurkenglas on Normalising utility as willingness to pay · 2019-07-19T19:03:51.567Z · score: 1 (1 votes) · LW · GW

(Even if it works, you'll never abuse it, because you never getting around to abusing it is much more probable than doing it and surviving.)

Comment by gurkenglas on Normalising utility as willingness to pay · 2019-07-19T10:17:29.749Z · score: 2 (2 votes) · LW · GW

They could be merely aliens with their supertelescopes trained on us, with their planet rigged to explode if the observation doesn't match the winning bid, abusing quantum immortality.

Comment by gurkenglas on Thoughts on the 5-10 Problem · 2019-07-18T23:57:35.489Z · score: 2 (2 votes) · LW · GW

Edit: So the reason we don't get the 5-and-10 problem is that we don't get ☐(☐(A=5=>U=5 /\ A=10=>U=0) => (A=5=>U=5) /\ A=10=>U=0), because ☐ doesn't have A's source code as an axiom. Okay. (Seems like this solves the problem by reintroducing a cartesian barrier by which we can cleanly separate the decision process from all else.) (My own favorite solution A = argmax_a(max_u(☐(A=a=>U>=u))) also makes ☐ unable to predict the outcome of A's sourcecode, because ☐ doesn't know it won't find a proof for A=10=>U>=6.)

Comment by gurkenglas on Normalising utility as willingness to pay · 2019-07-18T22:38:38.698Z · score: 9 (3 votes) · LW · GW

It seems to me that how to combine utility functions follows from how you then choose an action.

Let's say we have 10 hypotheses and we maximize utility. We can afford to let each hypothesis rule out up to a tenth of the action space as extremely negative in utility, but we can't let a hypothesis assign extremely positive utility to any action. Therefore we sample about 9 random actions (which partition action space into 10 pieces) and translate the worst of them to 0, then scale the maximum over all actions to 1. (Or perhaps, we set the 10th percentile to 0 and the hundredth to 1.)

Let's say we have 10 hypotheses and we sample a random action from the top half. Then, by analogous reasoning, we sample 19 actions, normalize the worst to 0 and the best to 1. (Or perhaps set the 5th to 0 and the 95th to 1. Though then it might devolve into a fight over who can think of the largest/smallest number on the fringes...)

The general principle is giving the daemon as much slack/power as possible while bounding our proxy of its power.

Comment by gurkenglas on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-07-06T13:58:22.915Z · score: 1 (1 votes) · LW · GW

Mr Armstrong has specified that (ctrl-f) "other Oracles can't generally be used on the same problem".

Comment by gurkenglas on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-07-06T12:24:04.387Z · score: 3 (3 votes) · LW · GW

Tell it to prove: That proof is shortest and this proof is shortest.

Comment by gurkenglas on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-07-04T14:40:16.891Z · score: 2 (2 votes) · LW · GW

Fix a measure over strings, such as one that distributes mass geometrically over string lengths, then uniformly across strings of the same length.

Let L be a string set for which we assume that random outputs from any cth portion of L are safe. Any safe low bandwidth Oracle query trivially works for this with L={0,1}^bandwidth and c=2^bandwidth, but this also lets us examine certificates for any low bandwidth Oracle query like "Is this theorem provable?" or "Does there exist an AI with this safety property?".

Then we make the Oracle prove its own safety given the above assumption, and sample an answer.

Edit: Equivalently, and perhaps easier on our intuition, make the Oracle produce a program, make the Oracle prove the program's safety, and sample from the program's outputs. Any feedback or questions?

Edit: This generalizes my elsethread idea of minimizing proof length to make the answer unique. If the measure assigns half the remainder to each string length and L is all stubs starting with some prefix, c=2 will let it submit just the prefix.
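A small worked check of the c=2 claim, assuming binary strings and the specific measure mentioned here: length n gets mass 2^-(n+1) (half the remainder each time), split uniformly over the 2^n strings of that length. The function names are made up for this sketch.

```python
from fractions import Fraction


def string_measure(s):
    """Mass of one binary string: half the remaining mass goes to each length,
    then it is shared uniformly among the 2**len(s) strings of that length."""
    n = len(s)
    return Fraction(1, 2 ** (n + 1)) * Fraction(1, 2 ** n)


def prefix_set_mass(prefix):
    """Total mass of L = all binary strings starting with `prefix`.

    With k = len(prefix), a string of length n >= k lies in L with probability
    2**-k within its length class, so the mass is
    sum_{n >= k} 2**-(n+1) * 2**-k = 2**-k * 2**-k = 4**-k.
    """
    k = len(prefix)
    return Fraction(1, 4 ** k)


def share_of_prefix(prefix):
    """Fraction of L's mass that sits on the bare prefix itself."""
    return string_measure(prefix) / prefix_set_mass(prefix)
```

Under these assumptions the bare prefix always carries exactly half of L's mass, so a point distribution on the prefix stays within c=2 times the renormalized prior.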

We essentially assume that the safety assumption behind quantilization is strong enough to hold up to adversarial optimization over preference orderings. At the same time, this seems to give the Oracle as much slack/power as possible while preserving our "bits of optimization pressure" proxy of its power.

... you're going to just keep watching how far I can develop this model without feedback, aren't you? :(

Edit: One example for an L where we worry that even though concerted human effort might land us at a safe element, randomization won't, is programs that well predict a sequence.

By the Curry-Howard correspondence, the set of proofs for some theorems might also be malevolent!

Comment by gurkenglas on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-07-04T12:27:25.312Z · score: 1 (1 votes) · LW · GW

Submission to reduce the bandwidth of an oracle query whose set of possible answers we can mathematically define (though perhaps not check).

Make the Oracle prove (to a trusted dumb proof checker program) that its answer distribution is upper bounded by the default prior over strings, restricted to the set of possible answers, then renormalized to some constant c. Then let it answer.

c=1 only works if the Oracle has mastered the set of answers, but only lets it exert one bit of optimization pressure by deciding whether it wants us to receive an answer. Each doubling of c makes its job easier but gives it an extra bit if it does answer.

If the Oracle can cooperate with itself across episodes, running this repeatedly with escalating c until it works is of course disastrous, as it uses each bit of optimization pressure directly to make us give it more. Even aborting after the first try to make a better system may have it acausally cooperate with whatever AI conquers the world because we couldn't make the Oracle answer, but this outcome is hardly worse than not having run the Oracle.

Comment by gurkenglas on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-07-04T11:56:38.906Z · score: 5 (4 votes) · LW · GW

We ought to be able to build 4) without AI, merely by encoding our rules of logic. The output of your system would be lower bandwidth if you make the object looked for be more unique, such as the shortest proof for the given theorem. The system would be required to prove to the mundane checker that the proof is shortest, and humans would never see the minimality proof.

Comment by gurkenglas on What's state-of-the-art in AI understanding of theory of mind? · 2019-07-04T11:30:37.402Z · score: 4 (3 votes) · LW · GW

Does the brute-force minimax algorithm for tic tac toe count? Would a brute-force minimax algorithm for chess count? How about a neural net approximation like AlphaZero?

Comment by gurkenglas on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-07-02T09:08:00.993Z · score: 1 (1 votes) · LW · GW

Even if you can specify that it tries to minimize that distance, it can make the answer to any query be a convincing argument that the reader should return this same convincing argument. That way, it scores perfectly on every inner node.

Comment by gurkenglas on Contest: $1,000 for good questions to ask to an Oracle AI · 2019-07-02T08:43:15.794Z · score: 1 (1 votes) · LW · GW

Submission for the low bandwidth Oracle: Ask it to convince a proof checker that it is in fact trying to maximize the utility function we gave it, aka it isn't pseudo-aligned. If it can't, it has no influence on the world. If it can, it'll presumably try to do so. Having a safe counterfactual Oracle seems to require that our system not be pseudo-aligned.

Comment by gurkenglas on Conceptual Problems with UDT and Policy Selection · 2019-07-01T14:24:07.780Z · score: 1 (1 votes) · LW · GW

If I didn't assume PA is consistent, I would swerve because I wouldn't know whether UDT might falsely prove that I swerve. Since PA is consistent and I assume this, I am in fact better at predicting UDT than UDT is at predicting itself, and it swerves while I don't. Can you find a strategy that beats UDT, doesn't disentangle its opponent from the environment, swerves against itself and "doesn't assume UDT's proof system is consistent"?

It sounds like you mentioned logical updatelessness because my version of UDT does not trust a proof of "u = ...", it wants the whole set of proofs of "u >= ...". I'm not yet convinced that there are any other proofs it must not trust.

Comment by gurkenglas on Conceptual Problems with UDT and Policy Selection · 2019-07-01T13:09:00.565Z · score: 1 (1 votes) · LW · GW

I don't know what logical updatelessness means, and I don't see where the article describes this, but I'll just try to formalize what I describe, since you seem to imply that would be novel.

Let . Pitted against itself in modal combat, it'll get at least the utility of (C,C) because "UDT = C => utility is that of (C,C)" is provable. In chicken, UDT is "swerve unless the opponent provably will". UDT will swerve against itself (though that's not provable), against the madman "never swerve" and against "swerve unless the opponent swerves against a madman" (which requires UDT's opponent to disentangle UDT from its environment). UDT won't swerve against "swerve if the opponent doesn't" or "swerve if it's provable the opponent doesn't".

Attempting to exploit the opponent sounds to me like "self-modify into a madman if it's provable that that will make the opponent swerve", but that's just UDT.

Comment by gurkenglas on Conceptual Problems with UDT and Policy Selection · 2019-07-01T00:14:18.412Z · score: 4 (2 votes) · LW · GW

For any strategy in modal combat, there is another strategy that tries to defect exactly against the former.

Comment by gurkenglas on Conceptual Problems with UDT and Policy Selection · 2019-07-01T00:11:00.624Z · score: 4 (2 votes) · LW · GW

I think other agents cannot exploit us thinking more. UDT swerves against a madman in chicken. If a smart opponent knows this and therefore becomes a madman to exploit UDT, then UDT is playing against a madman that used to be smart, rather than just a madman. This is a different case because the decision UDT makes here influences both the direct outcome and the thought process of the smart opponent. By deliberately crashing into the formerly smart madman, UDT can retroactively erase the situation.

Comment by gurkenglas on Aligning a toy model of optimization · 2019-06-29T02:12:37.389Z · score: 6 (4 votes) · LW · GW

Use Opt to find a language model. The hope is to make it imitate a human researcher's thought process fast enough that the imitation can attempt to solve the AI alignment problem for us.

Use Opt to find a proof that generating such an imitation will not lead to a daemon's treacherous turn, as defined by the model disagreeing in its prediction from a large enough Solomonoff fraction of its competitors. The hope is that the consequentialist portion of the hypothesis space is not large and cooperative/homogenous enough to form a single voting block that bypasses the daemon alarm.

Comment by gurkenglas on Embedded Agency: Not Just an AI Problem · 2019-06-27T14:06:15.701Z · score: 1 (1 votes) · LW · GW

One way that comes to mind is to use the constructive VNM utility theorem proof. The construction is going to be approximate because the system's rationality is. So next things to study include in what way the rationality is approximate, and how well this and other constructions preserve this (and other?) approximations.

Comment by gurkenglas on Only optimize to 95 % · 2019-06-25T23:39:51.958Z · score: 6 (4 votes) · LW · GW

A random action that ranks in the top 5% is not the same as the action that maximizes the chance that you will end up at least 95% certain the cauldron is full.
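To make the distinction concrete, here is a rough sketch of the two decision rules side by side. Both functions and their parameter names are illustrative assumptions; `score` and `chance_of_reaching_threshold` stand in for whatever evaluation the agent actually has.

```python
import random


def random_top_fraction_action(actions, score, top_fraction=0.05, rng=random):
    """Pick uniformly at random among the best-scoring `top_fraction` of actions."""
    ranked = sorted(actions, key=score, reverse=True)
    k = max(1, round(len(ranked) * top_fraction))
    return rng.choice(ranked[:k])


def threshold_chance_maximizer(actions, chance_of_reaching_threshold):
    """Pick the single action most likely to end at >= 95% certainty.

    This is still a full maximizer, just over a different objective.
    """
    return max(actions, key=chance_of_reaching_threshold)


# Toy example: 100 actions, scored by their own index.
toy_actions = list(range(100))
sampled = random_top_fraction_action(toy_actions, score=lambda a: a)
maximal = threshold_chance_maximizer(toy_actions, lambda a: a / 100)
```

The first rule leaves the choice among the top few actions to chance; the second always returns the most extreme action available for its objective.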

Comment by gurkenglas on A simple approach to 5-and-10 · 2019-06-24T16:14:22.222Z · score: 2 (2 votes) · LW · GW

Game theory also gives no answer to that problem. That said, I see hope that each could prove something like "We are symmetric enough that if I precommit to take no more than 60% by my measure, he will have precommitted to take no more than 80% by my measure. Therefore, by precommitting to take no more than 60%, I can know to get at least 20%.".

Comment by gurkenglas on The Hacker Learns to Trust · 2019-06-23T22:08:23.687Z · score: 5 (3 votes) · LW · GW

I was coming up with reasons that a nearsighted consequentialist (aka not worried about being manipulative) might use. That said, getting lurkers to identify with you, then gathering evidence that will sway you, and them, one way or the other, is a force multiplier on an asymmetric weapon pointed towards truth. You need only see the possibility of switching sides to use this. He was open about being open to be convinced. It's like preregistering a study.

Comment by gurkenglas on Accelerate without humanity: Summary of Nick Land's philosophy · 2019-06-23T21:16:55.387Z · score: 1 (1 votes) · LW · GW

We could simulate a video-game physics where energy and entropy are not a concern, and populate it with players. Therefore, not every physics complex enough to support life has anything to do with energy and entropy.

Comment by gurkenglas on The Hacker Learns to Trust · 2019-06-23T18:03:08.699Z · score: 1 (1 votes) · LW · GW

If people choose whether to identify with you at your first public statement, switching tribes after that can carry along lurkers.

Comment by gurkenglas on The Hacker Learns to Trust · 2019-06-23T12:02:01.633Z · score: 3 (2 votes) · LW · GW

If you want to build a norm, publicly visible use helps establish it.