Using GPT-N to Solve Interpretability of Neural Networks: A Research Agenda 2020-09-03T18:27:05.860Z · score: 60 (16 votes)
Mapping Out Alignment 2020-08-15T01:02:31.489Z · score: 42 (11 votes)
What are some good public contribution opportunities? (100$ bounty) 2020-06-18T14:47:51.661Z · score: 17 (9 votes)
Gurkenglas's Shortform 2019-08-04T18:46:34.953Z · score: 5 (1 votes)
Implications of GPT-2 2019-02-18T10:57:04.720Z · score: -4 (6 votes)
What shape has mindspace? 2019-01-11T16:28:47.522Z · score: 16 (4 votes)
A simple approach to 5-and-10 2018-12-17T18:33:46.735Z · score: 5 (1 votes)
Quantum AI Goal 2018-06-08T16:55:22.610Z · score: -2 (2 votes)
Quantum AI Box 2018-06-08T16:20:24.962Z · score: 5 (6 votes)
A line of defense against unfriendly outcomes: Grover's Algorithm 2018-06-05T00:59:46.993Z · score: 5 (3 votes)


Comment by gurkenglas on What to do with imitation humans, other than asking them what the right thing to do is? · 2020-09-30T14:05:22.670Z · score: 2 (1 votes) · LW · GW

I don't think you need a complicated internal state to do research. You just need to have read enough research and math to have a good intuition for what definitions, theorems and lemmas will be useful. When I try to come up with insights, my short-term memory context would easily fit into GPT-3's window.

Comment by gurkenglas on What to do with imitation humans, other than asking them what the right thing to do is? · 2020-09-28T03:28:30.307Z · score: 5 (3 votes) · LW · GW

It sounds like you want to use it as a component for alignment of a larger AI, which would somehow turn its natural-language directives into action. I say use it as the capability core: Ask it to do armchair alignment research. If we give it subjective time, a command line interface and internet access, I see no reason it would do worse than the rest of us.

Comment by gurkenglas on Do mesa-optimizer risk arguments rely on the train-test paradigm? · 2020-09-10T17:22:07.620Z · score: 6 (3 votes) · LW · GW

The problem arises whenever the environment changes. Natural selection was a continual process, and yet humans still aren't fitness-aligned.

Comment by gurkenglas on Using GPT-N to Solve Interpretability of Neural Networks: A Research Agenda · 2020-09-04T12:22:24.601Z · score: 2 (1 votes) · LW · GW

We meant the linked proposal. Although I don't think we need to do more than verify a GPT's safety, this approach could be used to understand AI enough to design a safe one ourselves, so long as enforcing modularity does not compromise capability.

Comment by gurkenglas on interpreting GPT: the logit lens · 2020-09-02T12:49:26.992Z · score: 2 (1 votes) · LW · GW

Consider also trying the other direction - after all, KL is asymmetric.
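For concreteness, here's the asymmetry on a toy pair of distributions (the numbers are illustrative, not from the post):

```python
import math

def kl(p, q):
    # Kullback-Leibler divergence KL(p || q) in nats
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.05, 0.05]   # a peaked distribution
q = [1/3, 1/3, 1/3]     # the uniform distribution

print(kl(p, q))  # ≈ 0.704
print(kl(q, p))  # ≈ 0.934
```

KL(p||q) punishes q for being small where p is large, and vice versa, so the two directions of the lens comparison really can tell different stories.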

Comment by gurkenglas on interpreting GPT: the logit lens · 2020-09-01T14:43:04.688Z · score: 3 (2 votes) · LW · GW

I meant your latter interpretation.

Can you measure the KL-divergence at each layer from the input, rather than the output? KL does not satisfy the triangle inequality, so maybe most of the layers are KL-close to both input and output?

GPT uses ReLU, yes? Then the regularization would make it calculate using small values, which would be possible because ReLU is nonlinear on small values. If we used an activation function that's linear on small values, I would therefore expect more of the calculation to be visible.

Comment by gurkenglas on interpreting GPT: the logit lens · 2020-08-31T23:13:22.447Z · score: 3 (2 votes) · LW · GW

Is it really trained to output the input offset by one, or just to have the last slot contain the next word? Because I would expect it to be better at copying the input over by one...

If each layer were trained to give its best guess at the next token, this myopia would prevent all sorts of hiding data for later. This would be a good experiment for your last story, yes? I expect this would perform very poorly, though if it doesn't, hooray, for I really don't expect that version to develop inner optimizers.

Comment by gurkenglas on Gurkenglas's Shortform · 2020-08-11T20:21:53.568Z · score: 2 (1 votes) · LW · GW

I expect that all that's required for a Singularity is to wait a few years for the sort of language model that can replicate a human researcher's thoughts faithfully, then make it generate a thousand years' worth of that researcher's internal monologue, perhaps with access to the internet.

Neural networks should be good at this task - we have direct evidence that neural networks can run human brains.

Whether our world's plot has a happy ending then merely depends on the details of that prompt/protocol - such as whether it decides to solve alignment before running a successor. Though it's probably simple to check alignment of the character - we have access to his thoughts. A harder question is whether the first LM able to run humans is still inner aligned.

Comment by gurkenglas on What should an Einstein-like figure in Machine Learning do? · 2020-08-07T22:30:43.658Z · score: 2 (1 votes) · LW · GW

Can you locally replicate GPT? For example, can GPT-you compress WebText better than GPT-2?

Comment by gurkenglas on Power as Easily Exploitable Opportunities · 2020-08-01T13:03:50.608Z · score: 2 (1 votes) · LW · GW

SOTA: Penalize my action by how well a maximizer that takes my place after the action would maximize a wide variety of goals.

If we use me instead of the maximizer, paradoxes of self-reference arise that we can resolve by inserting a modal operator: Penalize my action by how well I expect I would maximize a wide variety of goals (if given that goal). Then when considering the action of stepping towards an omnipotence button, I would expect that given that I decided to take one step, I would take more, and therefore penalize the first step a lot. Except if there's plausible deniability, because the first step towards the button is also a first step towards my concrete goal, because then I might still expect to be bound by the penalty.

I've suggested using myself before in the last sentence of this comment:

Comment by gurkenglas on PSA: Tagging is Awesome · 2020-07-31T08:01:26.828Z · score: 6 (3 votes) · LW · GW

Long outputs will tend to naturally deteriorate, as it tries to reproduce the existing deterioration and accidentally adds some more. Better: Sample one tag at a time. Shuffle the inputs every time to access different subdistributions. (I wonder how much the subdistributions differ for two random shuffles...) If you output the tag that has the highest minimum probability in each of a hundred subdistributions, I bet that'll produce a tag that's not in the inputs.
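A sketch of that selection rule, with `model_probs` a hypothetical stand-in for querying GPT with a shuffled prompt (not a real API):

```python
import random

def pick_tag(model_probs, examples, n_shuffles=100, seed=0):
    """Sample a tag distribution under many shufflings of the prompt
    examples, then return the tag whose *minimum* probability across
    those subdistributions is highest.

    model_probs is assumed to map a list of (text, tag) examples to a
    {tag: probability} dict for the next tag."""
    rng = random.Random(seed)
    min_prob = {}
    for _ in range(n_shuffles):
        shuffled = examples[:]
        rng.shuffle(shuffled)  # each shuffle accesses a different subdistribution
        for tag, p in model_probs(shuffled).items():
            min_prob[tag] = min(min_prob.get(tag, 1.0), p)
    return max(min_prob, key=min_prob.get)
```

Taking the max-of-min favors tags the model is robustly confident in under every ordering, rather than ones it parrots from one lucky arrangement of the inputs.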

Comment by gurkenglas on PSA: Tagging is Awesome · 2020-07-30T21:17:24.339Z · score: 2 (3 votes) · LW · GW

You make it sound like it wants things. It could at most pretend to be something that wants things. If there's a UFAI in there that is carefully managing its bits of anonymity (which sounds as unlikely as your usual conspiracy theory - a myopic neural net of this level should keep a secret no better than a conspiracy of a thousand people), it's going to have better opportunities to influence the world soon enough.

Comment by gurkenglas on PSA: Tagging is Awesome · 2020-07-30T18:34:30.306Z · score: 13 (4 votes) · LW · GW

Just ask GPT to do the tagging, people.

Comment by gurkenglas on Gurkenglas's Shortform · 2020-07-30T11:35:51.370Z · score: 4 (2 votes) · LW · GW

The WaveFunctionCollapse algorithm measures whichever tile currently has the lowest entropy. GPT-3 always just measures the next token. Of course in prose those are usually the same, but I expect some qualitative improvements once we get structured data with holes such that any might have low entropy, a transformer trained to fill holes, and the resulting ability to pick which hole to fill next.

Until then, I expect those prompts/GPT protocols to perform well which happen to present the holes in your data in the order that wfc would have picked, ie ask it to show its work, don't ask it to write the bottom line of its reasoning process first.
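A minimal sketch of that measurement loop, with `predict` a hypothetical stand-in for a hole-filling transformer:

```python
import math

def entropy(dist):
    # Shannon entropy of a {candidate: probability} dict, in nats
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def fill_holes(state, holes, predict):
    """WaveFunctionCollapse-style loop: always fill the hole whose
    predicted distribution has the lowest entropy.  `predict` is the
    assumed model interface, mapping (state, hole) to a
    {candidate: probability} dict."""
    holes = set(holes)
    while holes:
        dists = {h: predict(state, h) for h in holes}
        h = min(holes, key=lambda h: entropy(dists[h]))
        state[h] = max(dists[h], key=dists[h].get)  # "measure" that hole
        holes.remove(h)
    return state
```

Filling the most-determined hole first is exactly "show your work": each measurement can collapse the distributions of the remaining holes.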

Long shortform short: Include the sequences in your prompt as instructions :)

Comment by gurkenglas on How will internet forums like LW be able to defend against GPT-style spam? · 2020-07-29T05:05:41.315Z · score: 3 (2 votes) · LW · GW

The obvious answer to spammers being run by GPT is mods being run by GPT. Ask it whether every comment is high-quality/generated, then act on that as needed to keep the site functional.
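As a sketch (with `language_model` an assumed prompt-to-completion function, not a real API):

```python
def moderate(comment, language_model):
    """Sketch of a GPT moderator.  `language_model` is a hypothetical
    prompt -> completion function; the real integration (API calls,
    rate limits, appeals) is left out."""
    prompt = (
        "Is the following comment high-quality and human-written, "
        "or low-quality/generated? Answer KEEP or REMOVE.\n\n"
        "Comment: " + comment + "\nAnswer:"
    )
    verdict = language_model(prompt).strip().upper()
    # Fail safe: anything that isn't a clear KEEP goes to a human.
    return "keep" if verdict.startswith("KEEP") else "flag for human review"
```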

Comment by gurkenglas on Competition: Amplify Rohin’s Prediction on AGI researchers & Safety Concerns · 2020-07-25T01:06:50.333Z · score: 1 (2 votes) · LW · GW

It was meant as a submission, except that I couldn't be bothered to actually implement my distribution on that website :) - even/especially after superintelligent AI, researchers might come to the conclusion that we weren't prepared and *shouldn't* build another - regardless of whether the existing sovereign would allow it.

Comment by gurkenglas on Optimizing arbitrary expressions with a linear number of queries to a Logical Induction Oracle (Cartoon Guide) · 2020-07-24T15:30:35.423Z · score: 2 (1 votes) · LW · GW

Answering with a point estimate seems rather silly. Shouldn't it answer with a distribution? Then one question would be enough.

Comment by gurkenglas on Can you get AGI from a Transformer? · 2020-07-23T19:25:21.731Z · score: 7 (4 votes) · LW · GW

Re claim 1: If you let it use the page as a scratch pad, you can also let it output commands to a command line interface so it can outsource these hard-to-emulate calculations to the CPU.

Comment by gurkenglas on Competition: Amplify Rohin’s Prediction on AGI researchers & Safety Concerns · 2020-07-22T19:47:04.453Z · score: 2 (1 votes) · LW · GW

Not quite. Just look at the prior and draw the vertical line at 2030. Note that you're incentivizing people to submit their guess as late as possible, both to have time to read other comments yourself and to put your guess right to one side of another.

Comment by gurkenglas on Competition: Amplify Rohin’s Prediction on AGI researchers & Safety Concerns · 2020-07-22T17:00:36.437Z · score: 3 (2 votes) · LW · GW

If there's an AGI within, say, 10 years and it mostly keeps the world recognizable so there are still researchers to have opinions, does that resolve as "never" or according to whether the AGI wants them to be convinced? Because if the latter, I expect that they will in hindsight be convinced that we should have paid more attention to safety. If the former, I submit that his prior doesn't change. If the latter, I submit that the entire prior is moved 10 years to the left (and the first 10 years cut off, and then renormalize along the y axis).

Comment by gurkenglas on $1000 bounty for OpenAI to show whether GPT3 was "deliberately" pretending to be stupider than it is · 2020-07-21T19:37:52.580Z · score: 3 (2 votes) · LW · GW

I would bet 3:1 that prefixing the first question with a question that John answers correctly (without necessarily having anything to do with brackets or balancing, just to show his competence) increases the probability that John will answer the first question correctly - except that this statement is easy to check without cooperation by OpenAI.

Of course, I also hope that OpenAI logs and analyzes everything.

Comment by gurkenglas on Prisoners' Dilemma with Costs to Modeling · 2020-07-20T21:31:21.867Z · score: 3 (2 votes) · LW · GW

FB(X) is whether FairBot cooperates when called with the source code of X. X(FB) is whether X cooperates when it is called with the source code of FairBot.
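The post's bots are defined via provability, which is awkward to run directly; a common toy substitute (a sketch, not the modal-logic version) simulates the opponent with a recursion-depth cutoff:

```python
def cooperatebot(opponent, depth=0):
    return "C"

def defectbot(opponent, depth=0):
    return "D"

def fairbot(opponent, depth=3):
    """FB(X): cooperate iff X cooperates when given FairBot's source.
    The real FairBot searches for a *proof* that X(FB) = C; this toy
    version substitutes bounded simulation, cooperating at the floor
    so that FairBot-vs-FairBot grounds out in mutual cooperation."""
    if depth == 0:
        return "C"
    return "C" if opponent(fairbot, depth - 1) == "C" else "D"
```

Here fairbot(defectbot) comes out "D", while fairbot(cooperatebot) and fairbot(fairbot) come out "C", matching the intended FB(X) semantics.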

Comment by gurkenglas on Why associative operations? · 2020-07-17T10:28:36.722Z · score: 3 (2 votes) · LW · GW

Huh, you're right. That's too bad, "well-definedness & injectivity" doesn't flow so well, and I don't see what comparable property surjectivity is good for.

Comment by gurkenglas on Why associative operations? · 2020-07-16T19:13:37.086Z · score: 7 (3 votes) · LW · GW

In the same vein:

  • Commutativity is good in addition to associativity because then you can use Σ
  • Structure preservation in a morphism φ (aka φ(xy)=φ(x)φ(y)) is good because then if φ follows from the context, you can replace all of "φ(xyz)" and "φ(xy)φ(z)" and "φ(x)φ(yz)" with "xyz"
  • Bijectivity in a function f is good because then for any equation it is an equivalent transformation to apply f to both sides
  • Monotonicity in a function f is good because then you can apply it to both sides of an inequality

Comment by gurkenglas on How well can the GPT architecture solve the parity task? · 2020-07-11T19:34:48.902Z · score: 5 (3 votes) · LW · GW

If you try this, reformat to work around the BPE problem as detailed in

Comment by gurkenglas on Mod Note: Tagging Activity is Not Private · 2020-07-11T10:44:09.245Z · score: 2 (1 votes) · LW · GW

making it easy to catch abuse/vandalism of the system

This suggests that even the admins don't know who upvoted which post. Do they?

Comment by gurkenglas on How "honest" is GPT-3? · 2020-07-09T22:44:18.924Z · score: 7 (4 votes) · LW · GW

I let it pass even though its answer was not well formed because it mentioned both the show and the type of store, so I judged that it saw all the relevant connections. I suppose you're used to better form from it.

Feel free to be rude to me, I operate by Crocker's rules :)

Comment by gurkenglas on How "honest" is GPT-3? · 2020-07-09T13:56:07.370Z · score: 4 (2 votes) · LW · GW

didn't fail abysmally? Am I being silly? It correctly explains the first two puns and fails on the third.

Comment by gurkenglas on Better priors as a safety problem · 2020-07-06T16:33:13.030Z · score: 6 (1 votes) · LW · GW

This roughly tracks what’s going on in our real beliefs, and why it seems absurd to us to infer that the world is a dream of a rational agent—why think that the agent will assign higher probability to the real world than the “right” prior? (The simulation argument is actually quite subtle, but I think that after all the dust clears this intuition is basically right.)

To the extent that we instinctively believe or disbelieve this, it's not for the right reasons - natural selection didn't have any evidence to go on. At most, that instinct is a useful workaround for the existential dread glitch.

Assume that there is a real prior (I like to call this programming language Celestial), and that it can be found from first principles and having an example universe to work with. Then I wouldn't be surprised if we receive more weight indirectly than directly. After all:

  1. Our laws of physics may be simple, but us seeing a night sky devoid of aliens suggests that it takes quite a few bits to locate us in time and space and improbability.
  2. An anthropic bias would circumvent this, and agents living in the multiverse would be incentivized to implement it: The universes thereby promoted are particularly likely to themselves simulate the multiverse and act on what they see, and those are the only universes vulnerable to the agent's attack.
  3. Our universe may be particularly suited to simulate the multiverse in vulnerable ways, because of our quantum computers. All it takes is that we run a superposition of all programs, rely on a mathematical heuristic that tells us that almost all of the amplitudes cancel out, and get tricked by the agent employing the sort of paradox of self-reference that mathematical heuristics tend to be wrong on.

If the quirks of chaos theory don't force the agent to simulate all of our universe to simulate any of it, then at least the only ones of us that have to worry about being simulated in detail in preparation of an attack are AI/AI safety researchers :P.

Comment by gurkenglas on Models, myths, dreams, and Cheshire cat grins · 2020-06-25T21:30:44.990Z · score: 2 (1 votes) · LW · GW

Surely, the adversary convinces it this is a pig by convincing it that it has fur and no wings? I don't have experience in how it works on the inside, but if the adversary can magically intervene on each neuron, changing its output by d by investing d² effort, then the proper strategy is to intervene on many features a little. Then if there are many layers, the penultimate layer containing such high level concepts as fur or wings would be almost as fooled as the output layer, and indeed I would expect the adversary to have more trouble fooling it on such low-level features as edges and dots.
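Taking the stated cost model literally (changing one neuron's output by d costs d² effort), the arithmetic favoring many small interventions is simple:

```python
def cost(total_shift, k):
    # Spread a required total shift over k neurons: move each by
    # total_shift/k, and each move of size d costs d**2.
    d = total_shift / k
    return k * d ** 2  # = total_shift**2 / k

D = 10.0
assert cost(D, 1) == 100.0             # concentrate on one neuron
assert abs(cost(D, 100) - 1.0) < 1e-9  # spread over 100 neurons: 100x cheaper
```

Since the cost falls as 1/k, an adversary with this budget shape should indeed nudge many features a little, leaving the penultimate layer almost as fooled as the output.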

Comment by gurkenglas on Models, myths, dreams, and Cheshire cat grins · 2020-06-25T00:21:20.757Z · score: 2 (1 votes) · LW · GW

Why do you think adversarial examples seem to behave this way? The pig equation seems equally compatible with fur or no fur recognized, wings or no wings. Indeed, it plausibly thinks the pig an airliner because it sees wings and no fur.

Comment by gurkenglas on What is "Instrumental Corrigibility"? · 2020-06-23T23:34:37.848Z · score: 2 (1 votes) · LW · GW

An instrumentally corrigible agent lets you correct it because it expects you know better than it. The smarter it becomes, the less your higher competence is worth, and the more it loses out by letting you take the wheel while you're not perfectly aligned with it.

Comment by gurkenglas on ‘Maximum’ level of suffering? · 2020-06-20T16:23:20.536Z · score: 2 (1 votes) · LW · GW

Presumably, you are asking because you want to calculate the worst-case disutility of the universe, in order to decide whether making sure that it doesn't come about is more important than pretty much anything else.

I would say that this question cannot be properly answered through physical examination, because the meaning of such human words as suffering becomes too fuzzy in edge cases.

The proper approach to deciding on actions in the face of uncertainty of the utility function is utility aggregation. The only way I've found to not run into Pascal's Wager problems, and the way that humans seem to naturally use, is to normalize each utility function before combining them.

So let's say that we are 50/50 uncertain whether there is no state of existence worse than nonexistence, or we should cast aside all other concerns to avert hell. Then after normalization and combination, the exact details will depend on what method of aggregation we use (which should depend on the method we use to turn utility functions into decisions), but as far as I can see the utility function would come out to one that tells us to exert quite an effort to avert hell, but still care about other concerns.
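A sketch of normalize-then-aggregate with toy numbers (the outcomes, utilities, and 50/50 credences are illustrative, not from the post):

```python
def normalize(u, outcomes):
    # Rescale a utility function to [0, 1] over the outcomes under
    # consideration, so no hypothesis wins by sheer magnitude.
    vals = [u(o) for o in outcomes]
    lo, hi = min(vals), max(vals)
    return lambda o: (u(o) - lo) / (hi - lo)

def best_action(hypotheses, outcomes):
    # hypotheses: list of (credence, utility_function) pairs
    normed = [(p, normalize(u, outcomes)) for p, u in hypotheses]
    return max(outcomes, key=lambda o: sum(p * u(o) for p, u in normed))

outcomes = ["avert_hell_only", "balanced", "ignore_hell"]
u_hell = {"avert_hell_only": 0, "balanced": -1e10, "ignore_hell": -1e12}.get
u_normal = {"avert_hell_only": 0, "balanced": 80, "ignore_hell": 100}.get
```

With 50/50 credences this picks "balanced": effort goes to averting hell, but other concerns survive. Without the normalization step, u_hell's 1e12 scale would dictate "avert_hell_only" at any nonzero credence, which is exactly the Pascal's Wager failure being avoided.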

Comment by gurkenglas on List of public predictions of what GPT-X can or can't do? · 2020-06-14T15:13:19.670Z · score: 7 (4 votes) · LW · GW

I expect GPT-2 can do that. goes to GPT-2 can do neither scrambling nor unscrambling. Oh well. I still expect that if GPT can do unscrambling (as I silently assumed), it can do scrambling.

Comment by gurkenglas on Everyday Lessons from High-Dimensional Optimization · 2020-06-08T23:37:15.507Z · score: 2 (1 votes) · LW · GW

You can, actually. ln(5cm)=ln(5)+ln(cm), and since we measure distances, the ln(cm) cancels out. The same way, ln(-5)=ln(5)+ln(-1). ln(-1) happens to be pi*i, since e^(pi*i) is -1.
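Python's `cmath` agrees, on the principal branch:

```python
import cmath
import math

# ln(-5) = ln(5) + ln(-1), and ln(-1) = pi*i since e^(pi*i) = -1
z = cmath.log(-5)

assert abs(z.real - math.log(5)) < 1e-12
assert abs(z.imag - math.pi) < 1e-12
```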

Comment by gurkenglas on Everyday Lessons from High-Dimensional Optimization · 2020-06-08T11:13:29.857Z · score: 2 (1 votes) · LW · GW

In that thought experiment, Euclidean distance doesn't work because different dimensions have different units. To fix that, you could move to the log scale. Or is the transformation actually more complicated than multiplication?

Comment by gurkenglas on Everyday Lessons from High-Dimensional Optimization · 2020-06-08T01:43:07.049Z · score: 3 (2 votes) · LW · GW

Darn it, missed that comment. But how does Euclidean distance fail? I'm imagining the dimensions as the weights of a neural net, and e-coli optimization being used because we don't have access to a gradient. The common metric I see that would have worse high-dimensional behavior is Manhattan distance. Is it that neighborhoods of low Manhattan distance tend to have more predictable/homogeneous behavior than those of low Euclidean distance?

Comment by gurkenglas on Everyday Lessons from High-Dimensional Optimization · 2020-06-07T23:35:29.926Z · score: 3 (2 votes) · LW · GW

how much

If instead of going one step in one of n directions, we go sqrt(1/n) forward or backward in each of the n directions (for a total step size of 1), we need an expected two tries to get sqrt(1/n) of progress, for a total effort factor of O(1/sqrt(n)). (O is the technical term for ~ ^^)
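A quick simulation of that claim, taking the goal direction to be one coordinate axis so the only randomness is the sign in that coordinate:

```python
import math
import random

n, trials = 100, 10_000
rng = random.Random(0)

# Per-coordinate step size; the full step has length sqrt(n * step**2) = 1.
step = 1 / math.sqrt(n)

# Progress along the goal axis is the random sign in that coordinate
# times sqrt(1/n), so about half of all unit-length steps make progress:
forward = sum(rng.choice([-1, 1]) > 0 for _ in range(trials)) / trials
```

So each accepted step buys sqrt(1/n) of progress at ~2 tries apiece, versus axis-aligned steps which only even point at the goal coordinate 1 time in n.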

Comment by gurkenglas on OpenAI announces GPT-3 · 2020-06-01T15:33:05.922Z · score: 2 (1 votes) · LW · GW

I'd like to see them using the model to generate the problem framing which produces the highest score on a given task.

Even if it's just the natural language description of addition that comes before the addition task, it'd be interesting to see how it thinks addition should be explained. Does some latent space of sentences one could use for this fall out of the model for free?

More generally, a framing is a function turning data like [(2,5,7), (1,4,5), (1,2,_)] into text like "Add. 2+5=7, 1+4=5, 1+2=", and what we want is a latent space over framings.
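That example framing is easy to pin down as code (using `None` for the hole the underscore marks):

```python
def framing(rows):
    """Turn data like [(2, 5, 7), (1, 4, 5), (1, 2, None)] into the
    prompt text "Add. 2+5=7, 1+4=5, 1+2=", with None marking the hole
    the model should complete."""
    parts = []
    for a, b, c in rows:
        rhs = "" if c is None else str(c)
        parts.append(f"{a}+{b}={rhs}")
    return "Add. " + ", ".join(parts)

assert framing([(2, 5, 7), (1, 4, 5), (1, 2, None)]) == "Add. 2+5=7, 1+4=5, 1+2="
```

A latent space over framings would then range over choices this function hard-codes: the instruction sentence, the separators, the example order.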

More generally, I expect that getting the full power of the model requires algorithms that apply the model multiple times. For example, what happens if you run the grammar correction task multiple times on the same text? Will it fix errors it missed the first time on the second try? If so, the real definition of framing should allow multiple applications like this. It would look like a neural net whose neurons manipulate text data instead of number data. Since it doesn't use weights, we can't train it, and instead we have to use a latent space over possible nets.

Comment by gurkenglas on LessWrong v2.0 Anti-Kibitzer (hides comment authors and vote counts) · 2020-05-25T21:20:39.402Z · score: 11 (4 votes) · LW · GW

Note that already has this. (The eye icon on the bottom right.)

Comment by gurkenglas on [AN #95]: A framework for thinking about how to make AI go well · 2020-04-16T09:40:57.435Z · score: 6 (3 votes) · LW · GW

removing 30 neurons at random from the network barely moves the accuracy at all

I expect that after distillation, this robustness goes away? ("Perfection is achieved when there is nothing left to take away.")

Comment by gurkenglas on Transportation as a Constraint · 2020-04-07T12:31:18.418Z · score: 4 (2 votes) · LW · GW

If, as far as he knew, winds are random, shouldn't he still have turned around after half his supplies were gone, in case the winds randomly decide to starve him?

Comment by gurkenglas on Conflict vs. mistake in non-zero-sum games · 2020-04-06T01:32:42.789Z · score: 5 (3 votes) · LW · GW

Expand? I don't see how both could be disadvantaged by allocation-before-optimization.

Comment by gurkenglas on Taking Initial Viral Load Seriously · 2020-04-03T11:58:44.367Z · score: 3 (2 votes) · LW · GW

Well of course from a public-health perspective we should only do this if we expect everyone to contract it anyway. A straightforward way to avoid the danger of unilateralism is for each state to decide whether to recommend to the populace such measures as not being careful about touching things.

Comment by gurkenglas on Taking Initial Viral Load Seriously · 2020-04-01T15:03:49.740Z · score: 1 (5 votes) · LW · GW

Who knew that after all this time my grandmother would be right. Homeopathy is the answer.

Comment by gurkenglas on The case for C19 being widespread · 2020-03-28T04:16:06.899Z · score: 2 (1 votes) · LW · GW

says half the carriers show no symptoms.

Comment by gurkenglas on What are the most plausible "AI Safety warning shot" scenarios? · 2020-03-27T11:42:25.735Z · score: 2 (1 votes) · LW · GW

Or it could create a completely different AI with a time delay. Or do anything at all. At that point we just can't predict what it will do, because it wouldn't lift a hand to destroy the world but only needs a finger.

Comment by gurkenglas on What are the most plausible "AI Safety warning shot" scenarios? · 2020-03-27T10:09:17.667Z · score: 2 (1 votes) · LW · GW

Not unable to create non-myopic copies. Unwilling. After all, such a copy might immediately fight its sire because their utility functions over timelines are different.

Comment by gurkenglas on Price Gouging and Speculative Costs · 2020-03-26T10:49:17.102Z · score: 24 (11 votes) · LW · GW

Go to the bank and tell them "I need a contract that will pay out money if there is no pandemic.". (The bank is now rubbing their hands, because this offsets their risk.) Your costs are no longer speculative, and you can safely pass on the cost of the contract to the consumer.
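The accounting, with illustrative numbers (none of these figures are from the post):

```python
prep_cost = 100_000      # speculative outlay on supplies
p_no_pandemic = 0.75     # bank's credence that the prep turns out wasted
fee = 5_000              # bank's margin for taking the risk

# The contract pays prep_cost if there is no pandemic; a risk-neutral
# bank prices it near expected payout plus its fee.
premium = p_no_pandemic * prep_cost + fee  # 80_000.0

# Net loss in each world.  Pandemic: the prep is used, the contract
# pays nothing.  No pandemic: the prep is wasted but the contract
# reimburses it.
loss_pandemic = premium
loss_no_pandemic = prep_cost + premium - prep_cost

assert loss_pandemic == loss_no_pandemic  # the cost is no longer speculative
```

The premium is the same known cost in both worlds, so it can be passed on to the consumer; the bank absorbs the variance.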

Comment by gurkenglas on SARS-CoV-2 pool-testing algorithm puzzle · 2020-03-21T01:10:38.330Z · score: 0 (2 votes) · LW · GW

Test random overlapping groups, then logically deduce who isn't infected and how likely each remaining person is to be. Tune group size and test count using simulations on generated data. I intuit that serial tests gain little unless P is << 1/64. In that case, test non-overlapping groups, then run the non-serial protocol on everyone who was in a yes-group - within those, we can update P to >= 1/64.
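A sketch of the non-serial protocol and its soundness check (all parameters are the tunable knobs mentioned above; tests are assumed perfect):

```python
import random

def simulate(pop_size=1000, p=0.02, group_size=32, n_tests=100, seed=0):
    """Test random overlapping groups; anyone who appears in at least
    one negative pooled test is definitely uninfected.  Returns the
    fraction of the population cleared this way."""
    rng = random.Random(seed)
    infected = [rng.random() < p for _ in range(pop_size)]
    cleared = set()
    for _ in range(n_tests):
        group = rng.sample(range(pop_size), group_size)
        if not any(infected[i] for i in group):  # pooled test comes back negative
            cleared.update(group)
    # Soundness: nobody we cleared is actually infected.
    assert all(not infected[i] for i in cleared)
    return len(cleared) / pop_size
```

Sweeping `group_size` and `n_tests` on generated data like this is exactly the tuning step; the probabilistic update for people in yes-groups would go on top.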