Tetraspace Grouping's Shortform

post by Tetraspace (tetraspace-grouping) · 2019-08-02T01:37:14.859Z · LW · GW · 22 comments

Comments sorted by top scores.

comment by Tetraspace (tetraspace-grouping) · 2020-04-19T23:38:15.083Z · LW(p) · GW(p)

PMarket Maker

Just under a month ago, I said "web app idea: one where you can set up a play-money prediction market with only a few clicks", because I was playing around on Hypermind and wishing that I could do my own Hypermind. It then occurred to me that I can make web apps, so after getting up to date on modern web frameworks I embarked on creating such a site.

Anyway, it's now complete enough to use, provided that you don't blow on it too hard. Here it is: pmarket-maker.herokuapp.com. Enjoy!

You can create a market, and then create a set of options within that market. Players can make buy and sell limit orders on those options. You can close an option and pay out a specific amount per owned share. There are no market makers, despite the pun in the name, but players start with 1000 internet points that they can use to shortsell.
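As a rough illustration of what matching limit orders and paying out shares involves (a hypothetical sketch, not the site's actual code; the `Order` type and the `crosses` and `payout` names are made up here, and treating a short position as a negative share count is an assumption):

```python
from dataclasses import dataclass

@dataclass
class Order:
    player: str
    side: str    # "buy" or "sell"
    price: int   # limit price, in internet points per share
    shares: int

def crosses(buy: Order, sell: Order) -> bool:
    """A buy and a sell limit order can trade if the bid is at least the ask."""
    return buy.side == "buy" and sell.side == "sell" and buy.price >= sell.price

def payout(shares_owned: int, amount_per_share: int) -> int:
    """Points paid when an option closes; a negative holding (a short) owes points."""
    return shares_owned * amount_per_share
```

With no market maker, an order that crosses nothing would simply sit in the book until another player places a matching one.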

EDIT 2023-02-25: Such a web app exists for real now, made as an actual product by other people who can devops and design UIs; it's called Manifold Markets.

comment by Tetraspace (tetraspace-grouping) · 2020-04-10T03:01:43.806Z · LW(p) · GW(p)

Thoughts on Ryan Carey's Incorrigibility in the CIRL Framework (I am going to try to post these semi-regularly).

This specific situation looks unrealistic. But it's not really trying to be too realistic, it's trying to be a counterexample. In that spirit, you could also just use $R_{2} (a, s_{d}) = 1000$ , which is a reward function parametrized by $θ$ that gives the same behavior but stops me from saying "Why Not Just set $θ = - 1$ ", which isn't the point.

How something like this might actually happen: you try to have your $R_{1}$ be a complicated neural network that can approximate any function. But you butcher the implementation and get something basically random instead, and this $R_{2}$ cannot approximate the real human reward.

An important insight this highlights well: An off-switch is something that you press only when you've programmed the AI badly enough that you need to press the off-switch. But if you've programmed it wrong, you don't know what it's going to do, including, possibly, its off-switch behavior. Make sure you know under which assumptions your off-switch will still work!
Assigning high value to shutting down is incorrigible, because the AI shuts itself down. What about assigning high value to being in a button state?
The paper considers a situation where the shutdown button is hardcoded, which isn't enough by itself. What's really happening is that the human either wants or doesn't want the AI to shut down, which sounds like a term in the human reward that the AI can learn.

One way to do this is for the AI to do maximum likelihood with a prior that assigns 0 probability to the human erroneously giving the shutdown command. I suspect there's something less hacky related to setting an appropriate prior over the reward assigned to shutting down.

The footnote on page 7 confuses me a bit - don't you want the AI to always defer to the human in button states? The answer feels like it will be clearer to me if I look into how "expected reward if the button state isn't avoided" is calculated.

Also I did just jump into this paper. There are probably lots of interesting things that people have said about MDPs and CIRLs and Q-values that would be useful.

Replies from: tetraspace-grouping

↑ comment by Tetraspace (tetraspace-grouping) · 2020-05-02T10:09:48.008Z · LW(p) · GW(p)

Thoughts on Dylan Hadfield-Menell et al.'s The Off-Switch Game.

I don't think it's quite right to call this an off-switch - the model is fully general to the situation where the AI is choosing between two alternatives A and B (normalized in the paper so that U(B) = 0), and to me an off-switch is a hardware override that the AI need not want you to press.
The wisdom to take away from the paper: An AI will voluntarily defer to a human - in the sense that the AI thinks that it can get a better outcome by its own standards if it does what the human says - if it's uncertain about the utilities, or if the human is rational.
This whole setup seems to be somewhat superseded by CIRL, which has the AI, uh, causally find $U_{A}$ by learning its value from the human actions, instead of evidentially(?) doing it by taking decisions that happen to land it on action A when $U_{A}$ is high because it's acting in a weird environment where a human is present as a side-constraint.
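A minimal Monte Carlo sketch of that takeaway, under assumptions that are not the paper's exact setup: the AI's belief about $U_{A}$ is Gaussian, the human is perfectly rational (allows A exactly when $U_{A} > 0$), and B is worth 0. Deferring earns $E[max(U_{A}, 0)]$, which is never less than acting unilaterally, $max(E[U_{A}], 0)$, and the two coincide when there is no uncertainty.

```python
import random

def off_switch_values(mean, std, n=100_000, seed=0):
    """Return (value of acting alone, value of deferring to a rational human)."""
    rng = random.Random(seed)
    samples = [rng.gauss(mean, std) for _ in range(n)]
    act_alone = max(sum(samples) / n, 0.0)          # pick A or B by expected utility
    defer = sum(max(u, 0.0) for u in samples) / n   # human vetoes A whenever U_A < 0
    return act_alone, defer
```

The function name and the Gaussian belief are illustrative choices; any belief distribution would show the same inequality.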

Could some wisdom to gain be that the high-variance high-human-rationality is something of an explanation as to why CIRL works? I should read more about CIRL to see if this is needed or helpful and to compare and contrast etc.

Why does the reward gained drop when uncertainty is too high? Because the prior that the AI gets from estimating the human reward is more accurate than the human decisions, so in too-high-uncertainty situations it keeps mistakenly deferring to the flawed human who tells it to take the worse action more often?

The verbal description, that the human just types in a noisily sampled value of $U_{A}$ , is somewhat strange - if the human has explicit access to their own utility function, they can just take the best actions directly! In practice, though, the AI would learn this by looking at many past human actions (there's some CIRL!) which does seem like it plausibly gives a more accurate policy than the human's (ht Should Robots Be Obedient).
The human is Boltzmann-rational in the two-action situation (hence the sigmoid). I assume that it's the same for the multi-action situation, though this isn't stated. How much does the exact way in which the human is irrational matter for their results?
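For reference, the two-action Boltzmann-rational human can be sketched as below. The function name and the `beta` temperature parameter are labels chosen here, not the paper's; with B normalized to utility 0, the softmax over {A, B} reduces to a sigmoid in $U_{A}$.

```python
import math

def p_human_allows_a(u_a: float, beta: float = 1.0) -> float:
    """Probability that a Boltzmann-rational human picks A over B (U(B) = 0).

    Softmax over the two options: exp(u_a/beta) / (exp(u_a/beta) + exp(0)),
    which is the sigmoid of u_a / beta. Smaller beta means a more rational human.
    """
    return 1.0 / (1.0 + math.exp(-u_a / beta))
```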

Replies from: tetraspace-grouping

↑ comment by Tetraspace (tetraspace-grouping) · 2020-05-27T00:27:56.445Z · LW(p) · GW(p)

Thoughts on Abram Demski's Partial Agency [? · GW]:

When I read Partial Agency, I was struck with a desire to try formalizing this partial agency thing. Defining Myopia [? · GW] seems like it might have a definition of myopia; one day I might look at it. Anyway,

Formalization of Partial Agency: Try One

A myopic agent is optimizing a reward function $R (x_{0}, y (x_{0}))$ where $x$ is the vector of parameters it's thinking about and $y$ is the vector of parameters it isn't thinking about. The gradient descent step picks the $δ x$ in the direction that maximizes $R (x_{0} + δ x, y (x_{0}))$ (it is myopic so it can't consider the effects on $y$ ), and then moves the agent to the point $(x_{0} + δ x, y (x_{0} + δ x))$ .

This is dual to a stop-gradient agent, which picks the $δ x$ in the direction that maximizes $R (x_{0} + δ x, y (x_{0} + δ x))$ but then moves the agent to the point $(x_{0} + δ x, y (x_{0}))$ (the gradient through $y$ is stopped).
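The two update rules can be sketched for scalar $x$ with finite-difference gradients. This is only an illustration of the definitions above; the step size `lr`, the difference width `h`, and the toy $R$ and $y$ at the bottom are all made up.

```python
def myopic_step(R, y, x0, lr=0.1, h=1e-6):
    """One gradient step that ignores how y depends on x, then lets y update."""
    y0 = y(x0)
    grad = (R(x0 + h, y0) - R(x0 - h, y0)) / (2 * h)   # partial dR/dx, y held fixed
    x1 = x0 + lr * grad
    return x1, y(x1)                                   # y responds to the new x

def stop_gradient_step(R, y, x0, lr=0.1, h=1e-6):
    """The dual: the step accounts for y's dependence on x, but y is then frozen."""
    grad = (R(x0 + h, y(x0 + h)) - R(x0 - h, y(x0 - h))) / (2 * h)  # total dR/dx
    x1 = x0 + lr * grad
    return x1, y(x0)

# Toy example (hypothetical): R = x + y with y(x) = -2x.
R = lambda x, y: x + y
y = lambda x: -2.0 * x
```

In this toy case the myopic agent moves in the $+x$ direction even though the total derivative of $R$ is negative, while the stop-gradient agent moves in the $-x$ direction.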

For example,

Nash equilibria - $x$ are the parameters defining the agent's behavior. $y (x_{0})$ are the parameters of the other agents if they go up against the agent parametrized by $x_{0}$ . $R$ is the reward given for an agent $x$ going up against a set of agents $y$ .
Image recognition with a neural network - $x$ is the parameters defining the network, $y (x_{0})$ are the image classifications for every image in the dataset for the network with parameters $x_{0}$ , and $R$ is the loss function plus the loss of the network described by $x$ on classifying the current training example.
Episodic agent - $x$ are parameters describing the agent's behavior. $y (x_{0})$ are the performances of the agent $x_{0}$ in future episodes. $R$ is the sum of $y$ , plus the reward obtained in the current episode.

Partial Agency due to Uncertainty?

Is it possible to cast partial agency in terms of uncertainty over reward functions? One reason I'd be myopic is if I didn't believe that I could, in expectation, improve some part of the reward, perhaps because it's intractable to calculate (behavior of other agents) or something I'm not programmed to care about (reward in other episodes).

Let $R_{1}$ be drawn from a probability distribution over reward functions. Then one could decompose the true, uncertain, reward into $R^{'} = R_{0} (x_{0}) + R_{1} (x_{0})$ defined in such a way that $E (R_{1} (x_{0} + δ x) - R_{1} (x_{0})) \approx 0$ for any $δ x$ ? Then this would be myopia where the agent either doesn't know or doesn't care about $R_{1}$ , or at least doesn't know or care what its output does to $R_{1}$ . This seems sufficient, but not necessary.

Now I have two things that might describe myopia, so let's use both of them at once! Since you only end up doing gradient descent on $R_{0}$ , it would make sense to say $R^{'} (x) = R (x, y (x))$ , $R_{0} (x) = R (x, y (x_{0}))$ , and hence that $R_{1} (x) = R (x, y (x)) - R (x, y (x_{0}))$ .

Since $R_{1} (x_{0} + δ x) = R_{1} (x_{0}) + δ x \frac{\partial R_{1}}{\partial x}$ for small $δ x$ , this means that $E (\frac{\partial R_{1}}{\partial x}) = 0$ , so substituting in my expression for $R_{1}$ gives $E (\frac{\partial R}{\partial x} + \frac{\partial R}{\partial y} \frac{\partial y}{\partial x} - \frac{\partial R}{\partial x}) = 0$ , so $E (\frac{\partial R}{\partial y} \frac{\partial y}{\partial x}) = 0$ . Uncertainty is only over $R$ , so this is just the claim that the agent will be myopic with respect to $y$ if $E (\frac{\partial R}{\partial y}) = 0$ . So it won't want to include $y$ in its gradient calculation if it thinks the gradients with respect to $y$ are, on average, 0. Well, at least I didn't derive something obviously false!

But Wait There's More

When writing the examples for the gradient descenty formalisation, something struck me: it seems there's a $R (x) = r (x) + \sum_{i} y_{i} (x)$ structure to a lot of them, where $r$ is the reward on the current episode, and $y_{i}$ are rewards obtained on future episodes.

You could maybe even use this to have soft episode boundaries, like say the agent receives a reward $r_{t}$ on each timestep so $R (x) = r_{0} (x) + r_{1} (x) α + r_{2} (x) α^{2} + \sum_{i = 3} r_{i} (x) α^{i}$ , and saying that $α^{3} ≪ 1$ so that $\frac{\partial R}{\partial r_{i}} ≪ 1$ for $i \geq 3$ , which is basically the criterion for myopia up above.
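A quick numeric illustration of the soft boundary, with a hypothetical discount $α = 0.1$: the weight on step $i$ is $\frac{\partial R}{\partial r_{i}} = α^{i}$, so steps from $i = 3$ onwards barely move $R$.

```python
def discounted_return(rewards, alpha):
    """R = sum_i alpha**i * r_i, so dR/dr_i = alpha**i."""
    return sum(alpha**i * r for i, r in enumerate(rewards))

alpha = 0.1                              # hypothetical discount with alpha**3 << 1
weights = [alpha**i for i in range(6)]   # sensitivity of R to each r_i
```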

Unrelated Note

On a completely unrelated note, I read the Parable of Predict-O-Matic in the past, but foolishly neglected to read Partial Agency beforehand. The only thing that I took away from PoPOM the first time around was the bit about inner optimisers, coincidentally the only concept introduced that I had been thinking about beforehand. I should have read the manga before I watched the anime.

Replies from: tetraspace-grouping

↑ comment by Tetraspace (tetraspace-grouping) · 2020-05-28T16:58:11.094Z · LW(p) · GW(p)

So the definition of myopia given in Defining Myopia was quite similar to my expansion in the But Wait There's More section; you can roughly match them up by saying $r (x) = \sum_{i} f_{i} r_{i} (x)$ and $y_{i} (x) = (1 - f_{i}) r_{i} (x)$ , where $f_{i}$ is a real number corresponding to the amount that the agent cares about rewards obtained in episode $i$ and $r_{i}$ is the reward obtained in episode $i$ . Putting both of these into the sum gives $R (x) = \sum_{i} r_{i} (x)$ , the undiscounted, non-myopic reward that the agent eventually obtains.

In terms of the $R = R_{0} + R_{1}$ definition that I give in the uncertainty framing, this is $R_{0} = R (x, y_{0}) = \sum_{i} f_{i} r_{i} (x) + \sum_{i} (1 - f_{i}) r_{i} (x_{0})$ , and $R_{1} = R (x, y) - R (x, y_{0}) = \sum_{i} (1 - f_{i}) (r_{i} (x) - r_{i} (x_{0}))$ .

So if you let $r$ be a vector of the reward obtained on each step and $f$ be a vector of how much the agent cares about each step then $x \to x + ϵ \sum_{i} f_{i} \frac{\partial r_{i}}{\partial x}$ , and thus the change to the overall reward is $R \to R + ϵ \sum_{i} \frac{\partial r_{i}}{\partial x} \sum_{j} f_{j} \frac{\partial r_{j}}{\partial x}$ , which can be negative if the two sums have different signs.
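A small worked case makes the sign flip concrete. The sketch below uses scalar $x$ and made-up rewards $r_{0} (x) = x$ and $r_{1} (x) = - 3 x$ with $f = (1, 0)$: the agent steps in the $+x$ direction because it only cares about $r_{0}$, and the total reward falls by about $2 ϵ$.

```python
def myopic_update_effect(grads, f, eps):
    """First-order change in x and in R = sum_i r_i under the myopic step.

    grads[i] is dr_i/dx at the current point; f[i] is how much the agent
    cares about step i. Scalar x for simplicity.
    """
    step = eps * sum(fi * gi for fi, gi in zip(f, grads))   # x -> x + step
    delta_R = sum(grads) * step                             # (dR/dx) * step
    return step, delta_R

# Hypothetical example: r_0(x) = x, r_1(x) = -3x, agent only cares about step 0.
step, delta_R = myopic_update_effect(grads=[1.0, -3.0], f=[1.0, 0.0], eps=0.01)
```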

I was hoping that a point would reveal itself to me about now but I'll have to get back to you on that one.

comment by Tetraspace (tetraspace-grouping) · 2020-10-23T20:06:05.511Z · LW(p) · GW(p)

I have two questions on Metaculus that compare how good elements of a pair of cryonics techniques are: preservation by Alcor vs preservation by CI, and preservation using fixatives vs preservation without fixatives. They are forecasts of the value (% of people preserved with technique A who are revived by 2200)/(% of people preserved with technique B who are revived by 2200), which barring weird things happening with identity is the likelihood ratio of someone waking up if you learn that they've been preserved with one technique vs the other.

Interpreting these predictions in a way that's directly useful requires some extra work - you need some model for turning the ratio P(revival|technique A)/P(revival|technique B) into plain P(revival|technique X), which is the thing you care about when deciding how much to pay for a cryopreservation.

One toy model is to assume that one technique works (P(revival) = x), but the other technique may be flawed (P(revival) < x). If r < 1, it's the technique in the numerator that's flawed, and if r > 1, it's the technique in the denominator that's flawed. This is what I guess is behind the trimodality in the Metaculus community median: there are peaks at the high end, the low end, and at exactly 1, perhaps corresponding to one working, the other working, and both working.
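A sketch of how that toy model turns samples of r into a value for each technique. This is not the Colab notebook's code: `r_samples` stands in for samples drawn from the community distribution over the ratio, and the three sample values below are made up, one per mode.

```python
def expected_revival(r_samples, x=1.0):
    """EV of P(revival) for techniques A and B given samples of r = P(A)/P(B).

    Toy model: whichever technique is not flawed has P(revival) = x.
    r < 1 -> A is flawed: P(A) = r * x, P(B) = x
    r > 1 -> B is flawed: P(A) = x,     P(B) = x / r
    """
    n = len(r_samples)
    ev_a = sum(x * min(r, 1.0) for r in r_samples) / n
    ev_b = sum(x * min(1.0 / r, 1.0) for r in r_samples) / n
    return ev_a, ev_b

# Made-up samples of r: A flawed, both working, B flawed.
ev_a, ev_b = expected_revival([0.5, 1.0, 2.0])
```

Setting `x=1.0` corresponds to normalizing the working technique to 100%.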

For the current community medians (as of 2021-04-18), using that model, using the Ergo library, normalizing the working technique to 100%, I find:

Alcor vs CI:

EV(Preserved with Alcor) = 69%
EV(Preserved with Cryonics Institute) = 76%

Fixatives vs non-Fixatives

EV(Preserved using Fixatives) = 83%
EV(Preserved without using Fixatives) = 34%

(here's the Colab notebook)

comment by Tetraspace (tetraspace-grouping) · 2019-12-23T23:51:07.425Z · LW(p) · GW(p)

In the Parable of Predict-O-Matic [LW · GW], a subnetwork of the titular Predict-O-Matic becomes a mesa-optimiser and begins steering the future towards its own goals, independently of the rest of Predict-O-Matic. It does so in a way that sabotages the other subnetworks.

I am reminded of one specification problem that a run of Eurisko faced:

During one run, Lenat noticed that the number in the Worth slot of one newly discovered heuristic kept rising, indicating that Eurisko had made a particularly valuable find. As it turned out the heuristic performed no useful function. It simply examined the pool of new concepts, located those with the highest Worth values, and inserted its name in their My Creator slots.

One thing I wondered is whether this could happen in humans, and if not, why it doesn't. A simplified description of memory that I learned in a flash game is that "neural connections" are "strengthened" whenever they are "used", which sounds sort of like gradients in RL if you don't think about it too hard. Maybe the analogue of this would be some memory that "wants" you to remember it repeatedly at the expense of other memories. Trauma?

Replies from: Zack_M_Davis

↑ comment by Zack_M_Davis · 2019-12-24T00:55:26.394Z · LW(p) · GW(p)

Tulpas??

comment by Tetraspace (tetraspace-grouping) · 2023-04-10T14:03:33.097Z · LW(p) · GW(p)

$arg max$ , as a mathematical structure, is smarter than god and perfectly aligned to $U$ ; the value of $arg max U$ will never actually be $arg max V$ because $V$ is more objectively rational, or because you made a typo and it knows you meant to say $arg max V$ ; and no matter how complicated the mapping is from $a$ to $U (a)$ it will never fall short of giving the $a$ that gives the highest value of $U$ .

Which is why in principle you can align a superior being, like $arg max$ , or maybe like a superintelligence.
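In code, $arg max$ is about as literal-minded as it gets. A small sketch with a made-up action set and made-up utilities:

```python
def argmax(actions, U):
    """Return the action a that maximizes U(a), whatever U happens to be."""
    return max(actions, key=U)

actions = ["a1", "a2", "a3"]                 # hypothetical action set
U = {"a1": 1.0, "a2": 5.0, "a3": 3.0}.get    # the utility that was actually written down
V = {"a1": 4.0, "a2": 0.0, "a3": 2.0}.get    # the one that might have been "meant"
```

Passing `U` gets the maximizer of `U`, even when the maximizer of `V` is a different action.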

comment by Tetraspace (tetraspace-grouping) · 2020-01-02T03:08:52.116Z · LW(p) · GW(p)

Life 3.0 Liveblog/Review Thread

Prelude

The prologue begins with a short story called the Tale of the Omega Team. It's a wish-fulfilment pseudo-isekai about a bunch of effective altruist tech people working for not-Google called the Omegas who make an AGI and then use it to take over the world.

But a cybersecurity specialist on their team talked them out of the game plan [...] risk of Prometheus breaking out and seizing control of its own destiny [...] weren't sure how its goals would evolve [...] go to great lengths to keep Prometheus confined

For some reason, the Omegas in the story claim that Prometheus (the AI) might be unsafe, and then proceed to do things like have it write software which they then run on computers and let it produce long pieces of animated media and let it send blueprints of technologies to scientists. There is a cybersecurity expert in the team who just barely stops them from straight up leaving the whole thing unboxed, and I do not envy her job position.

(Prometheus is safe, it turns out, which I can tell because there are humans alive at the end of the story.)

[...] Omega-controlled [...] controlled by the Omegas [...] the Omegas harnessed Prometheus [...] the Omegas' [...] the Omegas' [...]

There's also another odd thing where it says that the Omegas are using Prometheus as a tool to do things, instead of what's clearly actually happening which is that Prometheus is achieving its goals with the Omegas being some lumps of atoms that it's been pushing around according to its whims, as it has been since they decided to switch it on.

All-in-all, I like it. It wouldn't be out of place on r/rational, if wish-fulfillment pseudo-isekai does happen then AGI sweeping aside the previous social order will be how (a real AGI would come close to some of the capabilities I've seen those protagonists have), and fiction about more plausible robopocalypses (or roboutopias) coming about is always great.

comment by Tetraspace (tetraspace-grouping) · 2019-08-02T16:50:01.941Z · LW(p) · GW(p)

In Against Against Billionaire Philanthropy, Scott says

The same is true of Google search. I examined the top ten search results for each donation, with broadly similar results: mostly negative for Zuckerberg and Bezos, mostly positive for Gates.

With Gates' philanthropy being about malaria, Zuckerberg's being about Newark schools, and Bezos' being about preschools.

Also, as far as I can tell, Moskovitz' philanthropy is generally considered positively, though of course I would be in a bubble with respect to this. Also also, though I say this without really checking, it seems that people are pretty much all against the Sacklers' donations to art galleries and museums.

Squinting at these data points, I can kind of see a trend: people favour philanthropy that's buying utilons [LW · GW], and are opposed to philanthropy that's buying status. They like billionaires funding global development more than they like billionaires funding local causes, and they like them funding art galleries for the rich least of all.

Which is basically what you'd expect if people were well-calibrated and correctly criticising those who need to be taken down a peg.

Replies from: Richard_Kennaway

↑ comment by Richard_Kennaway · 2019-08-03T15:55:55.089Z · LW(p) · GW(p)

and they like them funding art galleries for the rich least of all.

What are these art galleries "for the rich"? Your link mentions the National Gallery, the Tate Gallery, the Smithsonian, the Louvre, the Guggenheim, the Sackler Museum at Harvard, the Metropolitan Museum of Art, and the American Museum of Natural History as recipients of Sackler money. All of them are open to everyone. The first three are free and the others charge in the region of $15-$25 (as do the National Gallery and the Tate Gallery for special exhibitions, but not the bulk of their displays). The hostility to Sackler money has nothing to do with "how dare they be billionaires", but is because of the (allegedly) unethical practices of the pharmaceutical company that the Sacklers own and owe their fortune to. No-one had any problem with their donations before.

Which is basically what you'd expect if people were well-calibrated and correctly criticising those who need to be taken down a peg.

I see nothing correct in the ethics of the crab bucket.

comment by Tetraspace (tetraspace-grouping) · 2019-08-02T01:37:15.009Z · LW(p) · GW(p)

The simplicity prior is that you should assign a prior probability 2^-L to the description of length L. This sort of makes intuitive sense, since it's what you'd get if you generated the description through a series of coinflips...

... except there are 2^L descriptions of length L, so the total prior probability you're assigning is sum(2^L * 2^-L) = sum(1) = unnormalisable.
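The divergence is easy to check numerically: every length contributes exactly 1 to the total, so the mass up to length N is N. A tiny sketch (the function name is made up):

```python
def total_prior_mass(max_length):
    """Sum over L = 1..max_length of (number of bitstrings of length L) * 2**-L."""
    return sum((2**L) * 2.0**-L for L in range(1, max_length + 1))
```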

You can kind of recover this by noticing that not all bitstrings correspond to an actual description, and for some encodings their density is low enough that it can be normalised (I think the threshold is that less than 1/L descriptions of length L are "valid")...

...but if that's the case, you're being fairly information inefficient because you could compress descriptions further, and why are you judging simplicity using such a bad encoding, and why 2^-L in that case if it doesn't really correspond to complexity properly any more? And other questions in this cluster.

I am confused (and maybe too hung up on something idiosyncratic to an intuitive description I heard).

Replies from: FactorialCode

↑ comment by FactorialCode · 2019-08-02T03:16:32.549Z · LW(p) · GW(p)

Was this meant to be a reply to my bit about the Solomonoff prior?

If so, in the algorithmic information literature, they usually fix the unnormalizability stuff by talking about Prefix Turing machines. Which corresponds to only allowing TM descriptions that correspond to a valid Prefix Code.
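The reason prefix codes fix the normalization is Kraft's inequality: for any prefix-free set of codewords, the sum of 2^-length is at most 1. A small sketch with an example code chosen here for illustration:

```python
def is_prefix_free(codewords):
    """True if no codeword is a prefix of (or equal to) another codeword."""
    words = sorted(codewords)
    # After sorting, any prefix relation shows up between adjacent entries.
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))

def kraft_sum(codewords):
    """Sum of 2**-len(w); at most 1 for any prefix-free set (Kraft's inequality)."""
    return sum(2.0 ** -len(w) for w in codewords)

code = ["0", "10", "110", "111"]   # a small example prefix code
```

By contrast, the set of all bitstrings up to a given length is not prefix-free, and its sum grows with the length.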

But it is a good point that for steeper discounting rates, you don't need to do that.

Replies from: tetraspace-grouping

↑ comment by Tetraspace (tetraspace-grouping) · 2019-08-02T15:01:08.524Z · LW(p) · GW(p)

It was inspired by yours - when I read your post I remembered that there was this thing about Solomonoff induction that I was still confused about - though I wasn't directly trying to answer your question so I made it its own thread.

comment by Tetraspace (tetraspace-grouping) · 2023-04-03T19:54:14.492Z · LW(p) · GW(p)

"The AI does our alignment homework" doesn't seem so bad - I don't have much hope for it, but because it's a prosaic alignment scheme so someone trying to implement it can't constrain where Murphy shows up, rather than because it's an "incoherent path description".

A concrete way this might be implemented is

A language model is trained on a giant text corpus to learn a bunch of adaptations that make it good at math, and then fine-tuned for honesty. It's still being trained at a safe and low level of intelligence where honesty can be checked, so this gets a policy that produces things that are mostly honest on easy questions and sometimes wrong and sometimes gibberish and never superhumanly deceptive.^[1]
It's set to work producing conceptually crisp pieces of alignment math, things like expected utility theory or logical inductors, slowly on inspectable scratchpads and so on, with the dumbest model that can actually factor scientific research^[1], with human research assistants to hold their hand if that lets you make the model dumber. It does this, rather than engineering, because this kind of crisp alignment math is fairly uniquely pinned down so it can be verified, and it's easier to generate compared to any strong pivotal engineering task where you're competing against humans on their own ground so you need to be smarter than humans, so while it's operating in a more dangerous domain it's using a safer level of intelligence.^[1]
The human programmers then use this alignment math to make a corrigible thingy that has dangerous levels of intelligence that does difficult engineering and doesn't know about humans, while this time knowing what they're doing. Getting the crisp alignment math from parallelisable language models helps a lot and gives them a large lead time, because a lot of it's the alignment version of backprop where it would have taken a surprising amount of time to discover otherwise [LW · GW].

This all happens at safe-ish low-ish levels of intelligence (such a model would probably be able to autonomously self-replicate on the internet, but probably not reverse protein folding, which means that all the ways it could be dangerous are "well don't do that"s as long as you keep the code secret^[1]), with the actual dangerous levels of optimisation being done by something made by the humans using pieces of alignment math which are constrained down to a tiny number of possibilities.

EDIT 2023-07-25: A longer debate that I think is worth reading about the model that leads it to being an incoherent path description between Holden Karnofsky (pro) and Nate Soares (against) is here [LW · GW]; I hadn't read this as of writing this.

^[1] Unless it isn't; it's a giant pile of tensors, how would you know? But this isn't special to this use case.

comment by Tetraspace (tetraspace-grouping) · 2023-04-28T16:02:07.675Z · LW(p) · GW(p)

Even more recently I bought a new laptop [LW · GW]. This time, I made the same sheet, multiplied the score from the hard drive by because 512 GB is enough for anyone and that seemed intuitively the amount I prioritised extra hard drive space compared to RAM and processor speed, and then looked at the best laptop before sharply diminishing returns set in; this happened to be the HP ENVY 15-ep1503na 15.6" Laptop - Intel® Core™ i7, 512 GB SSD, Silver. This is because I have more money now, so I was aiming to maximise consumer surplus rather than minimise the amount I was spending.^[1]

Surprisingly, it came with a touch screen! That's just the kind of nice thing that laptops do nowadays, because as I concluded in my post, everything nice about laptops correlates with everything else so high/low end is an axis it makes sense to sort things on. Less surprisingly, it came with a graphics card, because ditto.

Unfortunately this high-end laptop is somewhat loud; probably my next one will be less loud, up to and including an explicit penalty for noise.

^[1] It would have been predictable, however, at the time that I bought that new laptop, that I would have had that much money at a later date. Which means that I should have just skipped straight to consumer surplus maxxing.

comment by Tetraspace (tetraspace-grouping) · 2023-03-03T20:23:00.428Z · LW(p) · GW(p)

Arbital gives a distinction between "logical decision theory" and "functional decision theory" as:

Logical decision theories are a class of decision theories that have a logical counterfactual (vs. the causal counterfactual that CDT has and the evidential counterfactual EDT has).
Functional decision theory is the type of logical decision theory where the logical counterfactual is fully specified, and correctly gives the logical consequences of "decision function X outputs action A".

More recently, I've seen in Decision theory does not imply that we get to have nice things [LW · GW]:

Logical decision theory is the decision theory where the logical counterfactual is fully specified.
Functional decision theory is the incomplete variant of logical decision theory where the logical consequences of "decision function X outputs action A" have to be provided by the setup of the thought experiment.

Any preferences? How have you been using it?

comment by Tetraspace (tetraspace-grouping) · 2019-12-12T21:18:18.607Z · LW(p) · GW(p)

Over the past few days I've been reading about reinforcement learning, because I understood how to make a neural network, say, recognise handwritten digits, but I wasn't sure how at all that could be turned into getting a computer to play Atari games. So: what I've learned so far. Spinning Up's Intro to RL probably explains this better.

(Brief summary, explained properly below: The agent is a neural network which runs in an environment and receives a reward. Each parameter in the neural network is increased in proportion to how much it increases the probability of making the agent do what it just did, and how good the outcome of what the agent just did was.)

Reinforcement learners play inside a game involving an agent and an environment. On turn $t$ , the environment hands the agent an observation $o_{t}$ , and the agent hands the environment an action $a_{t}$ . For an agent acting in realtime, there can be sixty turns a second; this is fine.

The environment has a transition function which takes an observation-action pair $o_{t} a_{t}$ and responds with a probability distribution over observations on the next timestep $o_{t + 1}$ ; the agent has a policy that takes an observation $o_{t}$ and responds with a probability distribution over actions to take $a_{t}$ .

The policy is usually written as $π$ , and the probability that $π$ outputs an action $a$ in response to an observation $o$ is $π (a | o)$ . In practice, $π$ is usually a neural network that takes observations as input and has actions as output (using something like a softmax layer to give a probability distribution); the parameters of this neural network are $θ$ , and the corresponding policy is $π_{θ}$ .

At the end of the game, the entire trajectory $τ = o_{1} a_{1} o_{2} a_{2} \dots o_{T} a_{T}$ is assigned a score, $R (τ)$ , measuring how well the agent has done. The goal is to find the policy $π_{θ}$ that maximises this score.

Since we're using machine learning to maximise, we should be thinking of gradient descent (strictly gradient ascent, since we're maximising), which involves finding the local direction in which to change the parameters $θ$ in order to increase the expected value of $R$ by the greatest amount, and then increasing them slightly in that direction.

In other words, we want to find $\nabla_{θ} E_{τ \sim π_{θ}} [R (τ)]$ .

Writing the expectation value in terms of a sum over trajectories, this is $\nabla_{θ} \sum_{τ \in D} (P (τ | θ) R (τ))$ = $\sum_{τ \in D} (\nabla_{θ} P (τ | θ) R (τ))$ , where $P (τ | θ)$ is the probability of observing the trajectory $τ$ if the agent follows the policy $π_{θ}$ , and $D$ is the space of possible trajectories.

The probability of seeing a specific trajectory happen is the product of the probabilities of each individual step on the trajectory happening, and is hence $P (τ | θ) = \prod_{t = 1}^{T} π_{θ} (a_{t} | o_{t}) E (o_{t} | a_{t - 1} o_{t - 1})$ where $E (o_{t + 1} | a_{t} o_{t})$ is the probability that the environment outputs the observation $o_{t + 1}$ in response to the observation-action pair $a_{t} o_{t}$ . Products are awkward to work with, but products can be turned into sums by taking the logarithm: $ln P (τ | θ) = \sum_{t = 1}^{T} (ln π_{θ} (a_{t} | o_{t}) + ln E (o_{t} | a_{t - 1} o_{t - 1}))$ .

The gradient of this is $\nabla_{θ} ln P (τ | θ) = \sum_{t = 1}^{T} (\nabla_{θ} ln π_{θ} (a_{t} | o_{t}) + \nabla_{θ} ln E (o_{t} | a_{t - 1} o_{t - 1}))$ . But what the environment does is independent of $θ$ , so the second term vanishes, and we have $\nabla_{θ} ln P (τ | θ) = \sum_{t = 1}^{T} \nabla_{θ} ln π_{θ} (a_{t} | o_{t})$ . The gradient of the policy is quite easy to find, since our policy is just a neural network so you can use back-propagation.

Our expression for the expectation value is just in terms of the gradient of the probability, not the gradient of the logarithm of the probability, so we'd like to express one in terms of the other.

Conveniently, the chain rule gives $\nabla_{θ} ln P (τ | θ) = \frac{1}{P (τ | θ)} \nabla_{θ} P (τ | θ)$ , so $\nabla_{θ} P (τ | θ) = P (τ | θ) \nabla_{θ} ln P (τ | θ)$ . Substituting this back into the original expression for the gradient gives

$\sum_{τ \in D} (P (τ | θ) \nabla_{θ} ln P (τ | θ) R (τ))$ ,

and substituting our expression for the gradient of the logarithm of the probability gives

$\sum_{τ \in D} (P (τ | θ) \sum_{t = 1}^{T} \nabla_{θ} ln π_{θ} (a_{t} | o_{t}) R (τ))$ .

Notice that this is the definition of the expectation value of $\sum_{t = 1}^{T} \nabla_{θ} ln π_{θ} (a_{t} | o_{t}) R (τ)$ over trajectories, so writing the sum as an expectation value again we get

$E_{τ \sim π_{θ}} [\sum_{t = 1}^{T} \nabla_{θ} ln π_{θ} (a_{t} | o_{t}) R (τ)]$ .

You can then find this expectation value easily by sampling a large number of trajectories (by running the agent in the environment many times), calculating the term inside the brackets, and then averaging over all of the runs.

Neat!
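The sampling estimate derived above can be sketched in a few lines. This is a minimal illustration and not a full RL algorithm: the two-observation, two-action environment, the toy reward (1 whenever the action matches the observation), the tabular softmax policy, and every function name are assumptions made up for the example.

```python
import math
import random

def policy_probs(theta, obs):
    """Softmax policy pi_theta(. | obs), for a table of logits theta[obs][action]."""
    top = max(theta[obs])
    weights = [math.exp(logit - top) for logit in theta[obs]]
    total = sum(weights)
    return [w / total for w in weights]

def sample_trajectory(theta, rng, horizon=3):
    """Run the policy once; return (sum_t grad ln pi(a_t | o_t), R(tau))."""
    score = [[0.0, 0.0], [0.0, 0.0]]
    total_reward = 0.0
    obs = rng.randrange(2)
    for _ in range(horizon):
        probs = policy_probs(theta, obs)
        action = 0 if rng.random() < probs[0] else 1
        # For a softmax, d ln pi(a | o) / d theta[o][b] = 1[a == b] - pi(b | o).
        for b in range(2):
            score[obs][b] += (1.0 if b == action else 0.0) - probs[b]
        total_reward += 1.0 if action == obs else 0.0  # toy reward (assumed)
        obs = rng.randrange(2)  # environment step: does not depend on theta
    return score, total_reward

def estimate_policy_gradient(theta, num_trajectories=5000, seed=0):
    """Average sum_t grad ln pi(a_t | o_t) * R(tau) over sampled trajectories."""
    rng = random.Random(seed)
    grad = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(num_trajectories):
        score, total_reward = sample_trajectory(theta, rng)
        for o in range(2):
            for b in range(2):
                grad[o][b] += score[o][b] * total_reward / num_trajectories
    return grad

if __name__ == "__main__":
    print(estimate_policy_gradient([[0.0, 0.0], [0.0, 0.0]]))
```

In this toy environment the exact gradient at $θ = 0$ can be worked out by hand: it is $+0.375$ for the logits of the matching actions and $-0.375$ for the others, so the sampled estimate can be compared against it.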

(More sophisticated RL algorithms apply various transformations to the reward to use information more efficiently, and use various gradient descent tricks so that the gradients acquired converge on the optimal parameters more efficiently.)

comment by Tetraspace (tetraspace-grouping) · 2019-08-26T19:15:41.503Z · LW(p) · GW(p)

Here are three statements I believe with a probability of about 1/9:

The two 6-sided dice on my desk, when rolled, will add up to 5.
An AI system will kill at least 10% of humanity before the year 2100.
Starvation was a big concern in ancient Rome's prime (claim borrowed from Elizabeth's Epistemic Spot Check [LW · GW] post).

Except I have some feeling that the "true probability" of the 6-sided die question is pretty much bang on exactly 1/9, but that the "true probability" of the Rome and AI xrisk questions could be quite far from 1/9 and to say the probability is precisely 1/9 seems... overconfident?

From a straightforward Bayesian point of view, there is no true probability. It's just my subjective degree of belief! I'd be willing to make a bet at 8/1 odds on any of these, but not at worse odds, and that's all there really is to say on the matter. It's the number I multiply by the utilities of the outcomes to make decisions.
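As a quick check of the arithmetic for the dice statement and the betting odds, here is a small sketch; the function names are made up for illustration.

```python
from fractions import Fraction
from itertools import product

def probability_of_sum(target, sides=6):
    """Exact probability that two fair dice add up to `target`."""
    outcomes = list(product(range(1, sides + 1), repeat=2))
    hits = sum(1 for a, b in outcomes if a + b == target)
    return Fraction(hits, len(outcomes))

def fair_odds_against(probability):
    """Odds against the event at which a bet on it has zero expected value."""
    return (1 - probability) / probability

if __name__ == "__main__":
    p = probability_of_sum(5)
    print(p, fair_odds_against(p))
```

Four of the 36 equally likely rolls sum to 5, which gives 1/9, and a degree of belief of 1/9 breaks even at exactly 8/1 odds.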

One thing you could do is imagine a set of hypotheses that I have that involve randomness, and then I have a probability distribution over which of these hypotheses is the true one, and by mapping each hypothesis to the probability it assigns to the outcome my probability distribution over hypotheses becomes a probability distribution over probabilities. This is sharply around 1/9 for the dice rolls, and widely around 1/9 for AI xrisk, as expected, so I can report 50% confidence intervals just fine. Except sensible hypotheses about historical facts probably wouldn't be random, because either starvation was important or it wasn't, that's just a true thing that happens to exist in my past, maybe.
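To make "sharply around 1/9" versus "widely around 1/9" concrete, here is a rough sketch that models each distribution over probabilities as a Beta distribution with mean 1/9. The Beta family and the specific parameters are purely illustrative assumptions, not something the argument depends on.

```python
import random

def belief_samples(alpha, beta, count=20000, seed=0):
    """Draw samples from an assumed Beta(alpha, beta) distribution over probabilities."""
    rng = random.Random(seed)
    return [rng.betavariate(alpha, beta) for _ in range(count)]

def central_interval(samples, mass=0.5):
    """Empirical central interval containing roughly `mass` of the samples."""
    ordered = sorted(samples)
    lower_index = int(len(ordered) * (1 - mass) / 2)
    upper_index = int(len(ordered) * (1 + mass) / 2) - 1
    return ordered[lower_index], ordered[upper_index]

# Both have mean alpha / (alpha + beta) = 1/9; only the spread differs.
dice = belief_samples(1000, 8000)   # sharp: hypothetical parameters
ai_risk = belief_samples(1, 8)      # wide: hypothetical parameters

if __name__ == "__main__":
    print("dice 50% interval:", central_interval(dice))
    print("AI risk 50% interval:", central_interval(ai_risk))
```

Both sets of samples average to about 1/9, but the 50% interval for the wide distribution is many times broader than the one for the sharp distribution.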

I like jacobjacob's interpretation of a probability distribution over probabilities [LW(p) · GW(p)] as an estimate of what your subjective degree of belief would be if you thought about the problem for longer (e.g. 10 hours). The specific time horizon seems a bit artificial (extreme case: I'm going to chat with an expert historian in 10 hours and 1 minute) but it does work and gives me the kind of results that make sense. The advantage of this is that you can quite straightforwardly test your calibration (there really is a ground truth): write down your 50% confidence interval, then actually do the 10 hours of research, and see how often the degree of belief you end up with lies inside the interval.
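The calibration test described above amounts to a simple count. A minimal sketch, with a made-up record format of (interval lower bound, interval upper bound, belief after research):

```python
def interval_hit_rate(records):
    """Fraction of records whose later belief landed inside the stated interval.

    Each record is a (lower, upper, later_belief) tuple; this format is an
    assumption for the example.
    """
    hits = sum(1 for lower, upper, later in records if lower <= later <= upper)
    return hits / len(records)

if __name__ == "__main__":
    # Hypothetical data, invented purely to show the call.
    example = [(0.05, 0.20, 0.10), (0.30, 0.50, 0.60),
               (0.10, 0.30, 0.30), (0.60, 0.80, 0.50)]
    print(interval_hit_rate(example))
```

If the 50% intervals are well calibrated, this rate should come out near 0.5 over many questions.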

comment by Tetraspace (tetraspace-grouping) · 2019-08-24T18:08:33.448Z · LW(p) · GW(p)

Imagine two prediction markets, both with shares that give you $1 if they pay out and $0 otherwise.

One is predicting some event in the real world (and pays out if this event occurs within some timeframe) and has shares currently priced at $X.

The other is predicting the behaviour of the first prediction market. Specifically, it pays out if the price of the first prediction market exceeds an upper threshold $T before it goes below a lower threshold $R.

Is there anything that can be said in general about the price of the second prediction market? For example, it feels intuitively like if T >> X, but R is only a little bit smaller than X, then assigning a high price to shares of the second prediction market violates conservation of evidence - is this true, and can it be quantified?
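One hedged way to approach this: if the first market's price behaves like a martingale (which is one way to formalise conservation of expected evidence, and is an assumption here rather than a fact about real markets) and moves in small steps without jumping past a threshold, then the optional stopping theorem gives the chance of reaching T before R as (X - R) / (T - R). The sketch below compares that formula with a simulated one-cent random walk; the function names and example prices are made up.

```python
import random

def hit_upper_first_probability(price, upper, lower):
    """Optional-stopping value for a martingale that cannot overshoot a threshold."""
    return (price - lower) / (upper - lower)

def simulate_hit_upper_first(price_cents, upper_cents, lower_cents,
                             trials=5000, seed=0):
    """Estimate the same probability with a fair +/- 1 cent random walk."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        current = price_cents
        while lower_cents < current < upper_cents:
            current += 1 if rng.random() < 0.5 else -1
        wins += current >= upper_cents
    return wins / trials

if __name__ == "__main__":
    # Hypothetical numbers: X = $0.50, T = $0.90, R = $0.45.
    print(hit_upper_first_probability(0.50, 0.90, 0.45))
    print(simulate_hit_upper_first(50, 90, 45))
```

Under these assumptions, with X = $0.50, T = $0.90 and R = $0.45, the second market should trade at about 5/45 = 1/9, which matches the intuition that a high price would be inconsistent when T is far above X and R is just below it.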

comment by Tetraspace (tetraspace-grouping) · 2020-11-12T12:46:50.247Z · LW(p) · GW(p)

Smarkets is currently selling shares in Trump conceding if he loses at 57.14%. The Good Judgment Project's superforecasters predict that any major presidential candidate will concede with probability 88%. I assign <30% probability to Biden conceding* (scenarios where Biden concedes are probably overwhelmingly ones where court cases/recounts mean states were called wrong, which Betfair assigns ~10% probability to, and FTX kind of** assigns 15% probability to, and even these seem high), so I think it's a good bet to take.

* I think that the Trump concedes if he loses market is now unconditional, because by Smarkets' standards (projected electoral votes from major news networks) Biden has won.

** Kind of, because some TRUMP shares expired at 1 TRUMPFEB share - $0.10, rather than $0 as expected, and some TRUMP shares haven't expired yet, because TRUMP holders asked. So it's possible that the value of a TRUMPFEB share might also include the value of a hypothetical TRUMPMAR share, or that TRUMPFEB trades will be nullified at some point, or some other retrospective rule change on FTX's part.

UPDATE 2020-11-16: Trump... kind of conceded? Emphasis mine:

He won because the Election was Rigged. NO VOTE WATCHERS OR OBSERVERS allowed, vote tabulated by a Radical Left privately owned company, Dominion, with a bad reputation & bum equipment that couldn’t even qualify for Texas (which I won by a lot!), the Fake & Silent Media, & more!

While he has retracted this, it met Smarkets' standards, so I'm £22.34 richer.
