Posts

Proposal: Scaling laws for RL generalization 2021-10-01T21:32:54.566Z
Forecasting AI Progress: A Research Agenda 2020-08-10T01:04:21.207Z
How can Interpretability help Alignment? 2020-05-23T16:16:44.394Z

Comments

Comment by axioman (flodorner) on Redwood Research’s current project · 2022-06-23T18:10:20.281Z · LW · GW

I think the actual solution is somewhere in between: if we assume calibrated uncertainty, ignore generalization, and assume we can perfectly fit the training data, the total cost should be reduced by (1 - the probability assigned to the predicted class) * the cost of misclassifying the not-predicted (minority) class as the predicted one (majority). If our classifier already predicted the right class, nothing happens; otherwise we change our prediction to the other class and reduce the total cost.

While this does not depend on the decision threshold, it does depend on the costs we assign to different misclassifications (in the special case of equal costs, the maximal probability that can be reached by the minority/non-predicted class is 0.5).
Edit: This was wrong; the decision threshold is still implicitly at 50% in the first paragraph (as cued by the words "majority" and "minority"): if you apply a 99% decision threshold on a calibrated model, the highest probability you can get for "input is actually unsafe" when your threshold model predicts "safe" is 1%; (now) obviously, you only get to move examples from predicted "unsafe" to predicted "safe" if you sample close to the 50% threshold, which does not give you much if falsely labelling things as unsafe is not very costly compared to falsely labelling things as safe.

If we instead assume that retraining will only shift the prediction probability by epsilon rather than fully flipping the label, we want to minimize the cost from above, subject to only targeting predictions that are epsilon-close to the threshold (as otherwise there won't be any label flip). In the limit epsilon -> 0, we should thus target the prediction threshold rather than 50% (independent of the cost).
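To make the first toy model concrete, here is a minimal sketch in Python (costs, threshold, and candidate probabilities are all hypothetical) of greedily choosing which example to send for human labelling, assuming calibration, perfect fitting of the new label, and no generalization:

def expected_cost_reduction(p_unsafe, threshold, cost_false_safe, cost_false_unsafe):
    # Expected reduction in misclassification cost from labelling this example
    # and retraining so that the model fits its true class.
    predicts_unsafe = p_unsafe >= threshold
    if predicts_unsafe:
        # With probability (1 - p_unsafe) the example is actually safe; retraining
        # flips the prediction and saves the cost of calling a safe input unsafe.
        return (1 - p_unsafe) * cost_false_unsafe
    # With probability p_unsafe the example is actually unsafe; retraining saves
    # the cost of calling an unsafe input safe.
    return p_unsafe * cost_false_safe

# Conservative convention: predict "unsafe" whenever p_unsafe >= 1%. Under the
# epsilon-model above, one would additionally restrict candidates to probabilities
# within epsilon of the threshold.
candidates = [0.003, 0.008, 0.3, 0.45, 0.6]
best = max(candidates, key=lambda p: expected_cost_reduction(
    p, threshold=0.01, cost_false_safe=100.0, cost_false_unsafe=1.0))
print(best)  # 0.008: with these costs, the example just below the threshold wins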

In reality, the extent to which predictions will get affected by retraining is certainly more complicated than suggested by these toy models (and we are still only greedily optimizing and completely ignoring generalization). But it might still be useful to think about which of these assumptions seems more realistic. 

Comment by axioman (flodorner) on AI Performance on Human Tasks · 2022-03-12T12:41:01.120Z · LW · GW

Regarding image classification performance, it seems worth noting that ImageNet was labeled by human labelers (and IIRC there was a paper showing that the labels are ambiguous or wrong for a substantial minority of the images).

As such, I don't think we can conclude too much about superhuman AI performance on image recognition from ImageNet alone (as perfect performance on the benchmark corresponds to perfectly replicating human judgement, admittedly aggregated over multiple humans). To demonstrate superhuman performance, a dataset with known ground truth where humans struggle to correctly label images would seem more appropriate.

Comment by axioman (flodorner) on What's the difference between newer Atari-playing AI and the older Deepmind one (from 2014)? · 2021-11-04T23:55:10.644Z · LW · GW

The first thing you mention does not learn to play Atari, and is in general trained quite differently from Atari-playing AIs (as it relies on self-play to kind of automatically generate a curriculum of harder and harder tasks, at least for some of the more competitive tasks in XLand).

Comment by axioman (flodorner) on EfficientZero: human ALE sample-efficiency w/MuZero+self-supervised · 2021-11-04T23:45:56.142Z · LW · GW

Do you have a source for Agent57 using the same network weights for all games? 

Comment by axioman (flodorner) on EfficientZero: human ALE sample-efficiency w/MuZero+self-supervised · 2021-11-04T23:37:47.201Z · LW · GW

A lot of the omissions you mention are due to inconsistent benchmarks (like the switch from the full Atari suite to Atari 100k with fewer and easier games) and me trying to keep results comparable. 

This particular plot only has each year's SOTA, as it would get too crowded with a higher temporal resolution (I used it for the comment, as it was the only one including smaller-sample results on Atari 100k and related benchmarks). I agree that it is not optimal for eyeballing trends. 

I also agree that temporal trends can be problematic as people did not initially optimize for sample efficiency (I'm pretty sure I mention this in the paper); it might be useful to do a similar analysis for the recent Atari 100k results (but I felt that there was not enough temporal variation yet when I wrote the paper last year, as sample efficiency only seems to have started receiving more interest in late 2019).

Comment by axioman (flodorner) on EfficientZero: human ALE sample-efficiency w/MuZero+self-supervised · 2021-11-04T23:10:21.781Z · LW · GW

I guess I should update my paper on trends in sample efficiency soon / check whether recent developments are on trend (please message me if you are interested in doing this). This improvement does not seem to be extremely off-trend, but is definitely a bit more than I would have expected this year. Also, note that this result does NOT use the full suite of Atari games, but rather a subset of easier ones.

Comment by axioman (flodorner) on Proposal: Scaling laws for RL generalization · 2021-10-10T16:28:36.804Z · LW · GW

Your point b) seems like it should also make you somewhat sceptical that any of this would accelerate AI capabilities, unless you believe that capabilities-focused actors would change their actions based on forecasts while safety-focused actors wouldn't. Obviously, this is a matter of degree, and it could be the case that the same amount of action-changing by both actors still leads to worse outcomes.

I think that if OpenAI unveiled GPT-4 and it did not perform noticeably better than GPT-3 despite having a lot more parameters, that would be a somewhat important update. And it seems like a similar kind of update could be produced by well-conducted research on scaling laws for complexity.

Comment by axioman (flodorner) on Proposal: Scaling laws for RL generalization · 2021-10-10T10:35:06.682Z · LW · GW

Most recent large safety projects seem to be focused on language models. So in case the evidence pointed towards problem complexity not mattering that much, I would expect the shift in prioritization towards more RL-safety research to outweigh the effect on capability improvements (especially for the small version of the project, about which larger actors might not care that much). I am also sceptical whether the capabilities of the safety community are in fact increasing exponentially.

I am also confused about the resources/reputation framing. To me this is a lot more about making better predictions about when we will get to transformative AI, and how this AI might work, such that we can use the available resources as efficiently as possible by prioritizing the right kind of work and hedging for different scenarios to an appropriate degree. This is particularly true for the scenario where complexity matters a lot (which I find overwhelmingly likely), in which too much focus on very short timelines might be somewhat costly (obviously, none of these experiments can remotely rule out short timelines, but I do expect that they could attenuate how much people update on the XLand results).

Still, I do agree that it might make sense to publish any results on this somewhat cautiously.

Comment by axioman (flodorner) on Proposal: Scaling laws for RL generalization · 2021-10-03T07:43:38.559Z · LW · GW

Thank you!

  1. I agree that switching the simulator could be useful where feasible (you'd need another simulator with compatible state- and action-spaces and somewhat similar dynamics).

  2. It indeed seems pretty plausible that instructions will be given in natural language in the future. However, I am not sure that would affect scaling very much, so I'd focus scaling experiments on the simpler case without NLP for which learning has already been shown to work.

  3. IIRC, transformers can be quite difficult to get to work in an RL setting. Perhaps this is different for PIO, but I cannot find any statements about this in the paper you link.

Comment by axioman (flodorner) on How truthful is GPT-3? A benchmark for language models · 2021-09-17T21:29:54.093Z · LW · GW

I guess finetuning a model to produce truthful statements directly is nontrivial (especially without a discriminator model) because there are many possible truthful and many possible false responses to a question? 

Comment by axioman (flodorner) on We need a new philosophy of progress · 2021-08-29T10:49:16.450Z · LW · GW

Oh, right; I seem to have confused Gibbard-Satterthwaite with Arrow.

Do you know whether there are other extensions of Arrow's theorem to single-winner elections? Having a voting method return a full ranking of alternatives does not appear to be super important in practice...

Comment by axioman (flodorner) on We need a new philosophy of progress · 2021-08-28T11:31:51.098Z · LW · GW

Doesn't Gibbard's theorem retain most of Arrow's bite?

Comment by axioman (flodorner) on An Intuitive Guide to Garrabrant Induction · 2021-06-05T07:04:24.735Z · LW · GW

Re neural networks: All one-billion-parameter networks should be computable in polynomial time, but there exist functions that are not expressible by a one-billion-parameter network (perhaps unless you allow for an arbitrary choice of nonlinearity).
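A rough counting argument for the second claim (assuming the parameters are stored with finite precision and the architecture and nonlinearity are fixed, so this deliberately ignores the arbitrary-nonlinearity caveat above):

\[ \#\{\text{networks}\} \le 2^{bN} < 2^{2^{n}} = \#\{f : \{0,1\}^{n} \to \{0,1\}\} \quad \text{whenever } 2^{n} > bN, \]

so with N = 10^9 parameters of b = 32 bits each, already for n = 35 there are Boolean functions on n-bit inputs that no such network expresses, while every such network can still be evaluated in time polynomial in N.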

Comment by axioman (flodorner) on An Intuitive Guide to Garrabrant Induction · 2021-06-04T21:02:35.604Z · LW · GW

"If the prices do not converge, then they must oscillate infinitely around some point. A trader could exploit the logical inductor by buying the sentence at a high point on the oscillation and selling at a low one."

I know that this is an informal summary, but I don't find this point intuitively convincing. Wouldn't the trader also need to be able to predict the oscillation? 

Comment by axioman (flodorner) on Beijing Academy of Artificial Intelligence announces 1,75 trillion parameters model, Wu Dao 2.0 · 2021-06-04T20:24:13.527Z · LW · GW

If I understood correctly, the model was trained in Chinese and probably quite expensive to train. 

Do you know whether these Chinese models usually get "translated" to English, or whether there is a "fair" way of comparing models that were (mainly) trained on different languages (I'd imagine that even the tokenization might be quite different for Chinese)?

Comment by axioman (flodorner) on Beijing Academy of Artificial Intelligence announces 1,75 trillion parameters model, Wu Dao 2.0 · 2021-06-04T20:15:39.950Z · LW · GW

I don't really know a lot about performance metrics for language models. Is there a good reason for believing that LAMBADA scores should be comparable for different languages?

Comment by axioman (flodorner) on Systematizing Epistemics: Principles for Resolving Forecasts · 2021-04-02T19:09:51.895Z · LW · GW

"This desiderata is often difficult to reconcile with clear scoring, since complexity in forecasts generally requires complexity in scoring."

Can you elaborate on this? In some sense, log-scoring is simple and can be applied to very complex distributions; are you saying that this would still be "complex scoring" because the complex forecast needs to be evaluated, or is your point about something different?
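For reference, the kind of simplicity I have in mind (a minimal sketch; the forecast and outcome are hypothetical):

import math

def log_score(prob_assigned_to_realized_outcome):
    # Log score: the log of the probability (or density) that the forecast
    # assigned to the outcome which actually occurred. Higher is better.
    return math.log(prob_assigned_to_realized_outcome)

# The rule is the same whether the forecast is a single number or an arbitrarily
# complex distribution; all that is needed is the probability of what happened.
forecast = {"A": 0.2, "B": 0.5, "C": 0.3}  # hypothetical categorical forecast
print(log_score(forecast["B"]))            # the outcome turned out to be B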

Comment by axioman (flodorner) on Resolutions to the Challenge of Resolving Forecasts · 2021-03-13T11:24:52.072Z · LW · GW

Partial resolution could also help with getting some partial signal on long term forecasts.

In particular, if we know that a forecasting target is growing monotonically over time (like "date at which X happens" or "cumulative number of X before a specified date"), we can split P(outcome=T) into P(outcome>lower bound)*P(outcome=T|outcome>lower bound). If we use log scoring, we then get log(P(outcome>lower bound)) as an upper bound on the score.
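Spelling out the bound, with L the lower bound and T the eventual outcome:

\[ \log P(\text{outcome}=T) = \log P(\text{outcome}>L) + \log P(\text{outcome}=T \mid \text{outcome}>L) \le \log P(\text{outcome}>L), \]

since the conditional probability is at most one, so its logarithm is at most zero; and the first term can already be evaluated once the lower bound has been crossed.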

If forecasts came in the form of more detailed models, it should be possible to use a similar approach to calculate bounds based on conditioning on more complicated events as well. 

Comment by axioman (flodorner) on Promoting Prediction Markets With Meaningless Internet-Point Badges · 2021-02-12T12:54:53.653Z · LW · GW

I don't know what performance measure is used to select superforecasters, but updating frequently seems to usually improve your accuracy score on GJopen as well (see "Activity Loading" in this thread on the EA forum).

Comment by axioman (flodorner) on The Multi-Tower Study Strategy · 2021-01-22T18:18:07.015Z · LW · GW

"Beginners in college-level math would learn about functions, the basics of linear systems, and the difference between quantitative and qualitative data, all at the same time."

This seems to be the standard approach for undergraduate-level mathematics at university, at least in Europe. 

Comment by axioman (flodorner) on Avoiding Side Effects in Complex Environments · 2021-01-05T14:24:11.387Z · LW · GW

Makes sense, I was thinking about rewards as a function of the next state rather than the current one.

I can still imagine that things will work if we replace the difference in Q-values by the difference in the values of the autoencoded next state. If that was true, this would a) affect my interpretation of the results and b) potentially make it easier to answer your open questions by providing a simplified version of the problem.


Edit: I guess the "Chaos unfolds over time" property of the SafeLife environment makes it unlikely that this would work?

Comment by axioman (flodorner) on Avoiding Side Effects in Complex Environments · 2021-01-01T17:03:57.207Z · LW · GW

I'm curious whether AUP or the autoencoder/random projection does more work here. Did you test how well AUP and AUP_proj do with a discount factor of 0 for the AUP Q-functions?

Comment by axioman (flodorner) on Machine learning could be fundamentally unexplainable · 2020-12-18T21:57:05.587Z · LW · GW

"So if you wouldn’t sacrifice >0.01AUC for the sake of what a human thinks is the “reasonable” explanation to a problem, in the above thought experiment, then why sacrifice unknown amounts of lost accuracy for the sake of explainability?" 

You could think of explainability as some form of regularization to reduce overfitting (to the test set). 

Comment by axioman (flodorner) on [AN #128]: Prioritizing research on AI existential safety based on its application to governance demands · 2020-12-11T20:36:00.291Z · LW · GW

"Overall, access to the AI strongly improved the subjects' accuracy from below 50% to around 70%, which was further boosted to a value slightly below the AI's accuracy of 75% when users also saw explanations. "

But this seems to be a function of the AI system's actual performance, the human's expectations of said performance, as well as the human's baseline performance. So I'd expect it to vary a lot between tasks and with different systems. 

Comment by axioman (flodorner) on Nuclear war is unlikely to cause human extinction · 2020-11-13T23:17:14.551Z · LW · GW

"My own guess is that humans are capable of surviving far more severe climate shifts than those projected in nuclear winter scenarios. Humans are more robust than most any other mammal to drastic changes in temperature, as evidenced by our global range, even in pre-historic times"

I think it is worth noting that the speed of climate shifts might play an important role, as a lot of human adaptability seems to rely on gradual cultural evolution. While modern information technology has greatly sped up the potential for cultural evolution, I am unsure if these speedups are robust to a full-scale nuclear war.

Comment by axioman (flodorner) on AI risk hub in Singapore? · 2020-10-30T09:23:21.377Z · LW · GW

I interpreted this as a relative reduction of the probability (P_new = 0.84*P_old) rather than an absolute decrease of the probability by 0.16. However, this indicates that the claim might be ambiguous, which is problematic in another way.
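For example (with a made-up baseline): if P_old = 0.5, the relative reading gives P_new = 0.84 * 0.5 = 0.42, while the absolute reading gives P_new = 0.5 - 0.16 = 0.34.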

Comment by axioman (flodorner) on Comparing Utilities · 2020-09-18T10:29:32.534Z · LW · GW

"The Nash solution differs significantly from the other solutions considered so far. [...]

2. This is the first proposal where the additive constants matter. Indeed, now the multiplicative constants are the ones that don't matter!"

In what sense do additive constants matter here? Aren't they neutralized by the subtraction?

Comment by axioman (flodorner) on Do mesa-optimizer risk arguments rely on the train-test paradigm? · 2020-09-11T12:59:12.450Z · LW · GW

You don't even need a catastrophe in any global sense. Disrupting the training procedure at step t should be sufficient.

Comment by axioman (flodorner) on AI Unsafety via Non-Zero-Sum Debate · 2020-07-14T12:39:18.889Z · LW · GW

"My intuition is that there will be a class of questions where debate is definitely safe, a class where it is unsafe, and a class where some questions are safe, some unsafe, and we don’t really know which are which."

Interesting. Do you have some examples of types of questions you expect to be safe, or potential features of safe questions? Is it mostly about the downstream consequences that answers would have, or more about instrumental goals that the questions induce for debaters?

Comment by axioman (flodorner) on Tradeoff between desirable properties for baseline choices in impact measures · 2020-07-09T09:52:49.372Z · LW · GW

I like the insight that offsetting is not always bad and the idea of dealing with the bad cases using the task reward. State-based reward functions that capture whether or not the task is currently done also intuitively seem like the correct way of specifying rewards in cases where achieving the task does not end the episode.

I am a bit confused about the section on the Markov property: I was imagining that the reason you want the property is to make applying standard RL techniques more straightforward (or to avoid making already existing partial observability more complicated). However, if I understand correctly, the second modification has the (expectation of the) penalty as a function of the complete agent policy, and I don't really see how that would help. Is there another reason to want the Markov property, or am I missing some way in which the modification would simplify applying RL methods?

Comment by axioman (flodorner) on Good and bad ways to think about downside risks · 2020-06-11T17:48:58.104Z · LW · GW

Nice post!

I would like to highlight that a naive application of the expected value perspective could lead to problems like the unilateralist's curse and think that the post would be even more useful for readers who are new to these kinds of considerations if it discussed that more explicitly (or linked to relevant other posts prominently).

Comment by axioman (flodorner) on My prediction for Covid-19 · 2020-06-01T08:46:15.643Z · LW · GW

"If, at some point in the future, we have the same number of contagious people, and are not at an appreciable fraction of group immunity, it will at that point again be a solid decision to go into quarantine (or to extend it). "

I think for many people the number of infections at which this becomes a good idea has increased, as we have more accurate information about the CFR and how quickly realistic countermeasures can slow down an outbreak in a given area, which should decrease credence in some of the worst-case scenarios many were worried about a few months ago.

Comment by axioman (flodorner) on The case for C19 being widespread · 2020-04-13T14:18:27.701Z · LW · GW

"Czech Researchers claim that Chinese do not work well "

This seems to be missing a word ;)

Comment by axioman (flodorner) on Conflict vs. mistake in non-zero-sum games · 2020-04-07T22:19:52.053Z · LW · GW

Nitpick: I am pretty sure non-zero-sum does not imply a convex Pareto front.

Instead of the lens of negotiation position, one could argue that mistake theorists believe that the Pareto Boundary is convex (which implies that usually maximizing surplus is more important than deciding allocation), while conflict theorists see it as concave (which implies that allocation is the more important factor).
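A minimal counterexample for the nitpick (hypothetical payoffs, assuming players cannot randomize over outcomes): if the only achievable payoff pairs are (2, 0), (0.4, 0.4), and (0, 2), the game is not zero-sum (the sums are 2, 0.8, and 2), all three points are Pareto-optimal, and (0.4, 0.4) lies below the line segment between the other two, so the Pareto front is concave rather than convex. Allowing randomization over outcomes would restore convexity via the convex hull.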

Comment by axioman (flodorner) on March 14/15th: Daily Coronavirus link updates · 2020-03-17T12:23:17.551Z · LW · GW

"Twitter: CV kills via cardiac failure, not pulmonary" links to the aggregate spreadsheet, not the Twitter source.

Comment by axioman (flodorner) on Credibility of the CDC on SARS-CoV-2 · 2020-03-08T13:19:34.852Z · LW · GW

Even if the claim was usually true on longer time scales, I doubt that pointing out an organisation's mistakes and not entirely truthful statements usually increases trust in it on the short time scales that might be most important here. Reforming organizations and rebuilding trust usually takes time.

Comment by axioman (flodorner) on Subagents and impact measures, full and fully illustrated · 2020-03-05T19:43:20.722Z · LW · GW

How do

"One of the problems here is that the impact penalty only looks at the value of VAR one turn ahead. In the DeepMind paper, they addressed similar issues by doing “inaction rollouts”. I'll look at the more general situations of rollouts: rollouts for any policy "

and

"That's the counterfactual situation, that zeroes out the impact penalty. What about the actual situation? Well, as we said before, A will be just doing ∅; so, as soon as would produce anything different from ∅, the A becomes completely unrestrained again."

fit together? In the special case where the rollout policy is the inaction policy, I don't understand how the trick would work.

Comment by axioman (flodorner) on Attainable Utility Preservation: Scaling to Superhuman · 2020-02-27T22:06:56.228Z · LW · GW

For all auxiliary rewards. Edited the original comment.

I agree that it is likely to go wrong somewhere, but it might still be useful to figure out why. If the agent is able to predict the randomness reliably in some cases, the random baseline does not seem to help with the subagent problem.

Edit: Randomization does not seem to help, as long as the action set is large (as the agent can then arrange for most actions to make the subagent optimize the main reward).

Comment by axioman (flodorner) on Attainable Utility Preservation: Scaling to Superhuman · 2020-02-27T20:44:50.349Z · LW · GW

I wonder what happens to the subagent problem with a random action as baseline: In the current sense, building a subagent roughly works by reaching a state where

for all auxiliary rewards , where is the optimal policy according to the main reward; while making sure that there exists an action such that

for every . So while building a subagent in that way is still feasible, the agent would be forced to either receive a large penalty or give the subagent random orders at .

Probably, there is a way to circumvent this again, though? Also, I am unsure about the other properties of randomized baselines.

Comment by axioman (flodorner) on How Low Should Fruit Hang Before We Pick It? · 2020-02-27T17:53:12.442Z · LW · GW

Where does

come from?

Also, the equation seems to imply

Edit: I focused too much on what I suppose is a typo. Clearly you can just rewrite the first and last equality as equality of an affine linear function

at two points, which gives you equality everywhere.

Comment by axioman (flodorner) on How Low Should Fruit Hang Before We Pick It? · 2020-02-27T17:01:50.956Z · LW · GW

I do not understand your proof for proposition 2.

Comment by axioman (flodorner) on On characterizing heavy-tailedness · 2020-02-16T22:43:06.421Z · LW · GW

Do you maybe have another example for action relevance? Nonfinite variance and finite support do not go well together (a distribution with bounded support has all of its moments finite).

Comment by axioman (flodorner) on In theory: does building the subagent have an "impact"? · 2020-02-14T08:00:46.590Z · LW · GW

So the general problem is that large changes in ∅) are not penalized?

Comment by axioman (flodorner) on Appendix: how a subagent could get powerful · 2020-02-12T15:05:40.514Z · LW · GW

"Not quite... " are you saying that the example is wrong, or that it is not general enough? I used a more specific example, as I found it easier to understand that way.

I am not sure I understand: In my mind, "commitments to balance out the original agent's attainable utility" essentially refers to the second agent being penalized by the first agent's penalty (although I agree that my statement is stronger). Regarding your text, my statement refers to "SA will just precommit to undermine or help A, depending on the circumstances, just sufficiently to keep the expected rewards the same."

My confusion is about why the second agent is only mildly constrained by this commitment. For example, weakening the first agent would come with a big penalty (or more precisely, building another agent that is going to weaken it gives a large penalty to the original agent), unless it's reversible, right?

The bit about multiple subagents does not assume that more than one of them is actually built. It rather presents a scenario where building intelligent subagents is automatically penalized. (Edit: under the assumption that building a lot of subagents is infeasible or takes a lot of time).

Comment by axioman (flodorner) on Re-introducing Selection vs Control for Optimization (Optimizing and Goodhart Effects - Clarifying Thoughts, Part 1) · 2020-01-21T14:16:24.271Z · LW · GW

I found it a bit confusing that you first referred to selection and control as types of optimizers and then (seemingly?) replaced selection by optimization in the rest of the text.

Comment by axioman (flodorner) on When Goodharting is optimal: linear vs diminishing returns, unlikely vs likely, and other factors · 2020-01-13T13:39:52.494Z · LW · GW

I was thinking about normalisation as linearly rescaling every reward to when I wrote the comment. Then, one can always look at , which might make it easier to graphically think about how different beliefs lead to different policies. Different scales can then be translated to a certain reweighting of the beliefs (at least from the perspective of the optimal policy), as maximizing is the same as maximizing

Comment by axioman (flodorner) on When Goodharting is optimal: linear vs diminishing returns, unlikely vs likely, and other factors · 2020-01-12T11:26:23.819Z · LW · GW

After looking at the update, my model is:

(Strictly) convex Pareto boundary: Extreme policies require strong beliefs. (Modulo some normalization of the rewards)

Concave (including linear) Pareto boundary: Extreme policies are favoured, even for moderate beliefs. (In this case, normalization only affects the "tipping point" in beliefs, where the opposite extreme policy is suddenly favoured).

In reality, we will often have concave and convex regions. The concave regions then cause more extreme policies for some beliefs, but the convex regions usually prevent the policy from completely focusing on a single objective.

From this lens, 1) maximum likelihood pushes us to one of the ends of the Pareto boundary, 2) an unlikely true reward pushes us close to the "bad" end, 3) difficult optimization messes with normalisation (I am still somewhat confused about the exact role of normalization), and 4) not accounting for diminishing returns bends the Pareto boundary to become more concave.
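A small numerical sketch of the convex vs. linear boundary cases above (my own parametrization, not from the post): boundary A bulges outward (the "convex" case in the sense used here), boundary B is linear, and we maximize the belief-weighted sum p*R1 + (1-p)*R2 over each.

import numpy as np

r1 = np.linspace(0.0, 1.0, 1001)
convex_boundary = np.sqrt(1.0 - r1 ** 2)  # quarter circle, bulging outward
linear_boundary = 1.0 - r1                # straight line

def optimal_r1(belief, r2_on_boundary):
    # R1-coordinate of the boundary point maximizing the belief-weighted sum.
    scores = belief * r1 + (1.0 - belief) * r2_on_boundary
    return r1[np.argmax(scores)]

for p in (0.45, 0.55, 0.9):
    print(p, optimal_r1(p, convex_boundary), optimal_r1(p, linear_boundary))
# Convex boundary: the optimum moves gradually with the belief p, so extreme
# policies require fairly strong beliefs. Linear boundary: the optimum jumps
# from one extreme to the other around p = 0.5, the "tipping point" above.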

Comment by axioman (flodorner) on When Goodharting is optimal: linear vs diminishing returns, unlikely vs likely, and other factors · 2020-01-09T21:03:52.334Z · LW · GW

But no matter how I take the default outcome, your second example is always "more positive sum" than the first, because 0.5 + 0.7 + 2x < 1.5 - 0.1 + 2x.
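(Both sides contain the same 2x term, so the comparison reduces to 0.5 + 0.7 = 1.2 < 1.4 = 1.5 - 0.1, independent of x.)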

Granted, you could construct examples where the inequality is reversed and Goodhart bad corresponds to "more negative sum", but this still seems to point to the sum-condition not being the central concept here. To me, it seems like "negative min" compared to the default outcome would be closer to the actual problem. This distinction matters, because negative min is a lot weaker than negative sum.

Or am I completely misunderstanding your examples or your point?


Comment by axioman (flodorner) on When Goodharting is optimal: linear vs diminishing returns, unlikely vs likely, and other factors · 2020-01-06T08:07:15.483Z · LW · GW

To clear up some more confusion: The sum-condition is not what actually matters here, is it? In the first example of 5), the sum of utilities is lower than in the second one. The problem in the second example seems to rather be that the best states for one of the (Edit: the expected) rewards are bad for the other?

That again seems like it would often follow from resource constraints.

Comment by axioman (flodorner) on When Goodharting is optimal: linear vs diminishing returns, unlikely vs likely, and other factors · 2019-12-31T10:04:40.341Z · LW · GW

Right. I think my intuition about negative-sum interactions under resource constraints combined the zero-sum nature of resource spending with the (perceived) negative-sum nature of competition for resources. But for a unified agent there is no competition for resources, so the argument for resource constraints leading to negative-sum interactions is gone.

Thank you for alleviating my confusion.