Posts

October 2019 gwern.net newsletter 2019-11-14T20:26:34.236Z · score: 12 (3 votes)
September 2019 gwern.net newsletter 2019-10-04T16:44:43.147Z · score: 22 (4 votes)
"AI-GAs: AI-generating algorithms, an alternate paradigm for producing general artificial intelligence", Clune 2019 2019-09-10T21:33:08.837Z · score: 14 (4 votes)
August 2019 gwern.net newsletter (popups.js demo) 2019-09-01T17:52:01.011Z · score: 12 (4 votes)
"Designing agent incentives to avoid reward tampering", DeepMind 2019-08-14T16:57:29.228Z · score: 29 (9 votes)
July 2019 gwern.net newsletter 2019-08-01T16:19:59.893Z · score: 24 (5 votes)
How Should We Critique Research? A Decision Perspective 2019-07-14T22:51:59.285Z · score: 49 (12 votes)
June 2019 gwern.net newsletter 2019-07-01T14:35:49.507Z · score: 30 (5 votes)
On Seeing Through 'On Seeing Through: A Unified Theory': A Unified Theory 2019-06-15T18:57:25.436Z · score: 27 (7 votes)
On Having Enough Socks 2019-06-13T15:15:21.946Z · score: 21 (6 votes)
May gwern.net newsletter 2019-06-01T17:25:11.740Z · score: 17 (5 votes)
"One Man's Modus Ponens Is Another Man's Modus Tollens" 2019-05-17T22:03:59.458Z · score: 34 (5 votes)
April 2019 gwern.net newsletter 2019-05-01T14:43:18.952Z · score: 11 (2 votes)
Recent updates to gwern.net (2017–2019) 2019-04-28T20:18:27.083Z · score: 36 (8 votes)
"Everything is Correlated": An Anthology of the Psychology Debate 2019-04-27T13:48:05.240Z · score: 49 (7 votes)
March 2019 gwern.net newsletter 2019-04-02T14:17:38.032Z · score: 19 (3 votes)
February gwern.net newsletter 2019-03-02T22:42:09.490Z · score: 13 (3 votes)
'This Waifu Does Not Exist': 100,000 StyleGAN & GPT-2 samples 2019-03-01T04:29:16.529Z · score: 39 (12 votes)
January 2019 gwern.net newsletter 2019-02-04T15:53:42.553Z · score: 15 (5 votes)
"Forecasting Transformative AI: An Expert Survey", Gruetzemacher et al 2019 2019-01-27T02:34:57.214Z · score: 17 (8 votes)
"AlphaStar: Mastering the Real-Time Strategy Game StarCraft II", DeepMind [won 10 of 11 games against human pros] 2019-01-24T20:49:01.350Z · score: 62 (23 votes)
Visualizing the power of multiple step selection processes in JS: Galton's bean machine 2019-01-12T17:58:34.584Z · score: 27 (8 votes)
Littlewood's Law and the Global Media 2019-01-12T17:46:09.753Z · score: 37 (8 votes)
Evolution as Backstop for Reinforcement Learning: multi-level paradigms 2019-01-12T17:45:35.485Z · score: 18 (4 votes)
December gwern.net newsletter 2019-01-02T15:13:02.771Z · score: 20 (4 votes)
Internet Search Tips: how I use Google/Google Scholar/Libgen 2018-12-12T14:50:30.970Z · score: 54 (13 votes)
November 2018 gwern.net newsletter 2018-12-01T13:57:00.661Z · score: 35 (8 votes)
October gwern.net links 2018-11-01T01:11:28.763Z · score: 31 (8 votes)
Whole Brain Emulation & DL: imitation learning for faster AGI? 2018-10-22T15:07:54.585Z · score: 15 (5 votes)
New /r/gwern subreddit for link-sharing 2018-10-17T22:49:36.252Z · score: 45 (13 votes)
September links 2018-10-08T21:52:10.642Z · score: 18 (6 votes)
Genomic Prediction is now offering embryo selection 2018-10-07T21:27:54.071Z · score: 39 (14 votes)
August gwern.net links 2018-09-25T15:57:20.808Z · score: 18 (5 votes)
July gwern.net newsletter 2018-08-02T13:42:16.534Z · score: 24 (8 votes)
June gwern.net newsletter 2018-07-04T22:59:00.205Z · score: 36 (8 votes)
May gwern.net newsletter 2018-06-01T14:47:19.835Z · score: 73 (14 votes)
$5m cryptocurrency donation to Alcor by Brad Armstrong in memory of LWer Hal Finney 2018-05-17T20:31:07.942Z · score: 48 (12 votes)
Tech economics pattern: "Commoditize Your Complement" 2018-05-10T18:54:42.191Z · score: 97 (27 votes)
April links 2018-05-10T18:53:48.970Z · score: 20 (6 votes)
March gwern.net link roundup 2018-04-20T19:09:29.785Z · score: 27 (6 votes)
Recent updates to gwern.net (2016-2017) 2017-10-20T02:11:07.808Z · score: 7 (7 votes)
The NN/tank Story Probably Never Happened 2017-10-20T01:41:06.291Z · score: 2 (2 votes)
Regulatory lags for New Technology [2013 notes] 2017-05-31T01:27:52.046Z · score: 5 (5 votes)
"AIXIjs: A Software Demo for General Reinforcement Learning", Aslanides 2017 2017-05-29T21:09:53.566Z · score: 4 (4 votes)
Keeping up with deep reinforcement learning research: /r/reinforcementlearning 2017-05-16T19:12:04.201Z · score: 3 (4 votes)
"The unrecognised simplicities of effective action #2: 'Systems engineering’ and 'systems management' - ideas from the Apollo programme for a 'systems politics'", Cummings 2017 2017-02-17T00:59:04.256Z · score: 9 (8 votes)
Decision Theory subreddit 2017-02-07T18:42:55.277Z · score: 6 (7 votes)
Rationality Heuristic for Bias Detection: Updating Towards the Net Weight of Evidence 2016-11-17T02:51:19.316Z · score: 10 (11 votes)
Recent updates to gwern.net (2015-2016) 2016-08-26T19:22:02.157Z · score: 27 (29 votes)
The Brain Preservation Foundation's Small Mammalian Brain Prize won 2016-02-09T21:02:02.585Z · score: 43 (45 votes)

Comments

Comment by gwern on Do we know if spaced repetition can be used with randomized content? · 2019-11-17T23:29:18.564Z · score: 9 (6 votes) · LW · GW

I proposed this idea years back as dynamic or extended flashcards. Spaced repetition works for learning abstractions (which studies presumably demonstrate as generalization from a training set of flashcards to a held-out validation set), so there doesn't seem to be any reason to expect SRS to fail when the testing flashcard set is itself very large or randomly generated. Khan Academy may be an example of this: they are supposed to use spaced repetition in scheduling reviews, apparently based on Leitner, and they also apparently use randomly generated or at least templated questions in some lessons (just mathematics?).
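To illustrate, a toy sketch of a 'dynamic flashcard' in Python: the card is a template instantiated with fresh random values at each review, so the learner must acquire the abstraction rather than memorize a fixed answer. The Leitner scheduling here is a minimal stand-in, not Khan Academy's actual algorithm:

```python
import random

# A "dynamic flashcard": a template, instantiated with fresh random
# values on every review, so only the abstraction can be learned.
def multiplication_card():
    a, b = random.randint(2, 12), random.randint(2, 12)
    return f"{a} x {b} = ?", str(a * b)

LEITNER_INTERVALS = [1, 2, 4, 8, 16]  # days between reviews, per box

def review(box):
    question, answer = multiplication_card()
    correct = input(f"{question} ") == answer
    # Correct answers promote the card one box; misses reset it.
    box = min(box + 1, len(LEITNER_INTERVALS) - 1) if correct else 0
    print(f"Next review in {LEITNER_INTERVALS[box]} day(s)")
    return box

if __name__ == "__main__":
    review(0)
```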

(Incidentally, while we're discussing spaced repetition variations, I'm also pleased with my idea of "anti-spaced repetition" as useful for reviewing notes or scheduling media consumption.)

Comment by gwern on An optimal stopping paradox · 2019-11-12T18:51:14.133Z · score: 4 (2 votes) · LW · GW

Claim that there should be a finite lifetime. You can't wait forever. If there is a finite lifetime, then the same decision analysis would tell you to procrastinate until the very end. This effectively is procrastinating forever. It does not converge to a reasonable finite waiting time as your lifetime goes to infinity.

If I am a quasi-immortal who will live millions or billions of years, with, apparently, zero discount rates, no risk, and nothing else I am allowed to invest in (no opportunity cost), why shouldn't I make investment decisions which take millions of years to mature (with astronomical loads of utility at the end as a payoff for my patience), and plan over periods that short-lived impatient mayflies like yourself can scarcely comprehend?

Comment by gwern on Experiments and Consent · 2019-11-12T03:08:25.889Z · score: 23 (5 votes) · LW · GW

The claim was that A/B testing was "not as good a tool for measuring long term changes in behavior" and I'm saying that A/B testing is a very good tool for that purpose.

And the paper you linked showed that it wasn't being done for most of Google's history. If Google doesn't do it, I doubt that anyone, even a peer like Amazon, does. Is it such a good tool if no one uses it?

By 2013 they were certainly already taking into account long-term value, even on mobile (which was pretty small until just around 2013). This section isn't saying "we set the threshold for the number of ads to run too high" but "we were able to use our long-term value measurements to better figure out which ads not to run".

Which is just another way of saying that before then, they hadn't used their long-term value measurements to figure out what threshold of ads to run. Whether 2015 or 2013, this is damning. (As are, of course, the other ones I collate, with the exception of Mozilla, who don't dare make an explosive move like shipping adblockers installed by default, so the VoI to them is minimal.)

The result which would have been exculpatory is if they said, "we ran an extra-special long-term experiment to check we weren't fucking up anything, and it turns out that, thanks to all our earlier long-term experiments dating back many years which were run on a regular basis as a matter of course, we had already gotten it about right! Phew! We don't need to worry about it after all. Turns out we hadn't A/B-tested our way into a user-hostile design by using wrong or short-sighted metrics. Boy it sure would be bad if we had designed things so badly that simply reducing ads could increase revenue so much." But that is not what they said.

Comment by gwern on Experiments and Consent · 2019-11-11T22:07:25.878Z · score: 19 (5 votes) · LW · GW

And, as that paper inadvertently demonstrates (among others, including my own A/B testing), most companies manage to not run any of those long-term experiments and do things like overload ads to get short-term revenue boosts at the cost of both user happiness and their own long-term bottom line.

That includes Google: note that at the end of a paper published in 2015, for a company which has been around for a while in the online ad business, let us say, they are shocked to realize they are running way too many ads and can boost revenue by cutting ad load.

Ads are the core of Google's business and the core of all A/B testing as practiced. Ads are the first, second, third, and last thing any online business will A/B test, and if there's time left over, maybe something else will get tested. If even Google can fuck that up for so long so badly, what else are they fucking up UI-wise? A fortiori, what else is everyone else online fucking up even worse?

Comment by gwern on Pieces of time · 2019-11-11T18:53:40.635Z · score: 23 (9 votes) · LW · GW

One of the unexpected side-effects I noticed while doing Uberman polyphasic sleep in my various failed attempts way back in 2009 or so was an unpleasant sensation of being unmoored in time: with a lot of little naps wrapping around the clock, there were no clear 'start' or 'end' times, just one day sliding into another. (I get a similar feeling, at a much lower level, when I travel in the Midwest.) The chronic tiredness and mental dullness from the polyphasic sleep didn't help either.

Comment by gwern on Recent updates to gwern.net (2016-2017) · 2019-11-10T19:46:52.490Z · score: 3 (1 votes) · LW · GW

Secondly, if your interpretation were his intended one, he could have done any number of things to suggest it!

He did do any number of things to suggest it!

Nor do any of his out-of-universe quotes indicate he misunderstands. For example, just recently the topic of time travel came up on Hsu's podcast and Chiang says

...the first Terminator film does posit a fixed timeline. And you know, this is something I'm interested in, and yeah, there's a sense in which "What's Expected Of Us" falls into this category, also the story "The Merchant and the Alchemist's Gate" falls into this category, and there's even a sense in which for my first collection, "Story of Your Life", falls in this category.

Actually being able to see the future, in terms of information flowing backwards in a self-consistent timeline, is what "What's Expected Of Us" considers; and "The Merchant and the Alchemist's Gate" uses physical movement backwards. What, then, makes "Story of Your Life" not in the same category as either of those (especially the former) and in fact doing something so different that it has to be qualified with the very vague 'even a sense in which'? (Because in "Story", the 'time travel' is pseudo time travel, involving neither information nor matter moving backwards in time, and is purely a psychological perspective.)

If your interpretation were Chiang's, he would have to be intentionally misdirecting the audience to a degree you only see from authors like Nabokov, and not leaving any clue except for flawed science, which is common enough for dramatic license in science fiction that it really can't count as a clue. I doubt Chiang is doing that.

I don't mind comparing Chiang with a writer like Nabokov. Nabokov is like Chiang in some ways - for example, they are both very interested in science (eg Nabokov's contributions to lepidopterology).

It's just a more boring story the way you see it!

I strongly disagree. Making Louise some sort of 'Cassandra' with handwavy woo quantum SF is thoroughly boring. The psychological version is much more interesting and far more worthy of 'speculative fiction' and Chiang's style of worldbuilding.

Comment by gwern on For the metaphors · 2019-11-10T00:38:25.509Z · score: 7 (3 votes) · LW · GW

Wittgenstein has another similar metaphor (Zettel, pg 934):

447. Disquiet in philosophy might be said to arise from looking at philosophy wrongly, seeing it wrong, namely as if it were divided into (infinite) longitudinal strips instead of into (finite) cross strips. This inversion in our conception produces the greatest difficulty. So we try as it were to grasp the unlimited strips and complain that it cannot be done piecemeal. To be sure it cannot, if by a piece one means an infinite longitudinal strip. But it may well be done, if one means a cross-strip.

--But in that case we never get to the end of our work!--Of course not, for it has no end. (We want to replace wild conjectures and explanations by quiet weighing of linguistic facts.)

Comment by gwern on Building Intuitions On Non-Empirical Arguments In Science · 2019-11-09T02:58:59.246Z · score: 7 (3 votes) · LW · GW

"There is no view from nowhere." Your mind was created already in motion and thinks, whether you want it to or not, and whatever ontological assumptions it may start with, it has pragmatically already started with them years before you ever worried about such questions. Your Neurathian raft has already been replaced many times over on the basis of decisions and outcomes.

Comment by gwern on Normative reductionism · 2019-11-05T20:39:19.227Z · score: 5 (2 votes) · LW · GW

Sounds like the Markov property.

Comment by gwern on [Question] When Do Unlikely Events Should Be Questioned? · 2019-11-04T20:22:06.681Z · score: 4 (2 votes) · LW · GW

I don't really know. The likelihood of 'generating an amusing coincidence you can post on social media' is clearly quite high: your 1/160,000 examines merely one kind of amusement, and so is obviously an extremely loose lower bound. The more kinds of coincidences you enumerate, the bigger the total likelihood becomes, especially considering that people may be motivated to manufacture stories. Countless examples (but here's a fun recent example on confabulating stories for spurious candidate-gene hits). The process is so heterogeneous and differs so much by area (be much more skeptical of hate crime reports than rolling nat 20s) that I don't think there's really any general approach other than to define a reference class, collect a sample, factcheck, and see how many turn out to be genuine... A lot of SSC posts go into the trouble we have with things like this, such as the 'lizardman constant' or rape accusation statistics.

Personally, considering how many rounds there are in any D&D game, how often one does a check, how many games are constantly being run, and how many people you know within 1 or 2 hops on social media, a lower bound of 1/160,000 for a neutral event is already more than frequent enough for me to not be all that skeptical; as Littlewood notes of his own examples, many involving gambling, on a national basis such things happen frequently.
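For a sense of scale, a Littlewood-style back-of-the-envelope; every number below is an invented round figure for illustration, not a real estimate:

```python
# How often should a 1-in-160,000 dice coincidence surface somewhere?
p = 1 / 160_000           # chance of the coincidence per opportunity
rolls_per_session = 100   # d20 rolls in a game night (assumption)
sessions_per_week = 1
players = 5_000_000       # active players, order-of-magnitude guess

opportunities = rolls_per_session * sessions_per_week * players
print(f"{opportunities * p:.0f} such coincidences expected per week")
# -> ~3125 per week: rare for any one table, routine for the internet.
```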

Comment by gwern on [Question] When Do Unlikely Events Should Be Questioned? · 2019-11-03T18:58:57.808Z · score: 13 (3 votes) · LW · GW

I don't believe there is any such estimate because it is fundamentally derivative of human psychology and numerology and culture. Why is 168 a remarkable number but 167 is not? Because of an accident of Chinese telephones. And so on. There is no formula for those. Look at Littlewood's examples or Diaconis & Mosteller 1989. These things do happen.

And you can expand the space of possibilities even more. What if the same person rolls the same number 4 times in a row within a turn? Across 4 turns? What if the first player rolls a number, then the next player rolls the same number, and so on? Would not all of those be remarkable? And note that it would be incorrect to compute 'p^4', because you are looking at a sliding window over an indefinitely long series of rolls: anywhere in that series could be the start of a run of good luck; every roll offers the potential to start a run.
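A quick Monte Carlo sketch of the sliding-window point, with arbitrary illustrative numbers:

```python
import random

# Naive reasoning says P(four 20s in a row) = (1/20)**4 = 1/160,000,
# but over a long series every roll can start a run, so the chance of
# such a run appearing *somewhere* is far larger.
def has_run(n_rolls, run_len=4, sides=20, target=20):
    streak = 0
    for _ in range(n_rolls):
        streak = streak + 1 if random.randint(1, sides) == target else 0
        if streak >= run_len:
            return True
    return False

N_ROLLS, TRIALS = 10_000, 2_000
hits = sum(has_run(N_ROLLS) for _ in range(TRIALS))
print(f"P(run of four 20s within {N_ROLLS} rolls) ~ {hits / TRIALS:.3f}")
# ~0.06, i.e. roughly N_ROLLS/160,000 -- not p**4 at all.
```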

Comment by gwern on Rationality Quotes: July 2010 · 2019-11-01T22:13:51.929Z · score: 8 (3 votes) · LW · GW

I can't find any source for this, so it may be apocryphal.

Comment by gwern on Is requires ought · 2019-10-28T23:17:46.883Z · score: 5 (2 votes) · LW · GW

There's also quantum decision theory. The way I'd put this is, "beliefs are for actions".

Comment by gwern on [deleted post] 2019-10-21T17:49:20.578Z

Here is a proposed solution. Let X be the traditional signal, Y be the new signal, and Z be the trait(s) being advertised by both. Let people continue doing X, but subsidize Y on top of X for people with very high Z. Soon Y is a signal of higher Z than X is, and understood by the recipients of the signals to be a better indicator. People who can’t afford to do both should then prefer Y to X, since Y is a stronger signal, and since it is more socially efficient it is likely to be less costly for the signal senders.

I don't see how this validates the new test. If you filter it like that, then you have range-restriction problems: you've shown the scores of the MIT graduates are such-and-such, but you haven't shown that the scores of the non-MIT graduates are less than that, and so you haven't shown that it filters at all, much less that it filters as well as 'MIT vs non-MIT' does. (Imagine the test is 'how tall you are in centimeters'.) You still also have the adverse selection problem for the new test.

Comment by gwern on Billion-scale semi-supervised learning for state-of-the-art image and video classification · 2019-10-19T22:39:20.592Z · score: 7 (3 votes) · LW · GW

I'm not sure. Typically, the justification for these sorts of distillation/compression papers is purely compute: the original teacher model is too big to run on a phone or as a service (Hinton), or too slow, or would be too big to run at all without 'sharding' it somehow, or it fits but training it to full convergence would take too long (Gao). You don't usually see arguments that the student is intrinsically superior in intelligence and so 'amplified' in any kind of AlphaGo-style way which is one of the more common examples for amplification. They do do something which sorta looks iterated by feeding the pseudo-labels back into the same model:

In order to achieve the state of the art, our researchers used the weakly supervised ResNeXt-101-32x48 model as a teacher model to select pretraining examples from the same data set of one billion hashtagged images. The target ResNet-50 model is pretrained with the selected examples and then fine-tuned with the ImageNet training data set. The resulting semi-weakly supervised ResNet-50 model achieves 81.2 percent top-1 accuracy. This is the current state of the art for the ResNet-50 ImageNet benchmark model. The top-1 accuracy is 3 percent higher than the (weakly supervised) ResNet-50 baseline, which is pretrained and fine-tuned on the same data sets with exactly the same training data set and hyper-parameters.

But this may top out at one or two iterations, and they don't demonstrate that this would be better than any other clearly non-iterated semi-supervised learning method (like MixMatch).
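To make the quoted recipe concrete, here is a toy PyTorch sketch of the teacher -> pseudo-label -> student pipeline; the linear models and random tensors are stand-ins for ResNeXt/ResNet on real images, not an implementation of the paper:

```python
import torch
import torch.nn as nn

# Stand-in models: in the paper these are ResNeXt-101 and ResNet-50.
teacher = nn.Linear(32, 10).eval()
student = nn.Linear(32, 10)
opt = torch.optim.SGD(student.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

unlabeled = torch.randn(512, 32)            # stand-in for hashtag images
with torch.no_grad():                       # 1. teacher pseudo-labels
    pseudo = teacher(unlabeled).argmax(dim=1)

for _ in range(10):                         # 2. pretrain the student
    opt.zero_grad()
    loss_fn(student(unlabeled), pseudo).backward()
    opt.step()

labeled = torch.randn(64, 32)               # stand-in for ImageNet
labels = torch.randint(0, 10, (64,))
for _ in range(10):                         # 3. fine-tune on real labels
    opt.zero_grad()
    loss_fn(student(labeled), labels).backward()
    opt.step()
```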

Comment by gwern on Billion-scale semi-supervised learning for state-of-the-art image and video classification · 2019-10-19T17:56:40.095Z · score: 3 (1 votes) · LW · GW

a sort-of-similar-to amplification component where a higher capacity teacher decides how to train a lower capacity student model. This is the first example I've seen of this overseer/machine-teaching style approach scaling up to such a data-hungry classification task.

What's special there is the semi-supervised part (the training on unlabeled data to get pseudo-labels to then use in the student model's training). Using a high capacity teacher on hundreds of millions of images is not all that new: for example, Google was doing that on its JFT dataset (then ~100m noisily-labeled images) back in at least 2015, given "Distilling the Knowledge in a Neural Network", Hinton, Vinyals & Dean 2015. Or Gao et al 2017 which goes the other direction and tries to distill dozens of teachers into a single student using 400m images in 100k classes.

(See also: Gross et al 2017/Sun et al 2017/Gao et al 2017/Shazeer et al 2018/Mahajan et al 2018/Yalniz et al 2019 or GPipe scaling to 1663-layer/83.4b-parameter Transformers)

Comment by gwern on (a) · 2019-10-17T14:42:06.982Z · score: 5 (2 votes) · LW · GW

But even "(a)" probably won't spread far (10-20 seconds of friction per link is too much for almost everyone). Maybe there's room for a company doing this as a service...

If adoption is your only concern, doing it website by website is hopeless in the first place. Your only choice is creating some sort of web browser plugin to do it automatically.

Comment by gwern on (a) · 2019-10-17T02:28:39.662Z · score: 6 (3 votes) · LW · GW

Certainly there are links which are regularly updated, like Wikipedia pages. They should be whitelisted. There are others which wouldn't make any sense to archive, stuff like services or tools - something like Waifu Labs which I link in several places wouldn't make much sense to 'archive' because the entire point is to interact with the service and generate images.

But examples like blogs or LW pages make sense to archive after a particular timepoint. For example, many blogs or websites like Reddit lock comments after a set number of days. Once that's passed, typically nothing in the page will change substantially except to be deleted. I think most of my links to blogs are of that type.

Even on LW, where threads can be necroed at any time, how often does anyone comment on an old post, and if your archived copy happens to omit some stray recent comments, how big a deal is that? Acceptable collateral damage compared to a website where 5 or 10% of links are broken and the percentage keeps increasing with time, I'd say...

For this issue, you could implement something like a 'first seen' timestamp in your link database and only create the final archive & substitute the link after a certain time period - I think a period like 3 months would capture 99% of the changes which are ever going to be made, while not risking exposing readers to too much linkrot.
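A minimal sketch of that 'first seen' bookkeeping, assuming a simple JSON file as the link database (the file name and 90-day window are arbitrary):

```python
import json
import time

SETTLE_SECONDS = 90 * 24 * 3600   # ~3 months before archiving
DB = "link-first-seen.json"

def ready_to_archive(url, now=None):
    """Record when a link was first seen; True once it has settled."""
    now = now or time.time()
    try:
        db = json.load(open(DB))
    except FileNotFoundError:
        db = {}
    first_seen = db.setdefault(url, now)
    json.dump(db, open(DB, "w"))
    return now - first_seen >= SETTLE_SECONDS

print(ready_to_archive("https://www.example.com/post"))
```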

Comment by gwern on (a) · 2019-10-17T00:39:14.986Z · score: 36 (8 votes) · LW · GW

I'm not a fan of the (a) notation, or any link annotation for providing an archive link, since it's not useful for the reader. An archive link is infrastructure: most readers don't care about whether a link is an archive link or not. They might care about whether it's a PDF or not, or whether it's a website they like or dislike, but they don't care whether a link works or not - the link should Just Work. 'Silence is golden.'

Any new syntax should support either important semantics or important functionality. But this doesn't support semantics, because it essentially doubles the overhead of every link by using up 4 characters (space, 2 parentheses, and 'a') without, unlike regular link annotations (1 character length), actually providing any new information (since you can generally assume that a link is available in the IA or elsewhere if the author cares enough about the issue to do something like include (a) links!). And the functionality is one that will be rarely exercised by users, who will click on only a few links and will click on the archived version for only a small subset of said links, unless link rot is a huge issue - in which case, why are you linking to the broken link at all instead of the working archived version? (Users can also implement this client-side by a small JS which inserts a stub link to IA, which is another red flag: if something is so simple and mindless that it can be done client-side at runtime, what value is it adding, really?)

So my opinion on archive links is that you should ensure that archives exist at all, you should blacklist domains which are particularly unreliable (I have a whole list of domains I distrust in my lint script, like SSRN or ResearchGate or Medium) and pre-emptively link to archived versions of them to save everyone the grief, and you should fix other links as they die. But simply mindlessly linking to archive links for all external links adds a lot of visual clutter and is quite an expense.


One place where archive links might make sense is pages you can't or won't update but where you also don't want to pre-emptively use archive links for everything. If you are writing an academic paper, say, the journal will not let you edit the PDF as links die, and you ideally want it to be as useful in a few decades as it is now; academic norms tend to frown on not linking the original URL, so you are forced to include both. (Not particularly applicable to personal websites like Zuck's or Guzey's or mine.)

Another idea is a more intelligent hybrid: include the archive links only on links suspected or known to be dead. For example, at compile time, links could be checked every once in a while, and broken links would get an annotation; ideally the author would fix all such links, but of course in practice many broken links will be neglected indefinitely... This could also be done at runtime: when the user mouses over a link, similar to my link previews, a JS library could prefetch the page and, if there is an outright error, rewrite it to an archive link. (Even fancier: links could come with perceptual hashes, and the JS library could prefetch and check that the page looks right - this catches edge cases like paywalls, silent redirects, deletions, etc.)
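The compile-time half might look something like this sketch using the requests library; note that, as said above, real linkrot is often a soft 404 or a redirect-to-homepage that no status-code check will catch, hence the perceptual-hash idea:

```python
import requests

def find_dead_links(urls):
    """HEAD every external link; return those that error outright."""
    dead = []
    for url in urls:
        try:
            r = requests.head(url, allow_redirects=True, timeout=15)
            if r.status_code >= 400:
                dead.append((url, r.status_code))
        except requests.RequestException as e:
            dead.append((url, str(e)))
    return dead

print(find_dead_links(["https://en.wikipedia.org/wiki/Link_rot"]))
```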


An open question for me is whether it makes sense to not pre-emptively archive everything. This is a conclusion I have tried to avoid in favor of a more moderate strategy of being careful in linking and repairing dead links, but as time passes, I increasingly question this. I find it incredibly frustrating just how many links continuously die on gwern.net, and die in ways that are silent and do not throw errors, like redirecting PDFs to homepages. (Most egregiously, yesterday I discovered from a reader complaint that CMU's official library repository of papers has broken all PDF links and silently redirected them to the homepage. WTF!) It is only a few percent at most (for now...), but I wish it were zero percent. I put a great deal of work into my pages, making them well-referenced, well-linked, thorough, and carefully edited down to the level of spelling errors and nice typography and link popup annotations etc, and then random external links die in a way I can't detect and mar the experience for readers! Maaaaan.

When I think about external links, it seems that most of them can be divided into either reliable links which will essentially never go down (Wikipedia) and static links which rarely or never change and could just as well be hosted on gwern.net (practically every PDF link ever). The former can be excluded from archiving, but why not just host all of the latter on gwern.net to begin with? It wouldn't be hard to write a compile-time Pandoc plugin to automatically replace every external link not on a whitelist with a link to a local ArchiveBox static mirror. It might cost a good 30GB+ of space to host ~10k external links for me, but that's not that much, prices fall a little every year, and it would save a great deal of time and frustration in the long run.
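Such a filter might look roughly like this sketch using the panflute library; the whitelist and the /mirror/SHA1 layout are invented for illustration, not a description of any existing setup:

```python
import hashlib
import panflute as pf

WHITELIST = ("en.wikipedia.org", "gwern.net")

def localize(elem, doc):
    # Rewrite any external link not on the whitelist to a local mirror
    # path derived from the URL's SHA1 (an arbitrary naming scheme).
    if isinstance(elem, pf.Link) and elem.url.startswith("http"):
        if not any(domain in elem.url for domain in WHITELIST):
            elem.url = "/mirror/" + hashlib.sha1(elem.url.encode()).hexdigest()
    return elem

if __name__ == "__main__":
    pf.run_filter(localize)   # invoked as: pandoc --filter ./localize.py
```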

Comment by gwern on Daniel Kokotajlo's Shortform · 2019-10-15T16:12:40.266Z · score: 9 (4 votes) · LW · GW

Expensive specialized tools are themselves learned by and embedded inside an agent to achieve goals. They're simply meso-optimization in another guise. eg AlphaGo learns a reactive policy which does nothing which you'd recognize as 'planning' or 'agentiness' - it just maps a grid of numbers (board state) to another grid of numbers (value function estimates of a move's value). A company, beholden to evolutionary imperatives, can implement internal 'markets' with 'agents' if it finds that useful for allocating resources across departments, or use top-down mandates if those work better, but no matter how it allocates resources, it's all in the service of an agent, and any distinction between the 'tool' and 'agent' parts of the company is somewhat illusory.

Comment by gwern on Planned Power Outages · 2019-10-15T01:45:23.114Z · score: 21 (7 votes) · LW · GW

There's an awkward valley between "reasonably reliable, but with a major outage every few years in a storm or something" and "completely reliable, and you can trust your life on it" where the system is reliable enough that we stop thinking of it as something that might go away but it's not so reliable that we should.

Apropos of my other comment on SRE/complex-system failure applications to writing/math, this is a known practice: if a service has been too reliable for a time and hasn't used up its promised 'error budget', it will be deliberately taken down to make sure the promised number of errors happen.

From ch. 4:

An SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI. A natural structure for SLOs is thus SLI ≤ target, or lower bound ≤ SLI ≤ upper bound. For example, we might decide that we will return Shakespeare search results "quickly," adopting an SLO that our average search request latency should be less than 100 milliseconds...Choosing and publishing SLOs to users sets expectations about how a service will perform. This strategy can reduce unfounded complaints to service owners about, for example, the service being slow. Without an explicit SLO, users often develop their own beliefs about desired performance, which may be unrelated to the beliefs held by the people designing and operating the service. This dynamic can lead to both over-reliance on the service, when users incorrectly believe that a service will be more available than it actually is (as happened with Chubby: see "The Global Chubby Planned Outage"), and under-reliance, when prospective users believe a system is flakier and less reliable than it actually is.

"The Global Chubby Planned Outage"

[Written by Marc Alvidrez]

Chubby [Bur06] is Google’s lock service for loosely coupled distributed systems. In the global case, we distribute Chubby instances such that each replica is in a different geographical region. Over time, we found that the failures of the global instance of Chubby consistently generated service outages, many of which were visible to end users. As it turns out, true global Chubby outages are so infrequent that service owners began to add dependencies to Chubby assuming that it would never go down. Its high reliability provided a false sense of security because the services could not function appropriately when Chubby was unavailable, however rarely that occurred.

The solution to this Chubby scenario is interesting: SRE makes sure that global Chubby meets, but does not significantly exceed, its service level objective. In any given quarter, if a true failure has not dropped availability below the target, a controlled outage will be synthesized by intentionally taking down the system. In this way, we are able to flush out unreasonable dependencies on Chubby shortly after they are added. Doing so forces service owners to reckon with the reality of distributed systems sooner rather than later.

...Don’t overachieve:

Users build on the reality of what you offer, rather than what you say you’ll supply, particularly for infrastructure services. If your service’s actual performance is much better than its stated SLO, users will come to rely on its current performance. You can avoid over-dependence by deliberately taking the system offline occasionally (Google’s Chubby service introduced planned outages in response to being overly available), throttling some requests, or designing the system so that it isn’t faster under light loads.

Comment by gwern on Bets and updating · 2019-10-11T01:45:21.693Z · score: 14 (3 votes) · LW · GW

If you update too little, and accept too many bets, you will lose a lot of money to people with better information than you. On the other hand, you can also go too far in the other direction. If your response to being offered a five-cent bet is to immediately update to accept their probabilities (and refuse the bet), you will be very easy to fool (although hard to exploit by betting).

This, incidentally, is a real bookie thing. They worry constantly about informed bettors making money off them, but they also want the 'flow' in order to update their odds. So there's a constant cat-and-mouse between the betting syndicates who have informational edges of various sorts, and the bookmakers, who try to limit the bettors to small amounts while then turning around and updating their odds based on the revealed information (to the bettors' perennial frustration, as the bookmakers can close accounts unilaterally or seize winnings).

Comment by gwern on Who lacks the qualia of consciousness? · 2019-10-07T02:24:53.783Z · score: 9 (4 votes) · LW · GW

Yes, and consciousness. Although not to the same level as humans.

Highly debatable.

Animals also visualize don't they?

How would you know if they didn't and were aphantasic? Did you ask them?

Comment by gwern on Who lacks the qualia of consciousness? · 2019-10-06T23:12:23.701Z · score: 4 (3 votes) · LW · GW

Animals have kinesthetic experiences and emotions too.

Comment by gwern on What are your strategies for avoiding micro-mistakes? · 2019-10-06T17:42:05.021Z · score: 24 (8 votes) · LW · GW

This, I think, may be too domain-specific to really be answerable in any useful way. Anyway, more broadly: when you run into errors, it's always good to think sort of like pilots or sysadmins in dealing with complex system failures - doing research and making errors is certainly a complex system, where there are many steps where errors could be caught. What are the root causes, how did the error start propagating, and what could have been done throughout the stack to reduce it?

  1. constrain the results: Fermi estimates, informative priors, inequalities, and upper/lower bounds are all good for telling you in advance roughly what the results should be, or at least building intuition about what you expect

  2. implement in code or theorem-checker; these are excellent for flushing out hidden assumptions or errors. As Pierce puts it, proving a theorem about your code uncovers many bugs - and it doesn't matter what theorem!

  3. solve with alternative methods, particularly brute force: solvers like Mathematica/Wolfram are great just to tell you what the right answer is so you can check your work. In statistics/genetics, I often solve something with Monte Carlo (or ABC) or brute force approaches like dynamic programming, and only then, after looking at the answers to build intuitions (see: #1), do I try to tackle an exact solution.

  4. test the results: unit test critical values like 0 or the small integers, or boundaries, or very large numbers; use property-based checking (I think also called 'metamorphic testing') like QuickCheck to establish that basic properties seem to hold (like always being positive, monotonic, input same length as output etc); see the toy sketch after this list

  5. ensemble yourself: wait a while and sleep on it, try to 'rubber duck' it to activate your adversarial reasoning skills by explaining it, go through it in different modalities

  6. generalize the results, so you don't have to resolve it: the most bugfree code or proof is the one you never write.

  7. When you run into an error, think about it: how could it have been prevented? If you read something like Site Reliability Engineering: How Google Runs Production Systems or other books about failure in complex systems, you might find some useful inspiration. There is a lot of useful advice: for example, you should have some degree of failure in a well-functioning system; you should keep an eye on 'toil' versus genuinely new work and step back and shave some yaks when 'toil' starts getting out of hand; you should gradually automate manual workflows, perhaps starting from checklists as skeletons

    Do you need to shave some yaks? Are your tools bad? Is it time to invest in learning to use better programs or formalization methods?

    If you keep making an error, how can it be stopped?

    If it's a simple error of formula or manipulation, perhaps you could make a bunch of spaced repetition system flashcards with slight variants all stressing that particular error.

    Is it machine-checkable? For writing my essays, I flag as many errors or warning signs as possible using two scripts and additional tools like linkchecker.

    Can you write a checklist to remind yourself to check for particular errors or problems after finishing?

    Follow the 'rule of three': if you've done something 3 times, or argued the same thesis at length 3 times, etc, it may be time to think about it more closely, automate it, or write down a full-fledged essay. I find this useful for writing because if something comes up 3 times, that suggests it's important and underserved, and also that you might save yourself time in the long run by writing it now. (This is my theory for why memory works in a spaced repetition sort of way: real-world facts seem to follow some sort of long-tailed or perhaps mixture distribution, where there are massed transient facts which can be safely forgotten, and long-term facts which pop up repeatedly with large spacings, so ignoring massed presentation but retaining facts which keep popping up after long intervals is more efficient than simply having memories be strengthened in proportion to total number of exposures.)
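As a toy illustration of point 4, a property-based test using Python's hypothesis library (a QuickCheck descendant); the normalize function and its properties are invented for the example:

```python
from hypothesis import given, strategies as st

def normalize(xs):
    """Shift a list of integers so its minimum becomes zero."""
    lo = min(xs)
    return [x - lo for x in xs]

@given(st.lists(st.integers(), min_size=1))
def test_normalize(xs):
    # State properties that must hold for *all* inputs and let the
    # library hunt for counterexamples, including boundaries like 0
    # and very large integers.
    out = normalize(xs)
    assert len(out) == len(xs)        # input same length as output
    assert all(x >= 0 for x in out)   # always non-negative
    assert min(out) == 0              # minimum pinned to zero

test_normalize()
```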

Comment by gwern on Who lacks the qualia of consciousness? · 2019-10-06T00:47:05.554Z · score: 9 (5 votes) · LW · GW

People with Cotard delusion seem to come close. And there is of course Simon Browne, who explicitly claimed to be "utterly divested of consciousness."

And you have to wonder: severely deficient autobiographical memory (SDAM) seems closely associated with aphantasia, and many people vary drastically in level of 'internal monologue'. If you take someone with no internal monologue, aphantasia, and SDAM, what's left?

Comment by gwern on How good is the case for retraining yourself to sleep on your back? · 2019-10-02T23:46:19.982Z · score: 19 (6 votes) · LW · GW

I looked into this question a number of years ago, as a side-sleeper wondering whether I should try to sleep on my back, which would simplify things considerably. I couldn't find anything worth citing, and concluded either I was looking under the wrong keywords or the desired research doesn't exist. Since my own efforts to switch failed, I gave up and bought a real body pillow, which seemed to help.

Comment by gwern on [Link] What do conservatives know that liberals don't (and vice versa)? · 2019-10-02T17:19:01.582Z · score: 7 (3 votes) · LW · GW

One suggestion would be to datamine the GSS: look for items which most discriminate between partisan affiliation, which would reflect factual claims.
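A sketch of what that datamining might look like with pandas, assuming a hypothetical flat CSV extract of GSS items; PARTYID is the real GSS party-identification variable (0 = strong Democrat ... 6 = strong Republican), but the file name and cutoffs are stand-ins:

```python
import pandas as pd

df = pd.read_csv("gss.csv")   # hypothetical extract, one column per item

# Split respondents by party ID and rank items by how far the two
# groups' mean answers diverge -- the most partisan-discriminating
# items are candidates for differing factual beliefs.
dems = df[df["PARTYID"] <= 2].mean(numeric_only=True)
reps = df[df["PARTYID"] >= 4].mean(numeric_only=True)
gap = (dems - reps).abs().sort_values(ascending=False)
print(gap.head(20))
```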

Comment by gwern on [Site Feature] Link Previews · 2019-10-01T23:33:08.776Z · score: 16 (5 votes) · LW · GW

One problem with link coloring is that link coloring has already been used as semantic annotation for decades by every web browser I am aware of: specifically, whether a link has been visited before or is novel. When I look at your screenshot, I can't read it as 'off vs on site', I can only read it as 'ah, Raemon has not yet read about foreign Sequences'. It's too ingrained, and the colors are fighting >20 years of browser conditioning. Adding more colors to overload coloring doesn't help this, as that makes even more to learn (what would it be, dark green for 'unread on-site', light green for 'read on-site' etc?). It is also somewhat difficult to tune them, and, more obscurely, you have issues with color-blindness (depending on the colors you pick, some fraction of readers, <10%, will struggle or be entirely unable to perceive the difference).

Link icons, on the other hand, are additions rather than overloads, are already used at least somewhat occasionally online, can be understood by anyone who isn't blind (assuming grayscale like mine), and are relatively self-explanatory (assuming good choice of logos).

Comment by gwern on The YouTube Revolution in Knowledge Transfer · 2019-09-30T17:01:06.000Z · score: 13 (7 votes) · LW · GW

I ran into an example of this recently. An older Californian ranger was telling us about his two experiences homesteading by himself, 20 years ago and recently. He learned far more from the second time and had a much better time in general. Why? YouTube! When he hit a problem like a chainsaw not working, he could fire up YouTube and watch videos until he had an idea what to do. This made things far faster and more pleasant, and he learned much more from his time.

I noticed that it sounded very much like 'gamification': what were nigh-insurmountable problems before, leading to getting stuck, are suddenly reduced to difficult but soluble problems which could be tackled one by one with rapid iteration, feedback, and reward.

Comment by gwern on Meetups as Institutions for Intellectual Progress · 2019-09-19T00:25:19.858Z · score: 15 (4 votes) · LW · GW

The role of rapporteur, in its many incarnations from secretary to FAQ maintainer, is an oft-underestimated one.

Comment by gwern on [Site Feature] Link Previews · 2019-09-18T22:08:07.041Z · score: 11 (5 votes) · LW · GW

The basic idea behind the popups is that at Markdown->HTML compile time, I do a lookup in a hashmap of (URL, (Title, Author, Date, DOI, Summary)); if there is an entry, it gets inlined into the HTML as some quiet metadata, and if there isn't, various scripts are called to try to generate metadata, falling back to a headless web browser taking a screenshot and saving it to a file. Then at runtime in the user's browser, some JS checks every link for the metadata; if it exists and the user mouses over, it pops up a form with the metadata filled in. If there is no metadata, it tries to fetch SHA1(URL) from a gwern.net folder, on the assumption that the fallback screenshot will be there.

There are a lot of fiddly details in how exactly you scrape from Arxiv or Pubmed Central or use the WP REST API, unfortunately. So a lot of links must be handled with hand-written definitions.
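A minimal sketch of the compile-time side under those assumptions; the JSON store name and shape are invented stand-ins for the real hashmap:

```python
import hashlib
import json

# Hypothetical metadata store keyed by URL: {url: {title, author, ...}}
try:
    METADATA = json.load(open("linkmetadata.json"))
except FileNotFoundError:
    METADATA = {}

def annotate(url):
    """Return inline metadata if known, else the SHA1 screenshot name."""
    if url in METADATA:
        return {"kind": "metadata", **METADATA[url]}
    # the runtime JS then tries the SHA1-named screenshot as a fallback
    return {"kind": "screenshot",
            "file": hashlib.sha1(url.encode("utf-8")).hexdigest() + ".png"}

print(annotate("https://www.example.com/paper.pdf"))
```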

The relevant files in call order:

Comment by gwern on The unexpected difficulty of comparing AlphaStar to humans · 2019-09-18T21:58:46.300Z · score: 8 (5 votes) · LW · GW

I think it's interesting because once you read it, it's obvious that AS was going to happen and that the approach was scaling (given that OA5 had scaled from a similar starting point, cf 'the bitter lesson'), but at the time, no one in DRL/ML circles even noticed the talk - I only found out about that Vinyals talk after AS came out and everyone was reading back through Vinyals's stuff and noticed he'd said something at BlizzCon (and then after a bunch of searching, I finally found that transcript, since the video is still paywalled by Blizzard). Oh well!

Comment by gwern on The unexpected difficulty of comparing AlphaStar to humans · 2019-09-18T13:39:58.850Z · score: 7 (3 votes) · LW · GW

I'd add http://starcraft.blizzplanet.com/blog/comments/blizzcon-2018-starcraft-ii-whats-next-panel-transcript to the chronology.

Comment by gwern on "AI-GAs: AI-generating algorithms, an alternate paradigm for producing general artificial intelligence", Clune 2019 · 2019-09-10T23:10:16.391Z · score: 3 (1 votes) · LW · GW

I'm not quite sure what you mean. If you want other manifestos for a more evolutionary or meta-learning approach, DM has https://arxiv.org/abs/1903.00742 which lays out a bigger proposal around PBT and other things they've been exploring, if not as all-in on evolution as Uber AI has been for years now.

Comment by gwern on [deleted post] 2019-09-09T02:45:36.747Z

Caplan is correct here. There's no 'far transfer' of the sort which might even slightly resemble 'get a 5% discount on all future fields you study'. (Not that we see anyone who exhibits such an 'educational singularity' in practice, anyway.) At best there might be a sort of meta-study-skill which gives a one-off 'far transfer' effect, like learning how to use search engines or spaced repetition, but it's quickly exhausted and of course just one doesn't give any singularity-esque effect.

A more plausible model would be one with pure near-transfer: every field has a few adjacent fields which give a, say, 5% near-transfer. So one could learn physics/chemistry/biology, for example, in 2.9x the time (1 + 0.95 + 0.95), rather than the 3x it would take 3 individuals learning the 3 fields separately.

Comment by gwern on Concrete experiments in inner alignment · 2019-09-07T02:54:01.347Z · score: 10 (5 votes) · LW · GW

What do you think of using AIXI.js as a testbed like https://arxiv.org/abs/1906.09136 does?

Comment by gwern on August 2019 gwern.net newsletter (popups.js demo) · 2019-09-04T23:49:13.686Z · score: 3 (1 votes) · LW · GW

The ad is an experiment: https://www.gwern.net/Ads#followup-test

The alignment is wrong, yes. I got the CSS id wrong when I set up the second one, I guess. Fixed.

Comment by gwern on August 2019 gwern.net newsletter (popups.js demo) · 2019-09-01T22:45:37.957Z · score: 5 (2 votes) · LW · GW

Obormot's working on it.

Comment by gwern on Link: That Time a Guy Tried to Build a Utopia for Mice and it all Went to Hell · 2019-08-12T21:29:01.453Z · score: 25 (7 votes) · LW · GW

I've summarized my problems with Mouse Utopia: https://www.gwern.net/Questions#mouse-utopia

Comment by gwern on Epistemic Spot Check: The Role of Deliberate Practice in the Acquisition of Expert Performance · 2019-08-09T20:21:16.588Z · score: 8 (4 votes) · LW · GW

Fundamentals of Skill, Welford (1968)

I've uploaded a scan if you want to look.

Comment by gwern on Is there a user's manual to using the internet more efficiently? · 2019-08-05T16:30:58.936Z · score: 3 (1 votes) · LW · GW

Is Net Smart very practical? The introduction sounds more theoretical and generic, and it's a good 7 years old now. (I noticed when I saw references in that link to long-defunct websites like CureTogether.)

Comment by gwern on Is there a user's manual to using the internet more efficiently? · 2019-08-05T16:29:00.706Z · score: 6 (4 votes) · LW · GW

In theory? You just generate a few random samples with the current text as the prefix and display them. In practice, there are already tools to do this: Talk to Transformer does autocomplete. Even better, IMO, is Deep TabNine for programming languages, trained on GitHub.
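For concreteness, a minimal sketch of that sampling loop using the HuggingFace transformers library and the public GPT-2 checkpoint (roughly what Talk to Transformer wraps in a web UI; the prefix and parameters here are arbitrary):

```python
from transformers import pipeline

# Load a small public language model and sample a few continuations
# of the current text, displaying each as an autocomplete candidate.
generator = pipeline("text-generation", model="gpt2")
prefix = "The key problem with link rot is"
for sample in generator(prefix, max_length=40,
                        num_return_sequences=3, do_sample=True):
    print(sample["generated_text"], "\n---")
```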

Comment by gwern on What supplements do you use? · 2019-08-01T22:11:30.421Z · score: 10 (4 votes) · LW · GW

The impression I got from asking people like James about metformin side-effects when I was trying a cost-benefit analysis is that most of them have a quick onset, like the gastrointestinal distress, and if you can't fix one by modifying the dose, you can simply discontinue the drug, i.e. you have option value. This would reduce the EV a little but is not that big a deal. After all, metformin is one of the most (the most?) widely used chronic prescription drugs in the world & regarded as very safe, so the side effects can't be that bad, one would think.

The question of redundancy with other interventions is a more concerning one. Not all the metformin papers are positive in this regard. Here's a small paper suggesting that metformin blunts the benefits of exercise, and "Metformin alters the gut microbiome of individuals with treatment-naive type 2 diabetes, contributing to the therapeutic effects of the drug", Wu et al 2017, suggests part of metformin's benefit comes from changing the microbiome - but of course, exercise or diet or lifestyle changes might also be changing the microbiome in precisely the same way... For diabetics, who have done what little they are able or willing to do, that presumably is not happening enough to cure their diabetes, and so the average metformin effect is still worthwhile; but for those more rigorous about longevity, who knows?

I have similar concerns about baby aspirin and everything postulated to involve inflammation, and perhaps the senolytics as well: they often seem to be hypothesized to act through similar pathways (eg inflammation causes/is caused by senescent cells, some say, but if exercise kills senescent cells by inducing autophagy, doesn't that imply it'd be at least partially redundant with taking a senolytic drug?). I'm not sure what could be done here except to directly test the potential for interactions in factorial experiments.

Comment by gwern on How often are new ideas discovered in old papers? · 2019-07-26T01:29:25.096Z · score: 19 (11 votes) · LW · GW

Citations can be used as the metadata. One of the closest corresponding things in cliometrics is the 'sleeping beauty' paper, which instead of the usual gradual decline in citation rate suddenly sees a big uptick many years afterwards. The recent 'big teams vs small teams' paper discussed sleeping beauty papers a little: https://www.gwern.net/docs/statistics/bias/2019-wu.pdf You could also take multiple discovery as quantifying repetition, since one of the most common ways for a multiple to happen is for it to happen in a different field where it is also important/useful but they haven't heard of the original discovery in the first field.

There's a nice version of this with Ed Boyden on how old papers helped lead to the hot new 'expansion microscopy' thing (funded, incidentally, by OpenPhil): https://medium.com/conversations-with-tyler/tyler-cowen-ed-boyden-neuroscience-3907eccbd4ca

Comment by gwern on The Self-Unaware AI Oracle · 2019-07-24T16:24:40.525Z · score: 5 (2 votes) · LW · GW

GPUs aren't deterministic.

Comment by gwern on RAISE AI Safety prerequisites map entirely in one post · 2019-07-18T00:36:05.160Z · score: 21 (7 votes) · LW · GW

As a historical fact, you certainly can invent selective breeding without knowing anything we would consider true: consider Robert Bakewell and the wildly wrong theories of heredity current when he invented line breeding and thus demonstrated that breeds could be created by artificial selection. (It's unclear what Bakewell and/or his father thought genetics was, but at least in practice, he seems to have acted similarly to modern breeding practices in selecting equally on mothers/fathers, taking careful measurements and taking into account offspring performance, preserving samples for long-term comparison, and improving the environment as much as possible to allow maximum potential to be reached.) More broadly, humans had no idea what they were doing when they domesticated everything; if Richard Dawkins is to be trusted, it seems that the folk-genetics belief was that traits are not inherited and everything regressed to an environmental mean, so one might as well eat one's best plants/animals since it'll make no difference. And even more broadly, evolution has no idea what 'it' is doing for anything, of course.

The problem is, as Eliezer always pointed out, that selection is extremely slow and inefficient compared to design - the stupidest possible optimization process that'll still work within the lifetime of Earth - and comes with zero guarantees of any kind. Genetic drift might push harmful variants up, environmental fluctuations might extinguish lineages, reproductively fit changes which Goodhart the fitness function might spread, nothing stops a 'treacherous turn', evolved systems tend to have minimal modularity and are incomprehensible, evolution will tend to build in instrumental drives which are extremely dangerous if there is any alignment problem (which there will be), sexual selection can drive a species extinct, evolved replicators can be hijacked by replicators on higher levels like memetics, any effective AGI design process will need to learn inner optimizers/mesa-optimizers which will themselves be unpredictable and only weakly constrained by selection, and so on. If there's one thing that evolutionary computing teaches, it's that these are treacherous little buggers indeed (Lehman et al 2018). The optimization process gives you what you ask for, not what you wanted.

So, you probably can 'evolve' an AGI, given sufficient computing power. Indeed, considering how many things in DL or DRL right now take the form of 'we tried a whole bunch of things and X is what worked' (note that a lot of papers are misleading about how many things they tried, and tell little theoretical stories about why their final X worked, which are purely post hoc) and only much later do any theoreticians manage to explain why it (might) work, arguably that's how AI is proceeding right now. Things like doing population-based training for AlphaStar or NAS to invent EfficientNet are just conceding the obvious and replacing 'grad student descent' with gradient descent.

The problem is, we won't understand why they work, won't have any guarantees that they will be Friendly, and they almost certainly will have serious blindspots/flaws (like adversarial examples or AlphaGo's 'delusions' or how OA5/AlphaStar fell apart when they began losing despite playing apparently at pro level before). NNs don't know what they don't know, and neither do we.

Nor are these flaws easy to fix with just some more tinkering. Much like computer security, you can't simply patch your way around all the problems with software written in C (as several decades of endless CVEs have taught us); you need to throw it out and start with formal methods to make errors like buffer overflows impossible. Adversarial examples, for instance: I recall that one conference had something like 5 adversarial defenses, all defined heuristically without proof of efficacy, and all of them were broken between the time of submission and the actual conference. Or AlphaGo's delusions couldn't be fixed despite quite elaborate methods being used to produce Master (which at least had a better ELO) until they switched to the rather different architecture of AlphaZero. Neither OA5 nor AlphaStar has been convincingly fixed that I know of; they simply got better to the point where human players couldn't exploit them without a lot of practice to find reproducible ways of triggering blindspots.

So, that's why you want all the math. So you can come up with provably Friendly architectures without hidden flaws which simply haven't been triggered yet.

Comment by gwern on Against NHST · 2019-07-16T17:15:46.008Z · score: 5 (2 votes) · LW · GW

"From Statistical Significance To Effect Estimation: Statistical Reform In Psychology, Medicine And Ecology", Fidler 2005; a broad but still in depth thesis on the history of NHST and attempts to reform it.

Comment by gwern on How Should We Critique Research? A Decision Perspective · 2019-07-15T22:12:19.701Z · score: 5 (3 votes) · LW · GW

Does the abstract not work for you?

Comment by gwern on Strategic implications of AIs' ability to coordinate at low cost, for example by merging · 2019-07-08T17:06:16.576Z · score: 7 (3 votes) · LW · GW

Avalon is another example, with better current performance.