Posts

[Proposal] Method of locating useful subnets in large models 2021-10-13T20:52:14.455Z
How good is security for LessWrong and the Alignment Forum? 2021-10-04T22:27:24.604Z
Meta learning to gradient hack 2021-10-01T19:25:29.595Z
New GPT-3 competitor 2021-08-12T07:05:49.074Z
Is there work looking at the implications of many worlds QM for existential risk? 2021-06-22T06:06:15.132Z
A simple way to make GPT-3 follow instructions 2021-03-08T02:57:37.218Z

Comments

Comment by Quintin Pope (quintin-pope) on [Prediction] We are in an Algorithmic Overhang, Part 2 · 2021-10-18T23:32:21.529Z · LW · GW

My preferred algorithmic metric would be compute required to reach a certain performance level. This doesn’t really work for hand-crafted expert systems. However, I don’t think those are very informative of future AI trajectories.

Comment by Quintin Pope (quintin-pope) on [Prediction] We are in an Algorithmic Overhang, Part 2 · 2021-10-18T23:22:03.616Z · LW · GW

You can easily combine multiple horses into a “super-equine” transport system by arranging for fresh horses to be available periodically across the journey and pushing each horse to unsustainable speeds.

Also, I don’t think it’s very hard to reach somewhat superhuman performance with BCIs. The difference between keyboards and the BCIs I’m thinking of is that these BCIs can directly modify neurology to increase performance. E.g., modifying motivation/reward to make the brains really value learning about/accomplishing assigned tasks. Consider a company where every employee/manager is completely devoted to company success, fully trusts the others, and engages in very little internal politicking/empire building. Even without anything like brain-level, BCI-enabled parallel problem solving or direct intelligence augmentation, I’m pretty sure such a company would perform far better than any pure human company of comparable size and resources.

Comment by Quintin Pope (quintin-pope) on [Prediction] We are in an Algorithmic Overhang, Part 2 · 2021-10-18T06:41:19.838Z · LW · GW

I think using AI + BCI + human brains will be easier than straight AI for the same reason that it’s easier to finetune pretrained models for a specific task than it is to create a pretrained model. The brain must have pretty general information processing structures, and I expect it’s easier to learn the interface / input encoding for such structures than it is to build human-level AI.

Part of that intuition comes from how adaptable the brain is to injury, new sensory modalities, controlling robotic limbs, etc. Another part of the intuition comes from how much success we’ve seen even with relatively unsophisticated efforts to manipulate brains, such as curing depression.

Comment by Quintin Pope (quintin-pope) on [Prediction] We are in an Algorithmic Overhang, Part 2 · 2021-10-17T21:37:07.842Z · LW · GW

I’m actually working on an AI progress timeline / alignment failure story where the big risk comes from BCI-enabled coordination tech (I've sent you the draft, in case you're interested). I.e., instead of developing superintelligence, the timeline develops models that can manipulate mood/behavior through a BCI, initially as a cure for depression, then gradually spreading through society as a general mood booster / productivity enhancer, and finally being used to enhance coordination (e.g., make everyone super dedicated to improving company profits without destructive internal politics). The end result is that coordination models are trained via reinforcement learning to maximize profits or other simple metrics and gradually remove non-optimal behaviors in pursuit of those metrics.

This timeline makes the case that AI doesn’t need to be superhuman to pose a risk. The behavior modifying models manipulate brains through BCIs with far fewer electrodes than the brain has neurons and are much less generally capable than human brains. We already have a proof of concept that a similar approach can cure depression, so I think more complex modifications like loyalty/motivation enhancement are possible in the not too distant future.

You may also find the section of my timeline addressing the expected rate of AI progress interesting:

My rough mental model for AI capabilities is that they depend on three inputs:

  1. Compute per dollar. This increases at a somewhat sub-exponential rate. The time between 10x increases is increasing. We were initially at ~10x increase every four years, but recently slowed to ~10x increase every 10-16 years (source).
  2. Algorithmic progress in AI. Each year, the compute required to reach a given performance level drops by a constant factor, (so far, a factor of 2 every ~16 months) (source). I think improvements to training efficiency drive most of the current gains in AI capabilities, but they'll eventually begin falling off as we exhaust low hanging fruit.
  3. The money people are willing to invest in AI. This increases as the return on investment in AI increases. There was a time when money invested in AI rose exponentially and very fast, but it’s pretty much flattened off since GPT-3. My guess is this quantity follows a sort of stutter-stop pattern where it spikes as people realize algorithmic/hardware improvements make higher investments in AI more worthwhile, then flattens once the new investments exhaust whatever new opportunities progress in hardware/algorithms allowed.

When you combine these somewhat sub-exponentially increasing inputs with the power-law scaling laws so far discovered (see here), you probably get something roughly linear, but with occasional jumps in capability as willingness to invest jumps.
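
As a concrete illustration, here's a minimal sketch of how these three inputs could combine under a power-law scaling assumption. Every constant below is an illustrative placeholder, not a fitted value:

```python
import numpy as np

# Toy model of the three inputs above (all constants are assumptions).
years = np.arange(0, 30)

# 1. Compute per dollar: ~10x every 12 years (the slower, recent pace).
compute_per_dollar = 10.0 ** (years / 12)

# 2. Algorithmic progress: compute needed for fixed performance halves every
#    ~16 months, i.e. a 2x effective-compute multiplier per 16 months.
algorithmic_multiplier = 2.0 ** (years * 12 / 16)

# 3. Willingness to invest: flat, with occasional 10x jumps.
investment = np.where(years < 10, 1.0, np.where(years < 20, 10.0, 100.0))

effective_compute = compute_per_dollar * algorithmic_multiplier * investment

# Power-law scaling: loss ~ C^(-alpha), so capability measured as -log(loss)
# grows roughly linearly in time, with jumps when investment jumps.
alpha = 0.05
capability = -np.log(effective_compute ** (-alpha))
print(np.round(capability, 2))
```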

I think there's a reasonable case that AI progress will continue at approximately the same trajectory as it has over the last ~50 years.

Comment by Quintin Pope (quintin-pope) on [Prediction] We are in an Algorithmic Overhang, Part 2 · 2021-10-17T21:12:20.428Z · LW · GW

Thanks! I’m pretty sure this isn’t the one I saw, but it works even better for my purposes.

Edit: I'm working on an AI timeline / risk scenario where BCIs and neuro-imitative AI play a big role. I've sent you the draft, in case you're interested.

Comment by Quintin Pope (quintin-pope) on [Prediction] We are in an Algorithmic Overhang, Part 2 · 2021-10-17T20:02:44.216Z · LW · GW

You could also likely build superintelligence by wiring up human brains with brain computer interfaces, then using reinforcement learning to generate some pattern of synchronized activations and brain-to-brain communication that prompts the brains to collectively solve problems more effectively than a single brain can - a sort of AI-guided super-collaboration. That would bypass both the algorithmic complexity and the hardware issues.

The main constraints here are the bandwidths of brain computer interfaces (I saw a publication that derived a Moore’s law-like trend for this, but now can’t find it. If anyone knows where to find such a result, please let me know.) and the difficulty of human experiments.

Comment by Quintin Pope (quintin-pope) on NLP Position Paper: When Combatting Hype, Proceed with Caution · 2021-10-16T04:58:44.587Z · LW · GW

Link is broken for me.

Comment by Quintin Pope (quintin-pope) on Covid 10/14: Less Long Cvoid · 2021-10-15T02:31:08.315Z · LW · GW

Also typo: “reportis” (“is” belongs to a hyperlink, so the typo may look like “report[is” depending on your editor).

Comment by Quintin Pope (quintin-pope) on NVIDIA and Microsoft releases 530B parameter transformer model, Megatron-Turing NLG · 2021-10-12T21:25:19.522Z · LW · GW
  1. Deep learning/backprop has way more people devoted to improving its efficiency than Hebbian learning.
  2. Those 100x slowdown results were for a Hebbian learner trying to imitate backprop, not learn as efficiently as possible.
  3. Why would training GPT-3 on its own output improve it at all? Scaling laws indicate there’s only so much that more training data can do for you, and artificial data generated by GPT-3 would have worse long term coherence than real data.
Comment by Quintin Pope (quintin-pope) on NVIDIA and Microsoft releases 530B parameter transformer model, Megatron-Turing NLG · 2021-10-12T01:18:21.713Z · LW · GW

If you just look at models before GPT-3, the trend line you’d draw is still noticeably steeper than the actual line on the graph. (ELMo and BERT large are below trend while T5 and Megatron 8.3B are above.) The new Megatron would represent the biggest trend line undershoot.

Also, I think any post-COVID speedup will be more than drowned out by the recent slowdown in the rate at which compute prices fall. They were dropping by an OOM every 4 years, but now it’s every 10-16 years.

Comment by Quintin Pope (quintin-pope) on NVIDIA and Microsoft releases 530B parameter transformer model, Megatron-Turing NLG · 2021-10-12T01:08:23.408Z · LW · GW

“…whatever missing algorithms are necessary to do so might turn out to not make the process infeasibly more expensive.”

I think this is very unlikely because the human brain uses WAY more compute than GPT-3 (something like at least 1000x more on low end estimates). If the brain, optimized for efficiency by millions of years of evolution, is using that much compute, then lots of compute is probably required.

Comment by Quintin Pope (quintin-pope) on Shoulder Advisors 101 · 2021-10-12T00:55:40.021Z · LW · GW

It’s definitely possible, though perhaps shoulder advisor is the wrong phrase to use at that point. Maybe it would be better to describe such a practice as a nonverbal mental ritual, rather than using an “agenty” framing.

You picture an incredibly happy crystal that blazes with light and feelings of positivity and acceptance (for this step, it may be helpful to put a cartoonish smiley face on the crystal or to imagine it dancing, hugging you, etc). Then let those feelings radiate out from the crystal and into you, until you primarily feel the emotion as your own. Allow yourself to be happy for the crystal’s happiness. Your own mood should naturally come to reflect that of the crystal as you lean into emulating the crystal’s radiant positivity.

It may also help to picture the crystal as being delighted to share that happiness with you. In this framing, both you and the crystal are happy to share your own joy with the other. Alternate between you sharing happiness with the crystal and the crystal sharing happiness with you, both of you delighted by the other’s joy.

Note that visualisations of the sun, moon, a star, a glowing cloud, etc also work well for this exercise. I find that picturing the light as an ever-shifting rainbow of colors adds some texture and dynamism to the crystal’s emotions. I also have difficulty holding a static image in my head for a long time, and the rainbow effect helps with that.

Comment by Quintin Pope (quintin-pope) on Shoulder Advisors 101 · 2021-10-11T22:05:04.488Z · LW · GW

Possibly just the act of installing supportive shoulder advisors would be helpful. The brain only has so much capacity for shoulder advisors, so earmarking some of that for positive advisors may “clog the channel” so to speak. Bear in mind that shoulder advisors can be more abstract than is discussed here. E.g., you could have a wordless, nameless shard of pure positivity and acceptance.

Also, I expect shoulder advisors have a global positivity parameter that you may be able to influence. When a bad advisor tries to say something bad, stop them and force them to say something good instead, while imagining that the advisor truly believes the good thing. If your shoulder advisor objects to this practice, “correct” their objection and imagine them encouraging you to “remove the maladaptive cognitive pattern my irrational and unwarranted hostility represents”, or something like that.

Comment by Quintin Pope (quintin-pope) on NVIDIA and Microsoft releases 530B parameter transformer model, Megatron-Turing NLG · 2021-10-11T19:43:16.802Z · LW · GW

“…even small differences of 1% means there is a noticeable difference in intelligence when using the models for text generation.”

I wish we had better automated metrics for that sort of subjective quality measure. A user study of subjective quality/usefulness would have been good too. That’s not too much to ask of Microsoft, and since they’re presumably aiming to sell access to this and similar models, it would be good for them to provide some indication of how capable the model feels to humans.

Comment by Quintin Pope (quintin-pope) on NVIDIA and Microsoft releases 530B parameter transformer model, Megatron-Turing NLG · 2021-10-11T19:30:52.938Z · LW · GW

I can only find the four examples near the end of the blog post (just above the "Bias in language models" section).

Comment by Quintin Pope (quintin-pope) on NVIDIA and Microsoft releases 530B parameter transformer model, Megatron-Turing NLG · 2021-10-11T19:21:48.915Z · LW · GW

I don't think that's likely. Rates of increase for model size are slowing. Also, the scaling power-laws for performance and parameter count we've seen so far suggest future progress is likely ~linear and fairly slow. 

Comment by Quintin Pope (quintin-pope) on NVIDIA and Microsoft releases 530B parameter transformer model, Megatron-Turing NLG · 2021-10-11T19:17:04.919Z · LW · GW

 

This seems to reflect a noticeable slowdown in the rate at which language model sizes increase. Compare the trend line you'd draw through the prior points to the one on the graph.

I'm still disappointed at the limited context window (2048 tokens). If you're going to spend millions on training a transformer, may as well make it one of the linear time complexity variants. 

It looks like the Turing NLG models are autoregressive generative models, like the GPTs. So, not good at things like rephrasing text sections based on bidirectional context, but good at unidirectional language generation. I'm confused as to why everyone is focusing on unidirectional models. It seems like, if you want to provide a distinct service compared to your competition, bidirectionality would be the way to go. Then your model would be much better at things like evaluating text content, rephrasing, or grammar checking. Maybe the researchers want to be able to compare results with prior work?

Comment by Quintin Pope (quintin-pope) on Steelman arguments against the idea that AGI is inevitable and will arrive soon · 2021-10-11T05:21:57.323Z · LW · GW

Whoops! Corrected.

Comment by Quintin Pope (quintin-pope) on Secure homes for digital people · 2021-10-10T21:04:15.606Z · LW · GW

You can probably create your own companions. Maybe a modified fork of yourself?

There may also be an open source project that compiles validated and trustworthy digital companions (e.g., aligned AIs or uploads with long, verified track records of good behavior).

Comment by Quintin Pope (quintin-pope) on Secure homes for digital people · 2021-10-10T20:46:14.342Z · LW · GW

Thank you for this post. Various “uploading nightmare” scenarios seem quite salient for many people considering digital immortality/cryonics. It’s good to have potential countermeasures that address such worries.

My concern about your proposal is that, if an attacker can feed you inputs and get outputs, they can train a deep model on your inputs/outputs, then use that model to infer how you might behave under rewind. I expect the future will include deep models extensively pretrained to imitate humans (simulated and physical), so the attacker may need surprisingly little of your inputs/outputs to get a good model of you. Such a model could also use information about your internal computations to improve its accuracy, so it would be very bad to leak such info.

I’m not sure what can be done about such a risk. Any output you generate is some function of your internal state, so any output risks leaking internal state info. Maybe you could use a “rephrasing” neural net module that modifies your outputs to remove patterns that leak personality-related information? That would cause many possible internal states to map onto similar input/output patterns and make inferring internal state more difficult.

You could also try to communicate only with entities that you think will not attempt such an attack and that will retain as little of your communication as possible. However, both those measures seem like they’d make forming lasting friendships with outsiders difficult.

Comment by Quintin Pope (quintin-pope) on The Extrapolation Problem · 2021-10-10T08:26:08.934Z · LW · GW

That they can learn good priors. That’s pretty much what I think happens with pretraining. Learn the prior distribution of data in a domain, then you can adapt that knowledge to many downstream tasks.

Also, I don’t think built-in priors are that helpful. CNNs have a strong locality prior, while transformers don’t. You’d think that would make CNNs much better at image processing. After all, a transformer’s prior is that the pixel at (0,0) is just as likely to relate to the pixel at (512, 512) as it is to relate to its neighboring pixel at (0,1). However, experiments have shown that transformers are competitive with state of the art, highly tuned CNNs. (here, here)

Comment by Quintin Pope (quintin-pope) on The Extrapolation Problem · 2021-10-10T08:10:38.696Z · LW · GW

I was thinking of the NN approximating the “extrapolate” function itself, that thing which takes in partial data and generates an extrapolation. That function is, by assumption, capable of extrapolation from incomplete data. Therefore, I expect a sufficiently precise approximation to that function is able to extrapolate.

It may be helpful to argue from Turing completeness instead. Transformers are Turing complete. If “extrapolation” is Turing computable, then there’s a transformer that implements extrapolation.

Also, the post is talking about what NNs are ever able to do, not just what they can do now. That’s why I thought it was appropriate to bring up theoretical computer science.

Comment by Quintin Pope (quintin-pope) on The Extrapolation Problem · 2021-10-10T07:33:03.227Z · LW · GW

Your point about neural nets NEVER being able to extrapolate is wrong. NNs are universal function approximators. A sufficiently large NN with the right weights can therefore approximate the “extrapolation” function (or even approximate whatever extrapolative model you’re training in place of an NN). I usually wouldn’t bring up this sort of objection since actually learning the right weights is not guaranteed to be feasible with any algorithm and is pretty much assuming away the entire problem, but you said “An algorithm that can't extrapolate is an algorithm that can't extrapolate. Period.”, so I think it’s warranted.

Also, I’m pretty sure “extrapolation” is essentially Bayesian:

  • start with a good prior over the sorts of data distributions you’ll likely encounter
  • update that prior with your observed data
  • generate future data from the resulting posterior

There’s nothing in there that NNs are fundamentally incapable of doing.
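
To make that concrete, here's a minimal numpy sketch of the prior → update → generate recipe, using Bayesian linear regression as a stand-in model (the data and hyperparameters are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Observed data: y = 2x + 1 + noise, seen only on x in [0, 1].
x_obs = rng.uniform(0, 1, size=20)
y_obs = 2 * x_obs + 1 + rng.normal(0, 0.1, size=20)

Phi = np.stack([np.ones_like(x_obs), x_obs], axis=1)   # design matrix [1, x]
noise_var, prior_var = 0.1 ** 2, 10.0                  # assumed hyperparameters

# 1. Prior over weights: N(0, prior_var * I).
# 2. Update the prior with observed data (standard conjugate posterior).
S_inv = np.eye(2) / prior_var + Phi.T @ Phi / noise_var
S = np.linalg.inv(S_inv)
m = S @ Phi.T @ y_obs / noise_var

# 3. Generate predictions well outside the training range (x = 5).
x_new = np.array([1.0, 5.0])
pred_mean = x_new @ m
pred_var = noise_var + x_new @ S @ x_new
print(pred_mean, np.sqrt(pred_var))   # ~11 (i.e. 2*5 + 1), with uncertainty
```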

Finally, your reference to neural ODEs would be more convincing if you’d shown them reaching state of the art results on benchmarks after 2019. There are plenty of methods that do better than NNs when there’s limited data. The reason NNs remain so dominant is that they keep delivering better results as we throw more and more data at them.

Comment by Quintin Pope (quintin-pope) on Shoulder Advisors 101 · 2021-10-10T05:39:11.983Z · LW · GW

On the one hand, I agree that potential side effects are important. Shoulder advisors seem very similar to tulpas, and mental health disorders are very common (~50% or so) in the tulpamancy community. Though this paper argues that this is because mental health issues cause people to be drawn towards tulpamancy, and that tulpamancy can benefit those with mental illnesses. Of course, people who had negative experiences with tulpas would likely leave the community and not be available for surveys.

On the other hand, shoulder advisors and tulpas are fundamentally exercises in prediction. Your perception of the practice influences the results you’ll get. If you create shoulder advisors with the assumption that they’ll go wrong immediately, your odds of a beneficial outcome drop. It’s thus important not to over-emphasize potential negatives.

Comment by Quintin Pope (quintin-pope) on How to think about and deal with OpenAI · 2021-10-09T23:25:51.982Z · LW · GW

Well, the big, obvious, enormous difference between current deep models and the human brain is that the brain uses WAY more compute. Even the most generous estimates put GPT-3 at using something like 1000x less compute than the brain. OpenAI demonstrated, quite decisively, that increasing model size leads to increasing performance.

Also, generating dialog in video games is NOT trivial (and is well beyond GPT-3's capabilities). Any AI capable of that would be enormously valuable, since it would need a generalized grasp of language close to human-level proficiency and could be adapted to many text generation tasks (novel/script writing, customer service, chatbot based substitutes for companionship, etc).

Comment by Quintin Pope (quintin-pope) on Shoulder Advisors 101 · 2021-10-09T23:06:39.495Z · LW · GW

Thanks for this great post!

TL;DR for my own thoughts:

  • I speculate on why shoulder advisors are useful
    • Drawing from ensemble methods in machine learning
    • Drawing from predictive processing and overfitting
  • I discuss generating a common corpus of training data for useful shoulder advisors
  • I discuss a few archetypes for useful shoulder advisors and their uses

Ensembles

Shoulder advisors seem to mirror a common machine learning technique, ensembling, which combines multiple ML models to get better overall performance than any individual model can reach. E.g., an ensemble of ERNIE models holds the current first place on the GLUE leaderboard (a metric for evaluating the general capabilities of language models). Shoulder advisors let you sort of ensemble thoughts across different personalities. Ensemble approaches are most helpful when the ensembled population is diverse and each model tends to specialize in particular types of tasks. That matches your usefulness criteria fairly well.
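
For a toy picture of why this helps, here's a scikit-learn sketch (models and data chosen arbitrarily); because the members' errors are partly uncorrelated, the soft-voting ensemble typically scores at least as well as its average member:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy data plus three diverse members with different inductive biases.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
members = [
    ("linear", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier()),
    ("tree", DecisionTreeClassifier(random_state=0)),
]
ensemble = VotingClassifier(estimators=members, voting="soft")

for name, model in members + [("ensemble", ensemble)]:
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name:>8}: {score:.3f}")
```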

Predictive processing

If we extend predictive processing theory to internal personality traits, then our own personalities are generated by a predictive process, presumably one that bootstraps itself by predicting behavior and emotional reactions from family and friends in childhood ("other-prediction") before specializing in predicting/generating our own thoughts and emotions ("self-prediction"). Under this view, we use broadly similar neural circuitry for self-prediction, other-prediction, and generating shoulder advisors. Presumably, these three processes share common features in the brain, but with their own sets of neurons that specialize in each.

We spend far more time on self-prediction than on other-prediction, and have far weaker external signals while doing so. It's possible that the neural circuits specializing in self-prediction "overfit" more strongly than the circuits specializing in other-prediction. If you're repeatedly taking counterproductive actions, the negative reward signal from doing so may not be enough to push the overfit circuits out of predicting/generating that behavior.

Generating shoulder advisors, as an intermediate between self-prediction and other-prediction, may counteract such overfitting by prompting your self-prediction neurons to interact more strongly with your other-prediction neurons. This allows the more general features and patterns learned by the other-prediction neurons to more easily feed into your own behavior and allows you to more easily pick up useful strategies and drop maladaptive behavior. In this view, it may be useful to continually rotate shoulder advisors, so that your self-prediction circuits receive constantly evolving feedback from your other-prediction circuits.

Training corpus

If shoulder advisors are beneficial, we should systematically aim to further improve their quality and the ease of generating them. One option would be to create a set of shoulder advisors whose personalities and mannerisms are optimized to fulfill common needs. We could then compile a corpus of "training data" for each advisor, meaning text describing the advisor and showing their responses to a variety of situations. A person who wants to "install" a particular advisor then reads the training data while having their current instantiation of the advisor predict how they'd behave in each situation supplied by the training data.

Archetypes 

Here are a few shoulder advisor personality archetypes and associated advisor uses we might consider:

  • "The friend"
    • Traits:
      • Friendly, kind, empathetic, warm
      • Supportive, encouraging
      • Calm, equanimous, happy
      • A deep feeling of beneficence towards you
    • Uses 
      • Emotional wellbeing
      • Relaxation
      • Promoting interpersonal empathy
      • Promoting positive self-worth
  • "The rationalist"
    • Traits:
      • Brilliant, analytical, incisive
      • Curious, widely-read, interested in knowledge
      • Quick to change mind in response to evidence
      • Quick to acknowledge mistaken cognition without undue emotional complications
    • Uses:
      • Analyzing data, problem solving, coming up with and evaluating new ideas
      • Learning new things
      • Motivation to read scientific papers
      • Actually changing your mind, recognizing mistakes
  • "The socialite"
    • Traits:
      • Friendly, sociable, outgoing, chatty
      • Empathetic, interested in others' perspectives
    • Uses:
      • Navigating social situations
      • Overcoming social awkwardness/anxiety
  • "The determinator"
    • Traits:
      • Focused, determined
      • Unstoppable, absolute
      • Immense pain tolerance, little concern for own suffering
    • Uses:
      • Motivation to exercise/do chores/work
      • Pushing through unpleasantness, dealing with hardship
Comment by Quintin Pope (quintin-pope) on Steelman arguments against the idea that AGI is inevitable and will arrive soon · 2021-10-09T07:56:25.194Z · LW · GW

I think longish timelines (>= 50 years) are the default prediction. My rough mental model for AI capabilities is that they depend on three inputs:

  1. Compute per dollar. This increases at a somewhat sub-exponential rate. The time between 10x increases is increasing. We were initially at ~10x increase every four years, but recently slowed to ~10x increase every 10-16 years (source).
  2. Algorithmic progress in AI. Each year, the compute required to reach a given performance level drops by a constant factor, (so far, a factor of 2 every ~16 months) (source). I think improvements to training efficiency drive most of the current gains in AI capabilities, but they'll eventually begin falling off as we exhaust low hanging fruit.
  3. The money people are willing to invest in AI. This increases as the return on investment in AI increases. There was a time when money invested in AI rose exponentially and very fast, but it’s pretty much flattened off since GPT-3. My guess is this quantity follows a sort of stutter-stop pattern where it spikes as people realize algorithmic/hardware improvements make higher investments in AI more worthwhile, then flattens once the new investments exhaust whatever new opportunities progress in hardware/algorithms allowed.

When you combine these somewhat sub-exponentially increasing inputs with the power-law scaling laws so far discovered (see here), you probably get something roughly linear, but with occasional jumps in capability as willingness to invest jumps.

We've recently seen a jump, but progress has stalled since then. GPT-2 to GPT-3 was 16 months. It's been another 16 months since GPT-3, and the closest thing to a GPT-3 successor we've seen is Jurassic-1, but even that's only a marginal improvement over GPT-3.

Given the time it took us to reach our current capabilities, human level AGI is probably far off.

Comment by Quintin Pope (quintin-pope) on Meta learning to gradient hack · 2021-10-08T05:48:52.817Z · LW · GW

I checked the intermediate network activations. It turns out the meta-learned network generates all-negative pre-ReLU activations just before the final linear layer, so the ReLU zeroes them out and the final layer outputs only its bias, regardless of the initial network input.

I’ve begun experiments with flipped base and meta functions (network initially models sin(x) and resists being retrained to model f(x) = 1).
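
For anyone who wants to run a similar check, here's a sketch of the diagnostic I mean, using PyTorch forward hooks on a stand-in architecture (not the actual meta-learned network, so on a random init these checks will normally print False):

```python
import torch
import torch.nn as nn

# Stand-in MLP; the real check would be run on the trained meta-learned network.
net = nn.Sequential(
    nn.Linear(1, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),
)

captured = {}
def save_pre_relu(module, inputs, output):
    # inputs[0] is the pre-activation feeding the last ReLU.
    captured["pre_relu"] = inputs[0].detach()

net[3].register_forward_hook(save_pre_relu)  # net[3] is the final ReLU

x = torch.linspace(-5, 5, 200).unsqueeze(1)
with torch.no_grad():
    y = net(x)

# If all pre-ReLU activations are negative, the ReLU outputs zeros and the
# network's output collapses to the final layer's bias for every input.
print("all pre-ReLU activations negative:", (captured["pre_relu"] < 0).all().item())
print("output equals final bias everywhere:", torch.allclose(y, net[-1].bias.expand_as(y)))
```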

Comment by Quintin Pope (quintin-pope) on Meta learning to gradient hack · 2021-10-08T05:47:36.140Z · LW · GW

I checked the intermediate network activations. It turns out the meta-learned network generates all-negative pre-ReLU activations just before the final linear layer, so the ReLU zeroes them out and the final layer outputs only its bias, regardless of the initial network input. You’re right about it only working for constant functions, due to ReLU saturation and not changes to the batchnorm layers.

I’ve begun experiments with flipped base and meta functions (network initially models sin(x) and resists being retrained to model f(x) = 1).

Comment by Quintin Pope (quintin-pope) on How much memory is reserved for cute kitten pictures? · 2021-10-04T23:18:35.246Z · LW · GW

Look at ImageNet (https://image-net.org/index.php) tags and find the percent of them that are kitten pictures. The International Data Corporation estimates there are around 6.8 zettabytes of storage globally (https://www.idc.com/getdoc.jsp?containerId=prUS46303920). Now we just need the fraction of total storage dedicated to consumer images. Maybe 2%?

I’d guess something like (0.1% kitten pictures) x (2% consumer images) x (6.8 zettabytes) ≈ 136,000 terabytes of kitten images.
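
Spelling out the arithmetic (all the percentages above are guesses):

```python
# 1 zettabyte = 1e9 terabytes
global_storage_tb = 6.8 * 1e9        # IDC estimate of global storage
consumer_image_fraction = 0.02       # guessed 2% of storage is consumer images
kitten_fraction = 0.001              # guessed 0.1% of those are kitten pictures

kitten_tb = global_storage_tb * consumer_image_fraction * kitten_fraction
print(f"{kitten_tb:,.0f} TB of kitten images")   # ~136,000 TB
```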

Comment by Quintin Pope (quintin-pope) on Meta learning to gradient hack · 2021-10-02T01:02:10.238Z · LW · GW

Thanks for the feedback! I use batch norm regularisation, but not dropout.

I just tried retraining the 100,000 cycle meta-learned model in a variety of ways, including for 10,000 steps with 10,000x higher lr, using resilient backprop (which updates each weight by an adaptive per-weight step size that’s multiplied up or down), and using an L2 penalty to decrease weight magnitude. So far, nothing has gotten the network to model the base function. The L2 penalty did reduce weight values to ~the normal range, but the network still didn’t learn the base function.
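
For reference, a sketch of what those retraining attempts look like in PyTorch, on a stand-in network with illustrative hyperparameters (the real experiment used the meta-learned model from the post):

```python
import torch
import torch.nn as nn

# Stand-in network; the base function is f(x) = 1.
net = nn.Sequential(nn.Linear(1, 64), nn.ReLU(),
                    nn.Linear(64, 64), nn.ReLU(),
                    nn.Linear(64, 1))

x = torch.linspace(-3, 3, 256).unsqueeze(1)
y_target = torch.ones_like(x)

optimizer = torch.optim.Rprop(net.parameters())   # resilient backprop
l2_coeff = 1e-3                                   # explicit L2 penalty on weights

for step in range(10_000):
    optimizer.zero_grad()
    pred = net(x)
    l2 = sum((p ** 2).sum() for p in net.parameters())
    loss = nn.functional.mse_loss(pred, y_target) + l2_coeff * l2
    loss.backward()
    optimizer.step()

print("final MSE to base function:", nn.functional.mse_loss(net(x), y_target).item())
```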

I now think the increase in weight values is just incidental and that the meta learner found some other way of protecting the network from SGD.

Comment by Quintin Pope (quintin-pope) on Proposal: Scaling laws for RL generalization · 2021-10-01T22:41:05.759Z · LW · GW

This seems like a very good research direction. I’m not too familiar with RL, so I probably won’t pursue it myself, though. I do have three suggestions:

  1. For testing out of distribution transfer, one option is to move the agents to a different environmental simulator. I expect this will do a better job of representing the distributional shift incurred by deploying into the real world.
  2. Consider feeding the agents their goals via natural language instruction, and accompany their RL training with BERT-like language modeling. Agents trained to follow natural language instructions seem vastly more useful (and slightly more aligned) than agents limited to receiving instructions via bitcodes, which is what DeepMind did. I expect future versions of XLand-like RL pretraining will do something like this.
  3. Consider using the Perceiver IO architecture (https://arxiv.org/abs/2107.14795). It’s a new variant transformer with linear time complexity in its input length, is explicitly designed to smoothly handle inputs of arbitrary dimensions and mixed modalities, and can also produce outputs of arbitrary dimensions and mixed modalities. I think it will turn out to be far more flexible than current transformers/CNNs.
Comment by Quintin Pope (quintin-pope) on How do you decide when to change N95/FFP-2 masks? · 2021-09-16T13:39:50.189Z · LW · GW

I see no reason to doubt their claims about inactivating viruses on the mask. However, at ~$8 per mask, it would be cheaper to just use one normal N95 per day than to use one of these for 3 days. I expect the antiviral masks will also lose filter efficacy and fit quality with reuse. Also, I don’t think self inoculation is very likely if you’re careful about handling the mask and wash your hands after. So, it’s probably overall safer to use one N95 a day than to reuse an antiviral mask for 3 days.

Comment by Quintin Pope (quintin-pope) on How do you decide when to change N95/FFP-2 masks? · 2021-09-11T18:41:49.142Z · LW · GW

You're right. I'll fix that.

Comment by Quintin Pope (quintin-pope) on How do you decide when to change N95/FFP-2 masks? · 2021-09-10T22:01:55.503Z · LW · GW

People rarely talk, laugh or scream on public transport, so the risk is much lower compared to somewhere like a bar or hospital. Also, I’m talking about relative contamination levels. Even if you’re only lightly exposed for 20 minutes, the concentration of virus on your mask is probably ~hundreds of times higher than the concentration on your clothes.

Consider the volume of air you breathe in 20 min. 95% of the virus in that air is now on your mask. Compare that to the volume of virus that settles out of the air onto your clothes. Considering COVID can remain in air for hours, that amount is likely much smaller.

Comment by Quintin Pope (quintin-pope) on How do you decide when to change N95/FFP-2 masks? · 2021-09-10T21:54:23.123Z · LW · GW

I address that in the general comments section. Exhalation valves do make N95s worse at source control compared with N95s without a valve. However, an N95 with a valve is still about as good at protecting people around you as a cloth mask or surgical mask.

Comment by Quintin Pope (quintin-pope) on How do you decide when to change N95/FFP-2 masks? · 2021-09-10T20:52:25.664Z · LW · GW

Thank you for the question. I'll add my response to my answer.

Keep in mind that the filter surfaces (where the air flows through) of the mask may have spent hours collecting COVID particles from the atmosphere, since you've been continuously pulling contaminated air through the mask. The filter surfaces may have thousands of times the level of contamination typically seen on solid surfaces. It's best to handle potentially contaminated masks by the straps, especially when removing the mask. If you absolutely must touch the mask itself, avoid touching the filter surfaces, and instead touch a portion of the mask's edge that's away from your mouth and eyes. However, virus can still potentially transfer to your hands, even with proper handling (source). Thus, you should wash your hands after handling a potentially contaminated mask.

Reaerosolization of filtered particles is possible, but seems to only occur to a significant degree when the humidity is low and the particles in question are large and dry (source, source). Virus particles are typically either small and dry or large and wet (when suspended in water droplets), so I don't think this is the primary concern. I guess avoid inhaling too close to the mask's outer surface if you're worried about this.

Comment by Quintin Pope (quintin-pope) on Paths To High-Level Machine Intelligence · 2021-09-10T20:08:16.350Z · LW · GW

Thank you for this excellent post. Here are some thoughts I had while reading.

The hard paths hypothesis:

I think there's another side to the hard paths hypothesis. We are clearly the first technology-using species to evolve on Earth. However, it's entirely possible that we're not the first species with human-level intelligence. If a species with human level intelligence but no opposable thumbs evolved millions of years ago, they could have died out without leaving any artifacts we'd recognize as signs of intelligence.

Besides our intelligence, humans seem odd in many ways that could plausibly contribute to developing a technological civilization.

  • We are pretty long-lived.
  • We are fairly social.
    • Feral children raised outside of human culture experience serious and often permanent mental disabilities (Wikipedia).
    • A species with human-level intelligence, but whose members live mostly independently may not develop technological civilization.
  • We have very long childhoods.
  • We have ridiculously high manual dexterity (even compared to other primates).
  • We live on land.

Given how well-tuned our biology seems for developing civilization, I think it's plausible that multiple human-level intelligent species arose in Earth's history, but additional bottlenecks prevented them from developing technological civilization. However, most of these bottlenecks wouldn't be an issue for an intelligence generated by simulated evolution. E.g., we could intervene in such a simulation to give low-dexterity species other means of manipulating their environment. Perhaps Earth's evolutionary history actually contains n human-level intelligent species, only one of which developed technology. That implies the true compute required to evolve human-level intelligence is far lower.

Brain imitation learning:

I also think the discussion of neuromophic AI and whole brain emulation misses an important possibility that Gwern calls "brain imitation learning". In essence, you record a bunch of data about human brain activity (using EEG, implanted electrodes, etc.), then you train a deep neural network to model the recorded data (similar to how GPT-3 or BERT model text). The idea is that modeling brain activity will cause the deep network to learn some of the brain's neurological algorithms. Then, you train the deep network on some downstream task and hope its learned brain algorithms generalize to the task in question.

I think brain imitation learning is pretty likely to work. We've repeatedly seen in deep learning that knowledge distillation (training a smaller student model to imitate a larger teacher model) is FAR more computationally efficient than trying to train the student model from scratch, while also giving superior performance (Wikipedia, distilling BERT, distilling CLIP). Admittedly, brain activity data is pretty expensive. However, the project that finally builds human-level AI will plausibly cost billions of dollars in compute for training. If brain imitation learning can cut the price by even 10%, it will be worth hundreds of millions in terms of saved compute costs.

Comment by Quintin Pope (quintin-pope) on The Best Software For Every Need · 2021-09-10T14:36:21.012Z · LW · GW

Software: SensorLog

Need: IOS app for continuously recording iPhone sensor data at all times.

Other programs I've tried: Toolbox - Smart Meter Tools, Sensors Toolbox - Multitool, phyphox, Physics Toolbox Sensor Suite, Gauges

I've tried many apps that let you see sensor data from your iPhone, but SensorLog is the first that lets you log gigabytes of data in the background continuously for multiple days. Ironically, it's also one of the smallest apps I've used, at just 2.2 MB. My only issue with it is that the average audio dB logs seem to be bugged for long-term recordings.

Comment by Quintin Pope (quintin-pope) on How do you decide when to change N95/FFP-2 masks? · 2021-09-10T13:59:54.195Z · LW · GW

General Discussion:

Broadly speaking, there are five issues to worry about when reusing masks:

  1. Virus particles contaminate the mask's surface, and may spread to you while handling the mask.
    • Mask filter surfaces (where the air flows through) may have spent hours collecting COVID particles from the air, since you've been continuously pulling contaminated air through the mask. The filter surfaces may have hundreds or thousands of times the contamination seen on solid surfaces.
    • Reaerosolization of filtered particles (where particles trapped by the mask re-enter the air) is possible, but likely releases negligible amounts of virus (source, source) compared to the ~5% an N95 mask fails to stop.
    • There are a number of approaches for decontaminating masks. For coronavirus specifically, the simplest approach is to just let the masks sit. The time necessary to inactivate COVID virions depends on the temperature and humidity. Options include: (source)
      • 4 days at 21-23 °C, 40% humidity (However, this source indicates virions may be present after 6 days)
      • 1 hour at 70 °C, any humidity (using, e.g., an oven)
      • Boiling water for 5 min (may lose ~8% filtration efficacy, but also cleans mask of dirt)
    • UV-C radiation can also decontaminate masks. However, this process is potentially unreliable because the UV intensity needed varies with mask material, masks with unusual geometry may shadow portions of the mask from treatment, and dirt or other soilage may block radiation (source). Make sure to use >= 1 J/cm^2 for >=1 minute (source). Don't use > 10 J/cm^2 to avoid damaging mask structure.
    • Chemical agents such as ethanol and bleach may reduce mask filtration (source).
  2. Loss of mask structure prevents a good fit to your face.
    • Generally, it's hard to properly fit an N95. Among 74 anesthesiologists, 63% of women and 29% of men failed fit testing, even with a fresh respirator (source). Overall failure rates were 43% after 4 days, 50% after 10 days and 55% after 15 days. Additionally, people were very bad at estimating the quality of their fits, with 73% of those who failed the test thinking they had a good fit.
  3. Loss of electrostatic charge worsens filtration efficacy.
    • N95 masks don't lose much efficacy if they're just stored, even for years at a time (source). However, they do eventually lose efficacy if they're actually used. With 8 hours per day of use, N95 masks retain ~95% efficacy after 3 days, ~92% efficacy after 5 days, and drop to ~80% efficacy after 14 days (source). Note: this refers to just the material's filtration efficacy, and does not take into account any further reduction due to worsened fit quality.
    • Mask electrostatic charge degrades more quickly in humid environments (source). Thus, a mask with an exhalation valve will likely last longer. An N95 respirator with exhalation valve is likely as effective at source control (preventing spread from you to others) as a cloth or surgical mask (source), but many establishments (such as airlines) do not allow masks with exhalation valves.
    • If you want to get fancy, this paper describes a procedure for recharging a mask's electrostatic potential. However, that won't help with the loss of structure issue.
  4. Accumulation of filtered particulate makes the mask harder to breathe through and makes inhaled air more likely to pass around the mask rather than through it.
    • I don't think this is usually an issue because loss of structure/efficacy will force you to change masks more quickly than the masks get clogged. However, if you're in a dusty/smoky location, it could be a problem. I suggest changing out a mask as soon as you notice it getting more difficult to breathe. You can also wear a surgical mask over the N95 to protect it from larger contaminants.
  5. Accumulation of sweat/dirt/etc makes the mask disgusting to wear.
    • I suppose this is up to personal preference.

Final Recommendation:

I'd suggest replacing an N95 mask at least once every 5 days, and preferably once every 3 days. I'd suggest 1 hour at 70 °C for decontamination. Additionally:

  • I'd recommend using a mask with an exhalation valve, if you can.
  • I'd recommend storing masks in a low-humidity environment while not using them.
  • It's best to handle potentially contaminated masks by the straps, especially during removal. If you absolutely must touch the mask itself, avoid touching filter surfaces or interior, and instead touch the edge of the mask somewhere that's away from your mouth and eyes.
  • Virus can still transfer to your hands, even with proper handling (source). You should wash your hands after handling a potentially contaminated mask.
  • You should never stack potentially contaminated masks. I.e., don't allow the filter surface of one mask to be in contact with the interior of another.
  • Wearing a cloth/surgical mask over the N95 will help protect it from splashes and large contaminants. However, this may accelerate the loss of electrostatic charge by increasing the humidity within the mask. Do this if you think your N95 may get spoiled otherwise.
  • If decontamination is a chore, one option would be to use a rotating set of masks, wearing one a day sequentially until you've worn them all once, then decontaminating the entire set using an oven.

Additionally, you may want to consider alternatives to N95s. Half-face elastomeric respirators are designed to be reusable, are far more protective than even a properly fitted N95, are much easier to fit properly, and I personally found them much more comfortable than expected. Additionally, they only require replacement filters when breathing becomes difficult, so they cost less in the long run.

Finally, at the highest tier of protection, you can buy powered air purifying respirators for $300 or make your own for $15-30. I don't have any experience with either option, so I can't comment much.

Comment by Quintin Pope (quintin-pope) on Gradient descent is not just more efficient genetic algorithms · 2021-09-10T06:52:49.184Z · LW · GW

There should be a fair bit more than 2 epsilon of leeway in the line of equality. Since the submodules themselves are learned by SGD, they won’t be exactly equal. Most likely, the model will include dropout as well. Thus, the signals sent to the combining function are almost always more different than the limits of numerical precision allow. This means the combining function will need quite a bit of leeway, otherwise the network’s performance is always just zero.

Comment by Quintin Pope (quintin-pope) on Why the technological singularity by AGI may never happen · 2021-09-03T16:01:04.218Z · LW · GW

I think this is plausible, but maybe a bit misleading in terms of real-world implications for AGI power/importance.

Looking at the scaling laws observed for language model pretraining performance vs model size, we see strongly sublinear increases in pretraining performance for linear increases in model size. In figure 3.8 of the GPT-3 paper, we also see that zero/few/many-shot transfer performance on SuperGLUE benchmarks scales sublinearly with model size.
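
For a sense of how strong the sublinearity is, here's the commonly cited parameter-count power law from Kaplan et al. (2020); treat the constants as approximate:

```python
# L(N) ≈ (N_c / N)^alpha, with alpha ≈ 0.076 and N_c ≈ 8.8e13 (approximate
# published fits for non-embedding parameter count).
alpha, N_c = 0.076, 8.8e13

for n_params in [1e9, 1e10, 1e11, 1e12]:
    loss = (N_c / n_params) ** alpha
    print(f"{n_params:.0e} params -> loss {loss:.2f}")

# A 10x increase in parameters only multiplies the loss by 10**(-0.076) ≈ 0.84,
# i.e. roughly a 16% reduction per order of magnitude of model size.
```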

However, the economic usefulness of a system depends on a lot more than just parameter count. Consider that Gorillas have 56% as many cortical neurons as humans (9.1 vs 16.3 billion; see this list), but a human is much more than twice as economically useful as a gorilla. Similarly, a merely human level AGI that was completely dedicated to accomplishing a given goal would likely be far more effective than a human. E.g., see the appendix of this Gwern post (under "On the absence of true fanatics") for an example of how 100 perfectly dedicated (but otherwise ordinary) fanatics could likely destroy Goldman Sachs, if each were fully willing to dedicate years of hard work and sacrifice their lives to do so.

Comment by Quintin Pope (quintin-pope) on Thoughts on gradient hacking · 2021-09-03T15:20:44.909Z · LW · GW

If gradient hacking is thought to be possible because gradient descent is a highly local optimization process, maybe it would help to use higher-order approaches. E.g., Newton's method uses second order derivative information, and the Householder methods use even higher order derivatives.

These higher order methods aren't commonly used in deep learning because of their additional computational expense. However, if such methods can detect and remove mechanisms of gradient hacking that are invisible to gradient descent, it may be worthwhile to occasionally use higher order methods in training.
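
As a toy illustration of what the extra derivative information buys you (not a gradient-hacking detector, just the basic mechanics): on a badly conditioned quadratic, gradient descent crawls along the flat direction, while a single Newton step, which rescales the gradient by the inverse Hessian, lands on the optimum.

```python
import numpy as np

# Objective: f(w) = w0^2 + 100 * w1^2 (very different curvature per axis).
def grad(w):
    return np.array([2.0 * w[0], 200.0 * w[1]])

def hessian(w):
    return np.diag([2.0, 200.0])

w_gd = np.array([1.0, 1.0])
w_newton = np.array([1.0, 1.0])

for _ in range(10):
    w_gd = w_gd - 0.005 * grad(w_gd)                                           # first-order step
    w_newton = w_newton - np.linalg.solve(hessian(w_newton), grad(w_newton))   # Newton step

print("gradient descent:", w_gd)      # w0 has barely moved
print("Newton's method :", w_newton)  # at the optimum (0, 0) after one step
```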

Comment by Quintin Pope (quintin-pope) on Randal Koene on brain understanding before whole brain emulation · 2021-08-25T22:50:30.744Z · LW · GW

I think it’s plausible we’ll be able to use deep learning to model a brain well before we understand how the brain works.

  1. Record a ton of brain activity + human behaviour with a brain computer interface and wearable recording devices, respectively.
  2. Train a model to predict future brain activity + behaviour, conditioned on past brain activity + behaviour.
  3. Continue running the model by feeding it its own predicted brain activity + behaviour as the conditioning data for future predictions.

Congratulations, you now have an emulated human. No need to understand any brain algorithms. You just need tons of brain + behaviour data and compute. I think this will be possible before non brain-based AGI because current AI research indicates it’s easier to train a model by distilling/imitating an already trained model than it is to train from scratch, e.g., DistilBERT: https://arxiv.org/abs/1910.01108v4
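
Here's a schematic sketch of steps 1-3 with a tiny recurrent model; the "recordings" are random placeholders standing in for concatenated (brain activity + behaviour) feature vectors, and a real attempt would need enormously more data and model capacity:

```python
import torch
import torch.nn as nn

feature_dim, seq_len, batch = 32, 100, 16
recordings = torch.randn(batch, seq_len, feature_dim)     # step 1 (placeholder data)

model = nn.GRU(feature_dim, 128, batch_first=True)
readout = nn.Linear(128, feature_dim)
opt = torch.optim.Adam(list(model.parameters()) + list(readout.parameters()), lr=1e-3)

# Step 2: train the model to predict the next timestep from the past.
for _ in range(200):
    opt.zero_grad()
    hidden, _ = model(recordings[:, :-1])
    loss = nn.functional.mse_loss(readout(hidden), recordings[:, 1:])
    loss.backward()
    opt.step()

# Step 3: run the model on its own predictions (autoregressive rollout).
with torch.no_grad():
    state = recordings[:1, :10]            # seed with a short recorded prefix
    out, h = model(state)
    for _ in range(50):
        next_step = readout(out[:, -1:])   # predicted next (brain + behaviour) vector
        state = torch.cat([state, next_step], dim=1)
        out, h = model(next_step, h)       # condition future predictions on the prediction

print(state.shape)   # torch.Size([1, 60, 32]): simulated continuation
```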

Comment by Quintin Pope (quintin-pope) on Dangerous Virtual Worlds · 2021-08-11T22:39:48.429Z · LW · GW

Do you view art, literature, meditation or pet care similarly?

Comment by Quintin Pope (quintin-pope) on Dangerous Virtual Worlds · 2021-08-11T17:06:01.517Z · LW · GW

I thought that if things got significantly more intense I might have a heart attack and die!

I was initially skeptical that this was a risk worth considering. I've heard anecdotes of people dying of excitement, but it seemed like a "shark attack" sort of risk that's more discussed than experienced. However, some Googling revealed "Cardiovascular Events during World Cup Soccer", which finds that cardiac incidents were 2.66x higher on days the German team competed during the 2006 soccer world cup. FIFA's website says an average of ~21.9 million people watched each match. This website says Germany had a population of 81,472,235 in 2006.

If we attribute 100% of the 2.66x increase to 21.9 million soccer fans being more excited on those days (as opposed to getting less sleep, drinking more alcohol, etc.), then we get (CV_risk_x * 21.9 + 59.57) / 81.47 = 2.66, so CV_risk_x = 7.18x higher risk due to extreme excitement. If we arbitrarily attribute 33% of the increase to excitement, we get (CV_risk_x * 21.9 + 59.57) / 81.47 = 1.548, and CV_risk_x = 3.04x.
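
The same arithmetic in code (population figures in millions):

```python
viewers, non_viewers, total = 21.9, 59.57, 81.47
observed_multiplier = 2.66

# Attribute 100% of the increase to excited viewers:
risk_100 = (observed_multiplier * total - non_viewers) / viewers
# Attribute 33% of the increase to excitement (an overall ~1.548x day):
risk_33 = ((1 + 0.33 * (observed_multiplier - 1)) * total - non_viewers) / viewers

print(round(risk_100, 2), round(risk_33, 2))   # 7.18, 3.04
```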

That's higher than I expected, but still not too bad, especially if your current risk is low. I think virtual reality in particular is less of a risk than many other high-excitement activities because it involves more exertion than, say, normal video games or reading. I expect the increased exertion on net more than balances out any excitement risks.

Comment by Quintin Pope (quintin-pope) on How much compute was used to train DeepMind's generally capable agents? · 2021-07-31T04:22:42.842Z · LW · GW

Your link says rats have ~200 million neurons, but I think synapses are a better comparison for NN parameters. After all, both synapses and parameters roughly store how strong the connections between different neurons are.

Using synapse count, these agents are closer to guppies than to rats.

Comment by Quintin Pope (quintin-pope) on DeepMind: Generally capable agents emerge from open-ended play · 2021-07-29T22:01:23.977Z · LW · GW

The summary says they use text and a search for “text” in the paper gives this on page 32:

“In these past works, the goal usually consists of the position of the agent or a target observation to reach, however some previous work uses text goals (Colas et al., 2020) for the agent similarly to this work.”

So I thought they provided goals as text. I’ll be disappointed if they don’t. Hopefully, future work will do so (and potentially use pretrained LMs to process the goal texts).

Comment by Quintin Pope (quintin-pope) on DeepMind: Generally capable agents emerge from open-ended play · 2021-07-28T02:49:26.821Z · LW · GW

There are people who've been blind from birth. They're still generally intelligent. I think general intelligence is mostly applying powerful models to huge amounts of rich data. Human senses are sufficiently rich even without vision.

Also, there are lots of differences between human brains and current neural nets. E.g., brains are WAY more powerful than current NNs and train for years on huge amounts of incredibly rich sensory data.

Comment by Quintin Pope (quintin-pope) on DeepMind: Generally capable agents emerge from open-ended play · 2021-07-28T02:41:33.858Z · LW · GW

What really impressed me were the generalized strategies the agent applied to multiple situations/goals. E.g., "randomly move things around until something works" sounds simple, but learning to contextually apply that strategy 

  1. to the appropriate objects, 
  2. in scenarios where you don't have a better idea of what to do, and 
  3. immediately stopping when you find something that works 

is fairly difficult for deep agents to learn. I think of this work as giving the RL agents a toolbox of strategies that can be flexibly applied to different scenarios. 

I suspect that finetuning agents trained in XLand in other physical environments will give good results because the XLand agents already know how to use relatively advanced strategies. Learning to apply the XLand strategies to the new physical environments will probably be easier than starting from scratch in the new environment.