Posts

Eliciting Latent Knowledge Via Hypothetical Sensors 2021-12-30T15:53:30.450Z
Why GPT wants to mesa-optimize & how we might change this 2020-09-19T13:48:30.348Z
John_Maxwell's Shortform 2020-09-11T20:55:20.409Z
Are HEPA filters likely to pull COVID-19 out of the air? 2020-03-25T01:07:18.833Z
Comprehensive COVID-19 Disinfection Protocol for Packages and Envelopes 2020-03-15T10:00:33.170Z
Why don't singularitarians bet on the creation of AGI by buying stocks? 2020-03-11T16:27:20.600Z
When are immunostimulants/immunosuppressants likely to be helpful for COVID-19? 2020-03-05T21:44:08.288Z
The Goodhart Game 2019-11-18T23:22:13.091Z
Self-Fulfilling Prophecies Aren't Always About Self-Awareness 2019-11-18T23:11:09.410Z
What AI safety problems need solving for safe AI research assistants? 2019-11-05T02:09:17.686Z
The problem/solution matrix: Calculating the probability of AI safety "on the back of an envelope" 2019-10-20T08:03:23.934Z
The Dualist Predict-O-Matic ($100 prize) 2019-10-17T06:45:46.085Z
Replace judges with Keynesian beauty contests? 2019-10-07T04:00:37.906Z
Three Stories for How AGI Comes Before FAI 2019-09-17T23:26:44.150Z
How to Make Billions of Dollars Reducing Loneliness 2019-08-30T17:30:50.006Z
Response to Glen Weyl on Technocracy and the Rationalist Community 2019-08-22T23:14:58.690Z
Proposed algorithm to fight anchoring bias 2019-08-03T04:07:41.484Z
Raleigh SSC/LW/EA Meetup - Meet MealSquares People 2019-05-08T00:01:36.639Z
The Case for a Bigger Audience 2019-02-09T07:22:07.357Z
Why don't people use formal methods? 2019-01-22T09:39:46.721Z
General and Surprising 2017-09-15T06:33:19.797Z
Heuristics for textbook selection 2017-09-06T04:17:01.783Z
Revitalizing Less Wrong seems like a lost purpose, but here are some other ideas 2016-06-12T07:38:58.557Z
Zooming your mind in and out 2015-07-06T12:30:58.509Z
Purchasing research effectively open thread 2015-01-21T12:24:22.951Z
Productivity thoughts from Matt Fallshaw 2014-08-21T05:05:11.156Z
Managing one's memory effectively 2014-06-06T17:39:10.077Z
OpenWorm and differential technological development 2014-05-19T04:47:00.042Z
System Administrator Appreciation Day - Thanks Trike! 2013-07-26T17:57:52.410Z
Existential risks open thread 2013-03-31T00:52:46.589Z
Why AI may not foom 2013-03-24T08:11:55.006Z
[Links] Brain mapping/emulation news 2013-02-21T08:17:27.931Z
Akrasia survey data analysis 2012-12-08T03:53:35.658Z
Akrasia hack survey 2012-11-30T01:09:46.757Z
Thoughts on designing policies for oneself 2012-11-28T01:27:36.337Z
Room for more funding at the Future of Humanity Institute 2012-11-16T20:45:18.580Z
Empirical claims, preference claims, and attitude claims 2012-11-15T19:41:02.955Z
Economy gossip open thread 2012-10-28T04:10:03.596Z
Passive income for dummies 2012-10-27T07:25:33.383Z
Morale management for entrepreneurs 2012-09-30T05:35:05.221Z
Could evolution have selected for moral realism? 2012-09-27T04:25:52.580Z
Personal information management 2012-09-11T11:40:53.747Z
Proposed rewrites of LW home page, about page, and FAQ 2012-08-17T22:41:57.843Z
[Link] Holistic learning ebook 2012-08-03T00:29:54.003Z
Brainstorming additional AI risk reduction ideas 2012-06-14T07:55:41.377Z
Marketplace Transactions Open Thread 2012-06-02T04:31:32.387Z
Expertise and advice 2012-05-27T01:49:25.444Z
PSA: Learn to code 2012-05-25T18:50:01.407Z
Knowledge value = knowledge quality × domain importance 2012-04-16T08:40:57.158Z
Rationality anecdotes for the homepage? 2012-04-04T06:33:32.097Z

Comments

Comment by John_Maxwell (John_Maxwell_IV) on Where I agree and disagree with Eliezer · 2022-06-23T23:47:33.192Z · LW · GW
  • Power makes you dumb, stay humble.

  • Tell everyone in the organization that safety is their responsibility, everyone's views are important.

  • Try to be accessible and not intimidating, admit that you make mistakes.

  • Schedule regular chats with underlings so they don't have to take initiative to flag potential problems. (If you think such chats aren't a good use of your time, another idea is to contract someone outside of the organization to do periodic informal safety chats. Chapter 9 is about how organizational outsiders are uniquely well-positioned to spot safety problems. Among other things, it seems workers are sometimes more willing to share concerns frankly with an outsider than they are with their boss.)

  • Accept that not all of the critical feedback you get will be good quality.

The book recommends against anonymous surveys, on the grounds that they communicate the subtext that sharing your views openly is unsafe. I think anonymous surveys might be a good idea in the EA community though -- retaliation against critics seems fairly common here (that is, the culture of fear didn't come about by chance). Anyone who's been around here long enough will have figured out that sharing your views openly isn't safe. (See also the "People are pretty justified in their fears of critiquing EA leadership/community norms" bullet point here, and the last paragraph in this comment.)

Comment by John_Maxwell (John_Maxwell_IV) on Security Mindset: Lessons from 20+ years of Software Security Failures Relevant to AGI Alignment · 2022-06-23T11:40:04.016Z · LW · GW

Fair point. I also haven't done much posting since adding the bounty to my profile. Was thinking it might attract the attention of people reading the archives, but maybe there just aren't many archive readers.

Comment by John_Maxwell (John_Maxwell_IV) on How do I use caffeine optimally? · 2022-06-23T01:26:14.489Z · LW · GW

There is some observational evidence that coffee drinking increases lifespan (https://www.acpjournals.org/doi/10.7326/M21-2977); I think the proposed mechanism has to do with promoting autophagy. But it looks like decaf works too. (Decaf still has a bit of caffeine.)

I think somewhere else I read that unfiltered coffee doesn't improve lifespan, so try to drink the filtered stuff?

In my experience caffeine dependence is not a big deal and might help my sleep cycle.

Comment by John_Maxwell (John_Maxwell_IV) on Security Mindset: Lessons from 20+ years of Software Security Failures Relevant to AGI Alignment · 2022-06-22T23:20:34.876Z · LW · GW

Eliezer is a good example of someone who built a lot of status on the back of "breaking" others' unworkable alignment strategies. I found the AI Box experiments especially enlightening in my early days.

Fair enough.

My personal feeling is that poking holes in alignment strategies is easier than coming up with good ones, but I'm also aware that thinking that breaking is easy is probably committing some quantity of typical mind fallacy.

Yeah personally building feels more natural to me.

I agree a leaderboard would be great. I think it'd be cool to have a leaderboard for proposals as well -- "this proposal has been unbroken for X days" seems like really valuable information that's not currently being collected.

I don't think I personally have enough clout to muster the coordination necessary for a tournament or leaderboard, but you probably do. One challenge is that different proposals are likely to assume different sorts of available capabilities. I have a hunch that many disagreements which appear to be about alignment are actually about capabilities.

In the absence of coordination, I think if someone like you were to simply start advertising themselves as an "uberbreaker" who can shoot holes in any proposal, and over time give reports on which proposals seem the strongest, that could be really valuable and status-rewarding. Sort of a "pre-Eliezer" person who I can run my ideas by in a lower-stakes context, as opposed to saying "Hey Eliezer, I solved alignment -- wallop me if I'm wrong!"

Comment by John_Maxwell (John_Maxwell_IV) on Security Mindset: Lessons from 20+ years of Software Security Failures Relevant to AGI Alignment · 2022-06-22T10:42:31.164Z · LW · GW

I wrote a comment on your post with feedback.

I don't have anything prepared for red teaming at the moment -- I appreciate the offer though! Can I take advantage of it in the future? (Anyone who wants to give me critical feedback on my drafts should send me a personal message!)

Comment by John_Maxwell (John_Maxwell_IV) on Getting from an unaligned AGI to an aligned AGI? · 2022-06-22T10:39:21.210Z · LW · GW

I skimmed the post; here is some feedback (context):

  • I'm probably not the best person to red team this since some of my own alignment ideas are along similar lines. I'm also a bit on the optimistic side about alignment more generally -- it might be better to talk to a pessimist.

  • This sounds a bit like the idea of a "low-bandwidth oracle".

  • I think the biggest difficulty is the one you explicitly acknowledged -- boxing is hard.

  • But there are also problems around ensuring that bandwidth is actually limited. If you have a human check to see that the AGI's output conforms to the spec, the AGI could put an infohazard in the output, or write code which looks like it does one thing but actually does another. If you have a computer program or AI check the AI's output, the AGI could exploit a flaw in that program/AI. I think this sort of thing basically gets you a probabilistic safety measure, because there's always a risk that there's a flaw that the superintelligent AGI sees (or can infer) that you don't see. (I like this intuition pump for seeing why these sort of problems are plausible.) I think probabilistic safety measures can be good if we stack a lot of them together in the right way.

  • The idea of emitting machine-checkable proofs is interesting. I'm not sure such proofs are very useful though. "Finding the right spec is one of the biggest challenges in formal methods." - source. And finding the right spec seems more difficult to outsource to an unfriendly AI. In general, I think using AI to improve software reliability seems good, and tractable.

I think you'll find it easier to get feedback if you keep your writing brief. Assume the reader's time is valuable. Sentences like "I will mention some stuff later that maybe will make it more clear how I’d think about such a question." should simply be deleted -- make huge cuts. I think I might have been able to generate the bullet points above based on a 2-paragraph executive summary of your post. Maybe post a summary at the top, and say people are welcome to give feedback after just having read the summary.

Similarly, I think it is worth investing in clarity. If a sentence is unclear, I have a tendency to just keep reading and not mention it unless I have a prior that the author knows what they're talking about. (The older I get, the more I assume that unclear writing means the author is confused and ignorable.) I like writing advice from Paul Graham and Scott Adams.

Personally I'm more willing to give feedback on prepublication drafts because that gives me more influence on what people end up reading. I don't have much time to do feedback right now unfortunately.

Comment by John_Maxwell (John_Maxwell_IV) on Security Mindset: Lessons from 20+ years of Software Security Failures Relevant to AGI Alignment · 2022-06-22T08:01:05.825Z · LW · GW

Thanks for the reply!

As some background on my thinking here, last I checked there are a lot of people on the periphery of the alignment community who have some proposal or another they're working on, and they've generally found it really difficult to get quality critical feedback. (This is based on an email I remember reading from a community organizer a year or two ago saying "there is a desperate need for critical feedback".)

I'd put myself in this category as well -- I used to write a lot of posts and especially comments here on LW summarizing how I'd go about solving some aspect or another of the alignment problem, hoping that Cunningham's Law would trigger someone to point out a flaw in my approach. (In some cases I'd already have a flaw in mind along with a way to address it, but I figured it'd be more motivating to wait until someone mentioned a particular flaw in the simple version of the proposal before I mentioned the fix for it.)

Anyway, it seemed like people often didn't take the bait. (Thanks to everyone who did!) Even with offering $1000 to change my view, as I'm doing in my LW user profile now, I've had 0 takers. I stopped posting on LW/AF nearly as much, partly because it seemed more efficient to try to shoot holes in my ideas myself. On priors, I wouldn't have expected this to be true -- I'd expect someone else to be better at finding flaws in my ideas than I am myself, because they'll have a different way of looking at things which could address my blind spots.

Lately I've developed a theory for what's going on. You might be familiar with the idea that humans are often subconsciously motivated by the need to acquire & defend social status. My theory is that there's an asymmetry in the motivations for alignment building & breaking work. The builder has an obvious status motive: if you become the person who "solved AI alignment", that'll be really good for your social status. That causes builders to have status-motivated blind spots around weak points in their ideas.

The breaker, however, doesn't have an obvious status motive. In fact, if you go around shooting down people's ideas, that's liable to annoy them, which may hurt your social status. And since most proposals are allegedly easily broken anyway, you aren't signaling any kind of special talent by shooting them down. Hence the "breaker" role ends up being undervalued and disincentivized -- especially anything beyond just saying "that won't work". Finding a breaker who will describe a failure in detail instead of just vaguely gesturing seems really hard. (I don't always find such handwaving persuasive.)

I think this might be why Eliezer feels so overworked. He's staked a lot of reputation on the idea that AI alignment is a super hard problem. That gives him a unique status motive to play the red team role, which is why he's had a hard time replacing himself. I think maybe he's tried to compensate for this by making it low status to make a bad proposal, in order to browbeat people into self-critiquing their proposals. But this has a downside of discouraging the sharing of proposals in general, since it's hard to predict how others will receive your ideas. And punishments tend to be bad for creativity.

So yeah, I don't know if the tournament idea would have the immediate effect of generating deep insights. But it might motivate people to share their ideas, or generate better feedback loops, or better align overall status motives in the field, or generate a "useless" blacklist which leads to a deep insight, or filter through a large number of proposals to find the strongest ones. If tournaments were run on a quarterly basis, people could learn lessons, generate some deep ideas from those lessons, and spend a lot of time preparing for the next tournament.

A few other thoughts...

it's going to be a significant danger to have breakers run out of exploit ideas and mistake that for a win for the builders

Perhaps we could mitigate this by allowing breakers to just characterize how something might fail in vague terms -- obviously not as good as a specific description, but it still provides some signal to iterate on.

It might be a challenge to create a similarly engaging format that allows for longer deliberation times on these harder problems, but it's probably a worthwhile one.

I think something like a realtime Slack discussion could be pretty engaging. I think there is room for both high-deliberation and low-deliberation formats. [EDIT: You could also have a format in between, where the blue team gets little time, and the red team gets lots of time, to try to simulate the difference in intelligence between an AGI and its human operators.] Also, I'd expect even a slow, high-deliberation tournament format to be more engaging than the way alignment research often gets done (spend a bunch of time thinking on your own, write a post, observe post score, hopefully get a few good comments, discussion dies out as post gets old).

Comment by John_Maxwell (John_Maxwell_IV) on Security Mindset: Lessons from 20+ years of Software Security Failures Relevant to AGI Alignment · 2022-06-22T04:59:07.059Z · LW · GW

Thanks for writing this! Do you have any thoughts on doing a red team/blue team alignment tournament as described here?

Comment by John_Maxwell (John_Maxwell_IV) on Where I agree and disagree with Eliezer · 2022-06-20T11:25:09.111Z · LW · GW

Chapter 7 in this book had a few good thoughts on getting critical feedback from subordinates, specifically in the context of avoiding disasters. The book claims that merely encouraging subordinates to give critical feedback is often insufficient, and offers ideas for other things to do.

Comment by John_Maxwell (John_Maxwell_IV) on Book Review: Talent · 2022-06-04T11:04:07.939Z · LW · GW

And just as I was writing this I came across another good example of the ‘you think you’re in competition with others like you but mostly you’re simply trying to be good enough’

I'm straight, so possibly unreliable, but I remember Michael Curzi as a very good-looking guy with a deep sexy voice. I believe him when he says other dudes are not competition for him 95% of the time. ;-)

Comment by John_Maxwell (John_Maxwell_IV) on Open Thread - Jan 2022 [Vote Experiment!] · 2022-01-23T04:14:53.024Z · LW · GW

I wrote a comment here arguing that voting systems tend to encourage conformity. I think this is a way in which the LW voting system could be improved. You might get rid of the unlabeled quality axis and force downvoters to be specific about why they dislike the comment. Maybe readers could specify which weights they want to assign to the remaining axes in order to sort comments.

I think Agree/Disagree is better than True/False, and Understandable/Confusing would be better than Clear/Muddled. Both of these axes are functions of two things (the reader and the comment) rather than just one (the comment) and the existing labels implicitly assume that the person voting on the comment has a better perspective on it than the person who wrote it. I think the opposite is more likely true -- speaking personally at least, my votes tend to be less thoughtful than my comments.

Other axis ideas: constructive/nonconstructive, important/unimportant. Could also try a "thank" react, and an "intriguing" or "interesting" react (probably replacing "surprise" -- I like the idea of reinforcing novelty, but the word "surprise" seems like too high a bar?). Maybe also reacts for "this comment should've been longer/shorter"?

Comment by John_Maxwell (John_Maxwell_IV) on Counterexamples to some ELK proposals · 2022-01-01T07:07:43.080Z · LW · GW

I'll respond to the "Predict hypothetical sensors" section in this comment.

First, I want to mention that predicting hypothetical sensors seems likely to fail in fairly obvious ways, e.g. you request a prediction about a sensor that's physically nonexistent and the system responds with a bunch of static or something. Note the contrast with the "human simulator" failure mode, which is much less obvious.

But I also think we can train the system to predict hypothetical sensors in a way that's really useful. As in my previous comment, I'll work from the assumptions (fairly weak IMO) that

  1. We can control the data our systems get.

  2. We are capable of doing regular old supervised learning -- possibly in conjunction with transfer learning that gives the system generic prior knowledge like the meaning of English words, but not specific prior knowledge like details of our situation (unless we want that). Our supervised learning finds a function which maps training examples in X to labels in Y (labels may or may not correspond to "reality").

In particular, these assumptions imply that our system doesn't necessarily need to know whether a sensor it's trying to predict exists physically (or if it would be physically possible to build).

But what if, over the course of its operation, the system accidentally learns that a sensor of interest doesn't exist? For example, it might point a sensor that does exist in the direction where the nonexistent one is supposed to be and find nothing there. Ideally we could be reasonably confident of good "readings" from the nonexistent sensor even past that point.

To achieve this, we could make use of the "regular old supervised learning" assumption and construct a doctored dataset as follows:

  1. Place camera S1 in the center of the vault, turn it on, and have it noisily rotate 360 degrees to capture a panorama.

  2. Mount audio sensor S2 on the interior wall of the vault, turn both sensors on, and have S1 do the same panorama.

  3. Create a supervised learning problem to predict the readings of S2 gathered in step 2, given S1's panorama from step 1.

If we have prediction problems like this in our dataset, hypotheses which predict that S2 continues behaving as normal even if physically absent will achieve a lower loss, and should be preferentially chosen.
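To make that concrete, here's a rough sketch of how one such doctored training example could be assembled (the arrays and shapes here are invented stand-ins, not a real pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the real recordings (shapes are arbitrary, for illustration only):
s1_panorama_no_s2 = rng.random((360, 64, 64))  # step 1: S1's panorama, taken before S2 was mounted
s2_audio_with_s2 = rng.random(16000)           # step 2: audio S2 recorded during an identical panorama

# One doctored training example: the input shows a vault with no S2 on the wall,
# but the label is what S2 actually recorded once it existed.
doctored_example = {
    "input": {"s1": s1_panorama_no_s2},
    "label": {"s2": s2_audio_with_s2},
}

# Hypotheses that keep predicting normal S2 readings even when S2 looks absent
# fit examples like this better, so ordinary supervised learning prefers them.
```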

A similar method lets us preferentially choose hypotheses that predict that S2 will emit a special "TAMPERING ACTIVE" signal during tampering, then work as normal once tampering is complete:

  1. Place audio sensor S2 in the corner of the vault and turn it on.

  2. Move a robotic arm right next to S2.

  3. Have the arm tamper with S2 so it no longer hears the sound of footfalls.

  4. Discard S2 and replace it with a fresh new sensor, S2', that has identical factory specifications.

  5. Have someone walk beneath S2' in full view of camera sensor S1.

  6. Splice together a composite audio sensor reading with the following sequence: S2 readings from steps 1 and 2, a "TAMPERING ACTIVE" signal for the duration of step 3, nothing for step 4, and the S2' readings from step 5.

  7. Create a supervised learning problem to predict the composite reading from step 6, given the movements of the robot arm from steps 2 and 3, plus the visual readings from S1 with step 4 snipped out.

The "S2 is vulnerable to tampering" hypothesis predicts that footfalls should not be present in the audio. But since they are present in the audio, it will achieve a higher loss. Failure to predict the emission of the "TAMPERING ACTIVE" signal also gives this hypothesis a higher loss.

It might also be possible to do away with the "TAMPERING ACTIVE" signal somehow, e.g. by making S2 and S2' identical except that the tampering which works on S2 doesn't work on S2', then splicing together readings from the robot arm doing the exact same tamper sequence on both sensors.

Finally, here are a few ideas for preferentially choosing hypotheses that predict accurate readings from technologically impossible sensors.

Supersensor Idea #1

  1. Write a natural language description of a "telepathic sensor". Describe it as being mounted right above a camera sensor S1.

  2. Place boxes X, Y, and Z within the vault beneath the watchful eye of S1.

  3. Turn on S1. Have a person enter the vault, visibly ponder for a moment, and open box Y.

  4. Create a stream of "telepathic sensor" readings. Find a timestamp in the video where the person is pondering and hasn't yet made a movement. Add a reading from the "telepathic sensor" at that timestamp: "The person intends to open box Y".

Supersensor Idea #2

  1. Write a natural language description of an "x-ray vision" sensor which can be pointed in various directions. Describe it as being mounted right above a camera sensor S1.

  2. Place a deck of cards on a table beneath the watchful eye of S1.

  3. Have a person enter the vault, shuffle the deck beneath the table, show a 9 of hearts to S1, and place it face down on top of the deck.

  4. Create a stream of "x-ray vision" sensor readings. Specify that the sensor is initially pointed away, but once the action finishes, the sensor turns towards the table and registers: "The top card of the deck is a 9 of hearts."

Supersensor Idea #3

  1. Write a natural language description of a "nanobot sensor" which beeps whenever a nanobot is present within 10 feet.

  2. Run an action sequence which builds nanobots and moves them within 10 feet of where the "nanobot sensor" is mounted.

  3. Create a stream of "nanobot sensor" readings. Specify that the sensor starts beeping as soon as the nanobots are nearby.

A few more thoughts on translating natural language into sensor readings:

  • Excise a real sensor type from the system's prior knowledge. Ask your system to predict sensor data from a physical instance of this sensor, given a natural language description of its workings plus other readings from the environment. (H/T Romeo Stevens)

  • Make a weird modified sensor (e.g. a camera sensor which provides an upside down picture). Ask your system to predict readings from the modified sensor, given a natural language description of its modifications plus other readings from the environment.

Anyway, I'm not sure we need to reason about a physically impossible counterfactual or condition on the sensor existing physically. It seems perfectly coherent to ask "what is the answer to this thought experiment" rather than "if this sensor existed, what would it see"? For example, instead of the question "what would an Acme Corp camera mounted here see", consider the question "if the light which passes through a pinhole at these coordinates intersected with a plane at these other coordinates, and the intersections were digitized and formatted the same way Acme Corp cameras format photos, what would be the resulting binary file?"
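As a toy illustration of treating the "sensor" as a well-defined computation rather than a physical object (the pinhole geometry and pixel mapping here are made up, and the "formatting" step is just raw bytes):

```python
import numpy as np

def pinhole_reading(points_3d, pinhole, focal_length=0.05, resolution=(64, 64)):
    """Project 3-D points through a pinhole onto an image plane and return a
    crude binary photo -- a stand-in for formatting it like an Acme Corp camera."""
    image = np.zeros(resolution, dtype=np.uint8)
    rel = points_3d - pinhole                     # positions relative to the pinhole
    in_front = rel[:, 2] > 0                      # keep points in front of the pinhole
    u = focal_length * rel[in_front, 0] / rel[in_front, 2]
    v = focal_length * rel[in_front, 1] / rel[in_front, 2]
    # Map image-plane coordinates onto pixel indices (scale chosen arbitrarily).
    cols = np.clip(((u / focal_length + 1) / 2 * resolution[1]).astype(int), 0, resolution[1] - 1)
    rows = np.clip(((v / focal_length + 1) / 2 * resolution[0]).astype(int), 0, resolution[0] - 1)
    image[rows, cols] = 255
    return image.tobytes()                        # the "resulting binary file"

# The question "what bytes does this computation output?" is well-posed whether
# or not any physical camera is mounted at `pinhole`.
```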

Humans don't seem to have a lot of trouble performing thought experiments. If the system tries to fit the data with a hypothesis that references existing pretrained conceptual understanding, as I described above, that could give the system an inductive bias towards performing human-like thought experiments. This could be bad if human thought experiments are vulnerable to human deficiencies. It could also be good if we'd like the AI's hypothetical sensors to behave in the same intuitive way our thought experiments do.

One possible concern is hypotheses which reference dataset doctoring. Obviously one could try to excise knowledge of that possibility. Another quick idea is to try & train a classifier to differentiate doctored vs non-doctored SmartVault sequences, and keep improving our fakes until the classifier can't easily tell the difference? Or try to avoid any sort of branching so the system always acts like it's dealing with a doctored dataset when in production? Could even fuzz the live data stream in a way that makes it appear doctored ;-) Finally, to get a sense for the cognitive signature of a doctoring-related hypothesis, one could train the system to solve some problems where the only way to minimize the loss is to think a lot about doctoring. Maybe a classifier which aims to detect the presence of doctoring-related cognition could be useful here.

Another possibility is an alternative hypothesis along the lines of "predict what the operator would want me to predict" -- unclear if that's desirable?

Comment by John_Maxwell (John_Maxwell_IV) on Counterexamples to some ELK proposals · 2022-01-01T02:29:00.242Z · LW · GW

Thanks for the reply! I'll respond to the "Hold out sensors" section in this comment.

One assumption which seems fairly safe in my mind is that as the operators, we have control over the data our AI gets. (Another way of thinking about it is if we don't have control over the data our AI gets, the game has already been lost.)

Given that assumption, this problem seems potentially solvable:

Moreover, my AI may be able to deduce the presence of the additional sensors very cheaply. Perhaps it can notice the sensors, or it can learn about my past actions to get a hint about where I may have placed them. If this is possible, then “Predict the readings of all sensors” need not be much more complex than “Predict the readings of one sensor.”

If the SmartVault learns a policy from data which was all generated prior to the installation of the new sensors, it seems unlikely that policy would deliberately account for the existence of those specific new sensors. We could leave info about your past actions out of the dataset as well.

An alternative approach in response to this bit:

I’m concerned that I will learn a better version of the human simulator which predicts the readings of all sensors and then outputs what a human would infer from the complete set.

The scenario is: we're learning a function F1(A, S1) -> D where A is an action sequence, S1 is readings from the known sensor, and D is a diamond location. Previously we've discussed two functions which both achieve perfect loss on our training data:

  • D1(A, S1) -- a direct translator which takes A and S1 into account

  • H1(A, S1) -- a simulation of what a human would believe if they saw A and S1

Let's also consider two other functions:

  • D2(A, S1, S2) -- a direct translator which takes A, S1, and S2 into account

  • H2(A, S1, S2) -- a simulation of what a human would believe if they saw A, S1, and S2

Your concern is that there is a third function on the original (A, S1) domain which also achieves perfect loss:

  • H1'(A, S1) = H2(A, S1, P_S2(A, S1)) -- defining P_S2 as a prediction of S2's readings given A & S1, we have H1' as a simulation of what a human would believe if they saw A, S1, and readings for S2 predicted from A & S1.

Why would it be bad if gradient descent discovered H1'? Because then, when it comes time to learn a policy, we incentivize policies which fool the predicted readings for S2 in addition to fooling S1.

Here's an idea for obtaining a function on the original (A, S1) domain which does not incentivize policies which deceive S2:

  1. Learn a function F2 on the expanded domain (A, S1, S2), using a training set which is carefully constructed so that the only way to achieve perfect loss is to do a good job of taking readings from S2 into account. (For example, deliberately construct scenarios where the readings from S2 are not what you would expect if you were only looking at A and S1, and make ignoring A & S1 in favor of S2 key to labeling those scenarios correctly.) F2 could be closer to either D2 or H2, I don't think it matters much.

  2. Define a function F1_only(A, S1) = F2(A, S1, <hardcoded stream of boring S2 sensor data>).

Now let's use F1_only as the target for learning our policy. I argue a policy has no incentive to deceive S2, because we know that F2 has been optimized to trust its S2 argument over its A and S1 arguments regarding what is going on around S2, and when we call F2 through F1_only, its S2 argument will always be telling it there are no interesting readings coming from S2. So, no bonus points for a policy which tries to fool S2 in addition to S1.
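Here's a minimal sketch of the wrapper I have in mind (the "boring" S2 stream is of course just a hardcoded stand-in):

```python
import numpy as np

BORING_S2 = np.zeros(16000)  # stand-in for "nothing interesting happening on S2"

def make_f1_only(f2):
    """Wrap a predictor F2 trained on (A, S1, S2) so it can be used on (A, S1),
    always feeding it an uneventful S2 stream."""
    def f1_only(actions, s1_readings):
        return f2(actions, s1_readings, BORING_S2)
    return f1_only

# Policy learning then optimizes against f1_only. Since F2 was trained to trust
# its S2 argument about what's happening near S2, and that argument always says
# "nothing going on", a policy gets no extra credit for fooling S2.
```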

Maybe there is some kind of unintended consequence to this weird setup; I just came up with it and it's still a bit half-baked in my mind. (Perhaps you could make some kind of exotic argument on the basis of inner optimizers and acausal trade between different system components?) But the meta point is there's a lot of room for creativity if you don't anthropomorphize and just think in terms of learning functions on datasets. I think the consequences of the "we control the data our AIs get" assumption could be pretty big if you're willing to grant it.

Comment by John_Maxwell (John_Maxwell_IV) on ARC's first technical report: Eliciting Latent Knowledge · 2021-12-30T16:34:12.367Z · LW · GW

I wrote a post in response to the report: Eliciting Latent Knowledge Via Hypothetical Sensors.

Some other thoughts:

  • I felt like the report was unusually well-motivated when I put my "mainstream ML" glasses on, relative to a lot of alignment work.

  • ARC's overall approach is probably my favorite out of alignment research groups I'm aware of. I still think running a builder/breaker tournament of the sort proposed at the end of this comment could be cool.

  • Not sure if this is relevant in practice, but... the report talks about Bayesian networks learned via gradient descent. From what I could tell after some quick Googling, it doesn't seem all that common to do this, and it's not clear to me if there has been any work at all on learning the node structure (as opposed to internal node parameters) via gradient descent. It seems like this could be tricky because the node structure is combinatorial in nature and thus less amenable to a continuous optimization technique like gradient descent.

  • There was recently a discussion on LW about a scenario similar to the SmartVault one here. My proposed solution was to use reward uncertainty -- as applied to the SmartVault scenario, this might look like: "train lots of diverse mappings between the AI's ontology and that of the human; if even one mapping of a situation says the diamond is gone according to the human's ontology, try to figure out what's going on". IMO this general sort of approach is quite promising, interested to discuss more if people have thoughts.

Comment by John_Maxwell (John_Maxwell_IV) on The Plan · 2021-12-21T11:56:42.823Z · LW · GW

(Well, really I expect it to take <12 months, but planning fallacy and safety margins and time to iterate a little and all that.)

There's also red teaming time, and lag in idea uptake/marketing, to account for. It's possible that we'll have the solution to FAI when AGI gets invented, but the inventor won't be connected to our community and won't be aware of/sold on the solution.

Edit: Don't forget to account for the actual engineering effort to implement the safety solution and integrate it with capabilities work. Ideally there is time for extensive testing and/or formal verification.

Comment by John_Maxwell (John_Maxwell_IV) on Interpreting Yudkowsky on Deep vs Shallow Knowledge · 2021-12-12T06:14:36.049Z · LW · GW

Yes, if you've just created it, then the criteria are meaningfully different in that case for a very limited time.

It's not obvious to me that this is only true right after creation for a very limited time. What is supposed to change after that?

I don't see how we're getting off track. (Your original statement was: 'One such "clever designer" idea is decoupling plan generation from plan execution, which really just means that the plan generator has humans as part of the initial plan executing hardware.' If we're discussing situations where that claim may be false, it seems to me we're still on track.) But you shouldn't feel obligated to reply if you don't want to. Thanks for your replies so far, btw.

Comment by John_Maxwell (John_Maxwell_IV) on Interpreting Yudkowsky on Deep vs Shallow Knowledge · 2021-12-11T12:52:35.652Z · LW · GW

My point is that plan execution can't be decoupled successfully from plan generation in this way. "Outputting a plan" is in itself an action that affects the world, and an unfriendly superintelligence restricted to only producing plans will still win.

"Outputting a plan" may technically constitute an action, but a superintelligent system (defining "superintelligent" as being able to search large spaces quickly) might not evaluate its effects as such.

Yes, it is possible for plans to score highly under the first criterion but not the second. However, in this scenario the humans are presumably going to discourage such plans, so they effectively score the same as the second criterion.

I think you're making a lot of assumptions here. For example, let's say I've just created my planner AI, and I want to test it out by having it generate a paperclip-maximizing plan, just for fun. Is there any meaningful sense in which the displayed plan will be optimized for the criteria "plans which lead to lots of paperclips if shown to humans"? If not, I'd say there's an important effective difference.

If the superintelligent search system also has an outer layer that attempts to collect data about my plan preferences and model them, then I agree there's the possibility of incorrect modeling, as discussed in this subthread. But it seems anthropomorphic to assume that such a search system must have some kind of inherent real-world objective that it's trying to shift me towards with the plans it displays.

Comment by John_Maxwell (John_Maxwell_IV) on Interpreting Yudkowsky on Deep vs Shallow Knowledge · 2021-12-09T21:25:29.265Z · LW · GW

The main problem is that "acting via plans that are passed to humans" is not much different from "acting via plans that are passed to robots" when the AI is good enough at modelling humans.

I agree this is true. But I don't see why "acting via plans that are passed to humans" is what would happen.

I mean, that might be a component of the plan which is generated. But the assumption here is that we've decoupled plan generation from plan execution successfully, no?

So we therefore know that the plan we're looking at (at least at the top level) is the result of plan generation, not the first step of plan execution (as you seem to be implicitly assuming?)

The AI is searching for plans which score highly according to some criteria. The criteria of "plans which lead to lots of paperclips if implemented" is not the same as the criteria of "plans which lead to lots of paperclips if shown to humans".

Comment by John_Maxwell (John_Maxwell_IV) on Interpreting Yudkowsky on Deep vs Shallow Knowledge · 2021-12-08T19:25:33.960Z · LW · GW

I agree these are legitimate concerns... these are the kind of "deep" arguments I find more persuasive.

In that thread, johnswentworth wrote:

In particular, even if we have a reward signal which is "close" to incentivizing alignment in some sense, the actual-process-which-generates-the-reward-signal is likely to be at least as simple/natural as actual alignment.

I'd solve this by maintaining uncertainty about the "reward signal", so the AI tries to find a plan which looks good under both alignment and the actual-process-which-generates-the-reward-signal. (It doesn't know which is which, but it tries to learn a sufficiently diverse set of reward signals such that alignment is in there somewhere. I don't think we can do any better than this, because the entire point is that there is no way to disambiguate between alignment and the actual-process-which-generates-the-reward-signal by gathering more data. Well, I guess maybe you could do it with interpretability or the right set of priors, but I would hesitate to make those load-bearing.)
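A cartoon version of what I mean by maintaining uncertainty (the ensemble-plus-worst-case aggregation here is just one way it could work):

```python
def plan_score(plan, reward_models):
    """Score a plan conservatively under an ensemble of candidate reward signals,
    one of which is hopefully alignment and another the actual label-generating process."""
    return min(model(plan) for model in reward_models)

def choose_plan(candidate_plans, reward_models):
    # Prefer plans that look good under every candidate reward signal, since we
    # can't tell which candidate is the one we actually care about.
    return max(candidate_plans, key=lambda plan: plan_score(plan, reward_models))
```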

(BTW, potentially interesting point I just thought of. I'm gonna refer to actual-process-which-generates-the-reward-signal as "approval". Supposing for a second that it's possible to disambiguate between alignment and approval somehow, and we successfully aim at alignment and ignore approval. Then we've got an AI which might deliberately do aligned things we disapprove of. I think this is not ideal, because from the outside this behavior is also consistent with an AI which has learned approval incorrectly. So we'd want to flip the off switch for the sake of caution. Therefore, as a practical matter, I'd say that you should aim to satisfy both alignment and approval anyways. I suppose you could argue that on the basis of the argument I just gave, satisfying approval is therefore part of alignment and thus this is an unneeded measure, but overall the point is that aiming to satisfy both alignment and approval seems to have pretty low costs.)

(I suppose technically you can disambiguate between alignment and approval if there are unaligned things that humans would approve of -- I figure you solve this problem by making your learning algorithm robust against mislabeled data.)

Anyway, you could use a similar approach for the nice plans problem, or you could formalize a notion of "manipulation" which is something like: conditional on the operator viewing this plan, does their predicted favorability towards subsequent plans change on expectation?
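One rough way to write that down (notation mine, and "baseline" is whatever comparison we pick, e.g. the operator not having seen the plan):

\[
\text{manipulation}(p) \;=\; \mathbb{E}\big[\text{favorability toward subsequent plans} \mid \text{operator views } p\big] \;-\; \mathbb{E}\big[\text{favorability toward subsequent plans} \mid \text{baseline}\big]
\]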

Edit: Another thought is that the delta between "approval" and "alignment" seems like the delta between me and my CEV. So to get from "approval" to "alignment", you could ask your AI to locate the actual-process-which-generates-the-labels, and then ask it about how those labels would be different if we "knew more, thought faster, were more the people we wished we were" etc. (I'm also unclear why you couldn't ask a hyper-advanced language model what some respected moral philosophers would think if they were able to spend decades contemplating your question or whatever.)

Another edit: You could also just manually filter through all the icky plans until you find one which is non-icky.

(Very interested in hearing objections to all of these ideas.)

Comment by John_Maxwell (John_Maxwell_IV) on Interpreting Yudkowsky on Deep vs Shallow Knowledge · 2021-12-08T09:48:56.444Z · LW · GW

One such "clever designer" idea is decoupling plan generation from plan execution, which really just means that the plan generator has humans as part of the initial plan executing hardware. You don't need a deep argument to point out an obvious flaw there.

I don't see the "obvious flaw" you're pointing at and would appreciate a more in-depth explanation.

In my mind, decoupling plan generation from plan execution, if done well, accomplishes something like this:

  • You ask your AGI to generate a plan for how it could maximize paperclips.

  • Your AGI generates a plan. "Step 1: Manipulate human operator into thinking that paperclips are the best thing ever, using the following argument..."

  • You stop reading the plan at that point, and don't click "execute" for it.

Comment by John_Maxwell (John_Maxwell_IV) on Interpreting Yudkowsky on Deep vs Shallow Knowledge · 2021-12-07T04:55:18.668Z · LW · GW

For what it's worth, I often find Eliezer's arguments unpersuasive because they seem shallow. For example:

The insight is in realizing that the hypothetical planner is only one line of outer shell command away from being a Big Scary Thing and is therefore also liable to be Big and Scary in many ways.

This seems like a fuzzy "outside view" sort of argument. (Compare with: "A loaded gun is one trigger pull away from killing someone and is therefore liable to be deadly in many ways." On the other hand, a causal model of a gun lets you explain which specific gun operations can be deadly and why.)

I'm not saying Eliezer's conclusion is false. I find other arguments for that conclusion much more persuasive, e.g. involving mesa-optimizers, because there is a proposed failure type which I understand in causal/mechanistic terms.

(I can provide other examples of shallow-seeming arguments if desired.)

Comment by John_Maxwell (John_Maxwell_IV) on Visible Thoughts Project and Bounty Announcement · 2021-12-04T00:38:06.889Z · LW · GW

As the proposal stands it seems like the AI's predictions of human thoughts would offer no relevant information about how the AI is predicting the non-thought story content, since the AI could be predicting these different pieces of content through unrelated mechanisms.

Might depend on whether the "thought" part comes before or after particular story text. If the "thought" comes after that story text, then it's generated conditional on that text, essentially a rationalization of that text from a hypothetical DM's point of view. If it comes before that story text, then the story is being generated conditional on it.

Personally I think I might go for a two-phase process. Do the task with a lot of transparent detail in phase 1. Summarize that detail and filter out infohazards in phase 2, but link from the summary to the detailed version so a human can check things as needed (flagging links to plausible infohazards). (I guess you could flag links to parts that seemed especially likely to be incorrigible/manipulative cognition, or parts of the summary that the summarizer was less confident in, as well.)

Comment by John_Maxwell (John_Maxwell_IV) on Why don't singularitarians bet on the creation of AGI by buying stocks? · 2021-11-09T23:29:48.049Z · LW · GW

I updated the post to note that if you want voting rights in Google, it seems you should buy $GOOGL not $GOOG. Sorry! Luckily they are about the same price, and you can easily dump your $GOOG for $GOOGL. In fact, it looks like $GOOGL is $6 cheaper than $GOOG right now? Perhaps because it is less liquid?

Comment by John_Maxwell (John_Maxwell_IV) on How to turn money into AI safety? · 2021-08-28T00:00:10.487Z · LW · GW

Fraud also seems like the kind of problem you can address as it comes up. And I suspect just requiring people to take a salary cut is a fairly effective way to filter for idealism.

All you have to do to distract fraudsters is put, at the top of the application, a list of poorly run software companies where you can get paid more money to work less hard ;-) How many fraudsters would be silly enough to bother with a fraud opportunity that wasn't on the Pareto frontier?

Comment by John_Maxwell (John_Maxwell_IV) on How to turn money into AI safety? · 2021-08-27T23:52:29.390Z · LW · GW

The problem comes when one tries to pour a lot of money into that sort of approach

It seems to me that the Goodhart effect is actually stronger if you're granting less money.

Suppose that we have a population of people who are keen to work on AI safety. Suppose every time a person from that population gets an application for funding rejected, they lose a bit of the idealism which initially drew them to the area and they start having a few more cynical thoughts like "my guess is that grantmakers want to fund X, maybe I should try to be more like X even though I don't personally think X is a great idea."

In that case, the level of Goodharting seems to be pretty much directly proportional to the number of rejections -- and the less funding available, the greater the quantity of rejections.

On the other hand, if the United Nations got together tomorrow and decided to fund a worldwide UBI, there'd be no optimization pressure at all, and people would just do whatever seemed best to them personally.

EDIT: This appears to be a concrete example of what I'm describing.

Comment by John_Maxwell (John_Maxwell_IV) on How to turn money into AI safety? · 2021-08-27T23:27:01.591Z · LW · GW

I think if you're in the early stages of a big project, like founding a pre-paradigmatic field, it often makes sense to be very breadth-first. You can save a lot of time trying to understand the broad contours of solution space before you get too deeply invested in a particular approach.

I think this can even be seen at the microscale (e.g. I was coaching someone on how to solve leetcode problems the other day, and he said my most valuable tip was to brainstorm several different approaches before exploring any one approach in depth). But it really shines at the macroscale ("you built entirely the wrong product because you didn't spend enough time talking to customers and exploring the space of potential offerings in a breadth-first way").

One caveat is that breadth-first works best if you have a good heuristic. For example, if someone with less than a year of programming experience were practicing leetcode problems, I wouldn't emphasize the importance of brainstorming multiple approaches as much, because I wouldn't expect them to have a well-developed intuition for which approaches will work best. For someone like that, I might recommend going depth-first almost at random until their intuition is developed (random rollouts in the context of Monte Carlo tree search are a related notion). I think there is actually some psych research showing that more experienced engineers will spend more time going breadth-first at the beginning of a project.

A synthesis of the above is: if AI safety is pre-paradigmatic, we want lots of people exploring a lot of different directions. That lets us understand the broad contours better, and also collects data to help refine our intuitions.

IMO the AI safety community has historically not been great at going breadth-first, e.g. investing a lot of effort in the early days into decision theory stuff which has lately become less fashionable. I also think people are overconfident in their intuitions about what will work, relative to the amount of time which has been spent going depth-first and trying to work out details related to "random" proposals.

In terms of turning money into AI safety, this strategy is "embarrassingly parallel" in the sense that it doesn't require anyone to wait for a standard textbook or training program, or get supervision from some critical person. In fact, having a standard curriculum or a standard supervisor could be counterproductive, since it gets people anchored on a particular frame, which means a less broad area gets explored. If there has to be central coordination, it seems better to make a giant list of literatures which could provide insight, then assign each literature to a particular researcher to acquire expertise in.

After doing parallel exploration, we could do a reduction tree. Imagine if we ran an AI safety tournament where you could sign up as "red team", "blue team", or "judge". At each stage, we generate tuples of (red player, blue player, judge) at random and put them in a video call or a Google Doc. The blue player tries to make a proposal, the red player tries to break it, the judge tries to figure out who won. Select the strongest players on each team at each stage and have them advance to the next stage, until you're left with the very best proposals and the very most difficult to solve issues. Then focus attention on breaking those proposals / solving those issues.

Comment by John_Maxwell (John_Maxwell_IV) on John_Maxwell's Shortform · 2021-06-09T04:00:50.100Z · LW · GW

Yes, I tried it. It gave me a headache but I would guess that's not common. Think it's probably a decent place to start.

Comment by John_Maxwell (John_Maxwell_IV) on John_Maxwell's Shortform · 2021-06-07T07:57:28.696Z · LW · GW

I didn't end up sticking to this because of various life disruptions. I think it was a bit helpful but I'm planning to try something more intensive next time.

Comment by John_Maxwell (John_Maxwell_IV) on Testing The Natural Abstraction Hypothesis: Project Intro · 2021-04-09T05:40:04.961Z · LW · GW

I'm glad you are thinking about this. I am very optimistic about AI alignment research along these lines. However, I'm inclined to think that the strong form of the natural abstraction hypothesis is pretty much false. Different languages and different cultures, and even different academic fields within a single culture (or different researchers within a single academic field), come up with different abstractions. See for example lsusr's posts on the color blue or the flexibility of abstract concepts. (The Whorf hypothesis might also be worth looking into.)

This is despite humans having pretty much identical cognitive architectures (and it seems unrealistic to assume we can create a de novo AGI with a cognitive architecture as similar to a human brain as human brains are to each other). Perhaps you could argue that some human-generated abstractions are "natural" and others aren't, but that leaves the problem of ensuring that the human operating our AI is making use of the correct, "natural" abstractions in their own thinking. (Some ancient cultures lacked a concept of the number 0. From our perspective, and that of a superintelligent AGI, 0 is a 'natural' abstraction. But there could be ways in which the superintelligent AGI invents 'natural' abstractions that we haven't yet invented, such that we are living in a "pre-0 culture" with respect to those abstractions, and this would cause an ontological mismatch between us and our AGI.)

But I'm still optimistic about the overall research direction. One reason is if your dataset contains human-generated artifacts, e.g. pictures with captions written in English, then many unsupervised learning methods will naturally be incentivized to learn English-language abstractions to minimize reconstruction error. (For example, if we're using self-supervised learning, our system will be incentivized to correctly predict the English-language caption beneath an image, which essentially requires the system to understand the picture in terms of English-language abstractions. This incentive would also arise for the more structured supervised learning task of image captioning, but the results might not be as robust.)

This is the natural abstraction hypothesis in action: across the sciences, we find that low-dimensional summaries of high-dimensional systems suffice for broad classes of “far-away” predictions, like the speed of a sled.

Social sciences are a notable exception here. And I think social sciences (or even humanities) may be the best model for alignment--'human values' and 'corrigibility' seem related to the subject matter of these fields.

Anyway, I had a few other comments on the rest of what you wrote, but I realized what they all boiled down to was me having a different set of abstractions in this domain than the ones you presented. So as an object lesson in how people can have different abstractions (heh), I'll describe my abstractions (as they relate to the topic of abstractions) and then explain how they relate to some of the things you wrote.

I'm thinking in terms of minimizing some sort of loss function that looks vaguely like

reconstruction_error + other_stuff

where reconstruction_error is a measure of how well we're able to recreate observed data after running it through our abstractions, and other_stuff is the part that is supposed to induce our representations to be "useful" rather than just "predictive". You keep talking about conditional independence as the be-all-end-all of abstraction, but from my perspective, it is an interesting (potentially novel!) option for the other_stuff term in the loss function. The same way dropout was once an interesting and novel other_stuff which helped supervised learning generalize better (making neural nets "useful" rather than just "predictive" on their training set).

The most conventional choice for other_stuff would probably be some measure of the complexity of the abstraction. E.g. a clustering algorithm's complexity can be controlled through the number of centroids, or an autoencoder's complexity can be controlled through the number of latent dimensions. Marcus Hutter seems to be as enamored with compression as you are with conditional independence, to the point where he created the Hutter Prize, which offers half a million euros to the person who can best compress a 1GB file of Wikipedia text.

Another option for other_stuff would be denoising, as we discussed here.
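To make the shape of this concrete, here's the kind of loss I'm picturing, with a simple complexity penalty standing in for other_stuff (conditional independence, denoising, etc. would slot into the same place; encoder and decoder are assumed to be whatever differentiable modules you're using):

```python
import torch

def abstraction_loss(x, encoder, decoder, other_stuff_weight=0.1):
    """reconstruction_error + other_stuff, with other_stuff instantiated here
    as an L1 sparsity penalty on the latent code (one of many possible choices)."""
    z = encoder(x)                          # the learned abstraction
    x_hat = decoder(z)                      # reconstruction from the abstraction
    reconstruction_error = torch.mean((x - x_hat) ** 2)
    other_stuff = torch.mean(torch.abs(z))
    return reconstruction_error + other_stuff_weight * other_stuff
```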

You speak of an experiment to "run a reasonably-detailed low-level simulation of something realistic; see if info-at-a-distance is low-dimensional". My guess is if the other_stuff in your loss function consists only of conditional independence things, your representation won't be particularly low-dimensional--your representation will see no reason to avoid the use of 100 practically-redundant dimensions when one would do the job just as well.

Similarly, you speak of "a system which provably learns all learnable abstractions", but I'm not exactly sure what this would look like, seeing as how for pretty much any abstraction, I expect you can add a bit of junk code that marginally decreases the reconstruction error by overfitting some aspect of your training set. Or even junk code that never gets run / other functional equivalences.

The right question in my mind is how much info at a distance you can get for how many additional dimensions. There will probably be some number of dimensions N such that giving your system more than N dimensions to play with for its representation will bring diminishing returns. However, that doesn't mean the returns will go to 0, e.g. even after you have enough dimensions to implement the ideal gas law, you can probably gain a bit more predictive power by checking for wind currents in your box. See the elbow method (though, the existence of elbows isn't guaranteed a priori).

(I also think that an algorithm to "provably learn all learnable abstractions", if practical, is a hop and a skip away from a superintelligent AGI. Much of the work of science is learning the correct abstractions from data, and this algorithm sounds a lot like an uberscientist.)

Anyway, in terms of investigating convergence, I'd encourage you to think about the inductive biases induced by both your loss function and also your learning algorithm. (We already know that learning algorithms can have different inductive biases than humans, e.g. it seems that the input-output surfaces for deep neural nets aren't as biased towards smoothness as human perceptual systems, and this allows for adversarial perturbations.) You might end up proving a theorem which has required preconditions related to the loss function and/or the algorithm's inductive bias.

Another riff on this bit:

This is the natural abstraction hypothesis in action: across the sciences, we find that low-dimensional summaries of high-dimensional systems suffice for broad classes of “far-away” predictions, like the speed of a sled.

Maybe we could differentiate between the 'useful abstraction hypothesis' and the stronger 'unique abstraction hypothesis'. This statement supports the 'useful abstraction hypothesis', but the 'unique abstraction hypothesis' is the one where alignment becomes way easier, because we and our AGI are using the same abstractions. (Even though I'm only a believer in the useful abstraction hypothesis, I'm still optimistic, because I tend to think we can have our AGI cast a net wide enough to capture enough useful abstractions that ours are in there somewhere, and this number will be manageable enough to find the right abstractions from within that net--or something vaguely like that.) In terms of science, the 'unique abstraction hypothesis' doesn't just say scientific theories can be useful; it also says there is only one 'natural' scientific theory for any given phenomenon, and the existence of competing scientific schools sorta seems to disprove this.

Anyway, the aspect of your project that I'm most optimistic about is this one:

This raises another algorithmic problem: how do we efficiently check whether a cognitive system has learned particular abstractions? Again, this doesn’t need to be fully general or arbitrarily precise. It just needs to be general enough to use as a tool for the next step.

Since I don't believe in the "unique abstraction hypothesis", checking whether a given abstraction corresponds to a human one seems important to me. The problem seems tractable, and a method that's abstract enough to work across a variety of different learning algorithms/architectures (including stuff that might get invented in the future) could be really useful.

Comment by John_Maxwell (John_Maxwell_IV) on Vim · 2021-04-08T22:47:49.001Z · LW · GW

Interesting, thanks for sharing.

I couldn't figure out how to go backwards easily.

Command-shift-g right?

Comment by John_Maxwell (John_Maxwell_IV) on Vim · 2021-04-07T23:34:54.998Z · LW · GW

After practicing Vim for a few months, I timed myself doing the Vim tutorial (vimtutor on the command line) using both Vim with the commands recommended in the tutorial, and a click-and-type editor. The click-and-type editor was significantly faster. Nowadays I just use Vim for the macros, if I want to do a particular operation repeatedly on a file.

I think if you get in the habit of double-clicking to select words and triple-clicking to select lines (triple-click and drag to select blocks of code), click-and-type editors can be pretty fast.

Comment by John_Maxwell (John_Maxwell_IV) on Theory of Knowledge (rationality outreach) · 2021-04-01T04:27:42.153Z · LW · GW

Here is one presentation for young people.

Comment by John_Maxwell (John_Maxwell_IV) on Open Problems with Myopia · 2021-03-12T03:20:54.176Z · LW · GW

We present a useful toy environment for reasoning about deceptive alignment. In this environment, there is a button. Agents have two actions: to press the button or to refrain. If the agent presses the button, they get +1 reward for this episode and -10 reward next episode. One might note a similarity with the traditional marshmallow test of delayed gratification.

Are you sure that "episode" is the word you're looking for here?

https://www.quora.com/What-does-the-term-“episode”-mean-in-the-context-of-reinforcement-learning-RL

I'm especially confused because you switched to using the word "timestep" later?

Having an action which modifies the reward on a subsequent episode seems very weird. I don't even see it as being the same agent across different episodes.

Also...

Suppose instead of one button, there are two. One is labeled "STOP," and if pressed, it would end the environment but give the agent +1 reward. The other is labeled "DEFERENCE" and, if pressed, gives the previous episode's agent +10 reward but costs -1 reward for the current agent.

Suppose that an agent finds itself existing. What should it do? It might reason that since it knows it already exists, it should press the STOP button and get +1 utility. However, it might be being simulated by its past self to determine if it is allowed to exist. If this is the case, it presses the DEFERENCE button, giving its past self +10 utility and increasing the chance of its existence. This agent has been counterfactually mugged into deferring.

I think as a practical matter, the result depends entirely on the method you're using to solve the MDP and the rewards that your simulation delivers.
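To illustrate with the one-button version (a toy sketch of my own; the agent, environment, and numbers are all made up, not their setup): a one-step epsilon-greedy learner keeps pressing if the -10 is delivered in the following episode, because the penalty gets credited to whatever action happens to come next; fold the -10 into the same episode's return and it learns to refrain.

    import random

    def run(fold_penalty_into_same_episode, episodes=5000, eps=0.1, lr=0.1):
        # One-step "episodes": action 0 = refrain (0 reward), action 1 = press (+1 now, -10 later).
        q = [0.0, 0.0]
        pending_penalty = 0.0
        for _ in range(episodes):
            a = random.randrange(2) if random.random() < eps else (1 if q[1] > q[0] else 0)
            r = 1.0 if a == 1 else 0.0
            if fold_penalty_into_same_episode:
                r += -10.0 if a == 1 else 0.0
            else:
                r += pending_penalty                  # penalty lands now, credited to *this* action
                pending_penalty = -10.0 if a == 1 else 0.0
            q[a] += lr * (r - q[a])
        return q

    print(run(False))  # cross-episode penalty: q[1] ends up above q[0], so the agent keeps pressing
    print(run(True))   # same-episode penalty: q[1] ends up clearly worse, so the agent refrains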

Comment by John_Maxwell (John_Maxwell_IV) on Borasko's Shortform · 2021-03-10T09:13:10.398Z · LW · GW

lsusr had an interesting idea of creating a new YouTube account and explicitly training the recommendation system to recommend particular videos (in his case, music): https://www.lesswrong.com/posts/wQnJ4ZBEbwE9BwCa3/personal-experiment-one-year-without-junk-media

I guess you could also do it for YouTube channels which are informative & entertaining, e.g. CGP Grey and Veritasium. I believe studies have found that laughter tends to be rejuvenating, so optimizing for videos you think are funny is another idea.

Comment by John_Maxwell (John_Maxwell_IV) on Willa's Shortform · 2021-03-10T09:08:35.058Z · LW · GW

I suspect you will be most successful at this if you get in the habit of taking breaks away from your computer when you inevitably start to flag mentally. Some that have worked for me include: going for a walk, talking to friends, taking a nap, reading a magazine, juggling, noodling on a guitar, or just daydreaming.

Comment by John_Maxwell (John_Maxwell_IV) on A Semitechnical Introductory Dialogue on Solomonoff Induction · 2021-03-05T07:00:14.838Z · LW · GW

...When we can state code that would solve the problem given a hypercomputer, we have become less confused. Once we have the unbounded solution we understand, in some basic sense, the kind of work we are trying to perform, and then we can try to figure out how to do it efficiently.

ASHLEY: Which may well require new insights into the structure of the problem, or even a conceptual revolution in how we imagine the work we're trying to do.

I'm not convinced your chess example, where the practical solution resembles the hypercomputer one, is representative. One way to sort a list using a hypercomputer is to try every possible permutation of the list until we discover one which is sorted. I tend to see Solomonoff induction as being cartoonishly wasteful in a similar way.
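For concreteness, the hypercomputer sort I have in mind (obviously not something you'd ever run on large inputs):

    from itertools import permutations

    def hypercomputer_sort(xs):
        # Cartoonishly wasteful: enumerate every permutation and return the first sorted one.
        for perm in permutations(xs):
            if all(a <= b for a, b in zip(perm, perm[1:])):
                return list(perm)

    print(hypercomputer_sort([3, 1, 2]))  # [1, 2, 3]

The unbounded solution exists, but it doesn't obviously share structure with any efficient solution.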

Comment by John_Maxwell (John_Maxwell_IV) on Why GPT wants to mesa-optimize & how we might change this · 2021-03-04T20:01:17.970Z · LW · GW

From a safety standpoint, hoping and praying that SGD won't stumble across lookahead doesn't seem very robust, if lookahead represents a way to improve performance. I imagine that whether SGD stumbles across lookahead will end up depending on complicated details of the loss surface that's being traversed.

Comment by John_Maxwell (John_Maxwell_IV) on John_Maxwell's Shortform · 2021-03-04T09:00:41.587Z · LW · GW

Lately I've been examining the activities I do to relax and how they might be improved. If you haven't given much thought to this topic, Meaningful Rest is excellent background reading.

An interesting source of info for me has been lsusr's posts on cutting out junk media: 1, 2, 3. Although I find lsusr's posts inspiring, I'm not sure I want to pursue the same approach myself. lsusr says: "The harder a medium is to consume (or create, as applicable) the smarter it makes me." They responded to this by cutting all the easy-to-consume media out of their life.

But when I relax, I don't necessarily want to do something hard. I want to do something which rejuvenates me. (See "Meaningful Rest" post linked previously.)

lsusr's example is inspiring in that it seems they got themselves studying things like quantum field theory for fun in their spare time. But they also noted that "my productivity at work remains unchanged", and ended up abandoning the experiment 9 months in "due to multiple changes in my life circumstances". Personally, when I choose to work on something, I usually expect it to be at least 100x as good a use of my time as random productive-seeming stuff like studying quantum field theory. So given a choice, I'd often rather my breaks rejuvenate me a bit more per minute of relaxation, so I can put more time and effort into my 100x tasks, than have the break be slightly useful on its own.

To adopt a different frame... I'm a fan of the wanting/liking/approving framework from this post.

  • In some sense, +wanting breaks are easy to engage in because it doesn't require willpower to get yourself to do them. But +wanting breaks also tend to be compulsive, and that makes them less rejuvenating (example: arguing online).

  • My point above is that, when it comes to a break's external (non-rejuvenating) effects, I should mostly ignore the +approving or -approving factor.

  • It seems like the ideal break is +liking, and enough +wanting that it doesn't require willpower to get myself to do it, and once I get started I can disconnect for hours and be totally engrossed, but not so +wanting that I will be tempted to do it when I should be working or keep doing it late into the night. I think playing the game Civilization might actually meet these criteria for me? I'm not as hooked on it as I used to be, but I still find it easy to get engrossed for hours.

Interested to hear if anyone else wants to share their thinking around this or give examples of breaks which meet the above criteria.

Comment by John_Maxwell (John_Maxwell_IV) on Weighted Voting Delenda Est · 2021-03-04T08:55:32.001Z · LW · GW

Good to know! I was thinking the application process would be very transparent and non-demanding, but maybe it's better to ditch it altogether.

Comment by John_Maxwell (John_Maxwell_IV) on John_Maxwell's Shortform · 2021-03-04T08:35:52.171Z · LW · GW

Related to the earlier discussion of weighted voting allegedly facilitating groupthink: https://www.lesswrong.com/posts/kxhmiBJs6xBxjEjP7/weighted-voting-delenda-est

An interesting litmus test for groupthink might be: What has LW changed its collective mind about? By that I mean: the topic was discussed on LW, there was a particular position on the issue that was held by the majority of users, new evidence/arguments came in, and now there's a different position which is held by the majority of users. I'm a bit concerned that nothing comes to mind which meets these criteria? I'm not sure it has much to do with weighted voting because I can't think of anything from LW 1.0 either.

Comment by John_Maxwell (John_Maxwell_IV) on Weighted Voting Delenda Est · 2021-03-04T08:30:02.666Z · LW · GW

Makes sense, thanks.

Comment by John_Maxwell (John_Maxwell_IV) on Weighted Voting Delenda Est · 2021-03-03T21:15:47.185Z · LW · GW

For whatever it's worth, I believe I was the first to propose weighted voting on LW, and I've come to agree with Czynski that this is a big downside. Not necessarily enough to outweigh the upsides, and probably insufficient to account for all the things Czynski dislikes about LW, but I'm embarrassed that I didn't foresee it as a potential problem. If I was starting a new forum today, I think I'd experiment with no voting at all -- maybe try achieving quality control by having an application process for new users? Does anyone have thoughts about that?

Comment by John_Maxwell (John_Maxwell_IV) on Takeaways from one year of lockdown · 2021-03-03T20:43:08.190Z · LW · GW

Another possible AI parallel: Some people undergo a positive feedback loop where more despair leads to less creativity, less creativity leads to less problem-solving ability (e.g. P100 thing), less problem-solving ability leads to a belief that the problem is impossible, and a belief that the problem is impossible leads to more despair.

Comment by John_Maxwell (John_Maxwell_IV) on Book review: The Geography of Thought · 2021-03-03T09:27:43.487Z · LW · GW

China's government is more involved to large-scale businesses.

According to the World Economic Forum website:

China is home to 109 corporations listed on the Fortune Global 500 - but only 15% of those are privately owned.

https://www.weforum.org/agenda/2019/05/why-chinas-state-owned-companies-still-have-a-key-role-to-play/

Comment by John_Maxwell (John_Maxwell_IV) on Tournesol, YouTube and AI Risk · 2021-02-13T22:29:38.588Z · LW · GW

Like, maybe depending on the viewer history, the best video to polarize the person is different, and the algorithm could learn that. If you follow that line of reasoning, the system starts to make better and better models of human behavior and how to influence them, without having to "jump out of the system" as you say.

Makes sense.

...there's a lot of content on YouTube about YouTube, so it could become "self-aware" in the sense of understanding the system in which it is embedded.

I think it might be useful to distinguish between being aware of oneself in a literal sense, and the term "self-aware" as it is used colloquially / the connotations the term sneaks in.

Some animals, if put in front of a mirror, will understand that there is some kind of moving animalish thing in front of them. The ones that pass the mirror test are the ones that realize that moving animalish thing is them.

There is a lot of content on YouTube about YouTube, so the system will likely become aware of itself in a literal sense. That's not the same as our colloquial notion of "self-awareness".

IMO, it'd be useful to understand the circumstances under which the first one leads to the second one.

My guess is that it works something like this. In order to survive and reproduce, evolution has endowed most animals with an inborn sense of self, to achieve self-preservation. (This sense of self isn't necessary for cognition--if you trip on psychedelics and experience ego death, your brain can still think. Occasionally people will hurt themselves in this state since their self-preservation instincts aren't functioning as normal.)

Colloquial "self-awareness" occurs when an animal looking in the mirror realizes that the thing in the mirror and its inborn sense of self are actually the same thing. Similar to Benjamin Franklin realizing that lightning and electricity are actually the same thing.

If this story is correct, we need not worry much about the average ML system developing "self-awareness" in the colloquial sense, since we aren't planning to endow it with an inborn sense of self.

That doesn't necessarily mean I think Predict-O-Matic is totally safe. See this post I wrote for instance.

Comment by John_Maxwell (John_Maxwell_IV) on Tournesol, YouTube and AI Risk · 2021-02-13T10:47:42.604Z · LW · GW

I suspect the best way to think about the polarizing political content thing which is going on right now is something like: The algorithm knows that if it recommends some polarizing political stuff, there's some chance you will head down a rabbit hole and watch a bunch more vids. So in terms of maximizing your expected watch time, recommending polarizing political stuff is a good bet. "Jumping out of the system" and noticing that recommending polarizing videos also polarizes society as a whole and gets people to spend more time on YouTube on a macro level seems to require a different sort of reasoning.

For the stock thing, I think it depends on how the system is scored. When training a supervised machine learning model, we score potential models based on how well they predict past data -- data the model itself has no way to affect (except if something really weird is going on?) There doesn't seem to be much incentive to select a model that makes self-fulfilling prophecies. A model which ignores the impact of its "prophecies" will score better, insofar as the prophecy would've affected the outcome.
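In code, the point is just that the score is computed against frozen historical labels, so there's no term through which a self-fulfilling prediction could earn extra credit. A minimal sketch (names and setup are hypothetical, not any particular library's API):

    import numpy as np

    def score(model, X_past, y_past):
        # y_past is frozen historical data; nothing the model predicts can rewrite it,
        # so a "self-fulfilling" prediction gets no extra credit from this score.
        preds = model(X_past)
        return -float(np.mean((preds - y_past) ** 2))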

I'm not necessarily saying there isn't a concern here, I just think step 1 is to characterize the problem precisely.

Comment by John_Maxwell (John_Maxwell_IV) on Making Vaccine · 2021-02-13T03:06:48.860Z · LW · GW

Fixed twitter link

Comment by John_Maxwell_IV on [deleted post] 2021-01-26T00:01:25.724Z

Not sure if this answers your question, but the book Superforecasting explains, among other things, that probabilistic thinkers tend to make better forecasts.

Comment by John_Maxwell (John_Maxwell_IV) on A few thought on the inner ring · 2021-01-24T22:57:30.149Z · LW · GW

Yes, I didn't say "they are not considering that hypothesis", I am saying "they don't want to consider that hypothesis". Those do indeed imply very different actions. I think one gives very naturally rise to producing counterarguments, the other one does not.

They don't want to consider the hypothesis, and that's why they'll spend a bunch of time carefully considering it and trying to figure out why it is flawed?

In any case... Assuming the Twitter discussion is accurate, some people working on AGI have already thought about the "alignment is hard" position (since those expositions are how they came to work on AGI). But they don't think the "alignment is hard" position is correct -- it would be kinda dumb to work on AGI carelessly if you thought that position was correct. So it seems to be a matter of considering the position and deciding it is incorrect.

I am not really sure what you mean by the second paragraph. AI is being actively regulated, and there are very active lobbying efforts on behalf of the big technology companies, producing large volumes of arguments for why AI is nothing you have to worry about.

That's interesting, but it doesn't seem that any of the arguments they've made have reached LW or the EA Forum -- let me know if I'm wrong. Anyway I think my original point basically stands -- from the perspective of EA cause prioritization, the incentives to dismantle/refute flawed arguments for prioritizing AI safety are pretty diffuse. (True for most EA causes -- I've long maintained that people should be paid to argue for unincentivized positions.)

Comment by John_Maxwell (John_Maxwell_IV) on A few thought on the inner ring · 2021-01-24T08:20:57.322Z · LW · GW

What? What about all the people who prefer to do fun research that builds capabilities and has direct ways to make them rich, without having to consider the hypothesis that maybe they are causing harm?

If they're not considering that hypothesis, that means they're not trying to think of arguments against it. Do we disagree?

I agree that if the government were seriously considering regulation of AI, the AI industry would probably lobby against this. But that's not the same question. From a PR perspective, just ignoring critics often seems to be a good strategy.