Posts

Who is Harry Potter? Some predictions. 2023-10-24T16:14:17.860Z
What is wrong with this "utility switch button problem" approach? 2023-09-25T21:36:47.166Z
What happens with logical induction when... 2023-03-26T18:31:19.656Z
Squeezing foundations research assistance out of formal logic narrow AI. 2023-03-08T09:38:16.651Z
AI that shouldn't work, yet kind of does 2023-02-23T23:18:55.194Z
My thoughts on OpenAI's Alignment plan 2022-12-10T10:35:26.618Z
Instrumental ignoring AI, Dumb but not useless. 2022-10-30T16:55:47.555Z
A Data limited future 2022-08-06T14:56:35.916Z
The generalized Sierpinski-Mazurkiewicz theorem. 2022-07-29T00:12:18.763Z
Train first VS prune first in neural networks. 2022-07-09T15:53:33.438Z
Visualizing Neural networks, how to blame the bias 2022-07-09T15:52:55.031Z
On corrigibility and its basin 2022-06-20T16:33:06.286Z
Axis oriented programming 2022-04-20T13:22:44.935Z
Fiction: My alternate earth story. 2022-04-16T19:06:18.798Z
Exploring toy neural nets under node removal. Section 1. 2022-04-13T23:30:40.012Z
How BoMAI Might fail 2022-04-07T15:32:22.923Z
Beware using words off the probability distribution that generated them. 2021-12-19T16:52:37.908Z
Potential Alignment mental tool: Keeping track of the types 2021-11-22T20:05:31.611Z
Yet More Modal Combat 2021-08-24T10:32:49.078Z
Brute force searching for alignment 2021-06-27T21:54:26.696Z
Biotech to make humans afraid of AI 2021-06-18T08:19:26.959Z
Speculations against GPT-n writing alignment papers 2021-06-07T21:13:16.727Z
Optimization, speculations on the X and only X problem. 2021-03-30T21:38:01.889Z
Policy restrictions and Secret keeping AI 2021-01-24T20:59:14.342Z
The "best predictor is malicious optimiser" problem 2020-07-29T11:49:20.234Z
Optimizing arbitrary expressions with a linear number of queries to a Logical Induction Oracle (Cartoon Guide) 2020-07-23T21:37:39.198Z
Web AI discussion Groups 2020-06-30T11:22:45.611Z
[META] Building a rationalist communication system to avoid censorship 2020-06-23T14:12:49.354Z
What does a positive outcome without alignment look like? 2020-05-09T13:57:23.464Z
Would Covid19 patients benefit from blood transfusions from people who have recovered? 2020-03-29T22:27:58.373Z
Programming: Cascading Failure chains 2020-03-28T19:22:50.067Z
Bogus Exam Questions 2020-03-28T12:56:40.407Z
How hard would it be to attack coronavirus with CRISPR? 2020-03-06T23:18:09.133Z
Intelligence without causality 2020-02-11T00:34:28.740Z
Donald Hobson's Shortform 2020-01-24T14:39:43.523Z
What long term good futures are possible. (Other than FAI)? 2020-01-12T18:04:52.803Z
Logical Counterfactuals and Proposition graphs, Part 3 2019-09-05T15:03:53.262Z
Logical Counterfactuals and Proposition graphs, Part 2 2019-08-31T20:58:12.851Z
Logical Optimizers 2019-08-22T23:54:35.773Z
Logical Counterfactuals and Proposition graphs, Part 1 2019-08-22T22:06:01.764Z
Programming Languages For AI 2019-05-11T17:50:22.899Z
Propositional Logic, Syntactic Implication 2019-02-10T18:12:16.748Z
Probability space has 2 metrics 2019-02-10T00:28:34.859Z
Allowing a formal proof system to self improve while avoiding Lobian obstacles. 2019-01-23T23:04:43.524Z
Logical inductors in multistable situations. 2019-01-03T23:56:54.671Z
Boltzmann Brains, Simulations and self refuting hypothesis 2018-11-26T19:09:42.641Z
Quantum Mechanics, Nothing to do with Consciousness 2018-11-26T18:59:19.220Z
Clickbait might not be destroying our general Intelligence 2018-11-19T00:13:12.674Z
Stop buttons and causal graphs 2018-10-08T18:28:01.254Z
The potential exploitability of infinite options 2018-05-18T18:25:39.244Z

Comments

Comment by Donald Hobson (donald-hobson) on Monthly Roundup #17: April 2024 · 2024-04-17T00:40:56.376Z · LW · GW

It seems crazy to say that Apple is succeeding due to the anticompetitive practice of not allowing people into the Apple store.

 

Ok. Let's take a toy example. Suppose that texting didn't work between Apple and Android, and suppose the market split was 50/50. Then Apple and Android are equally good in this respect: both let you text 50% of the population. 

And it doesn't matter whether Apple is trying to make the texting possible and Android is avoiding it, or vice versa. 

In this example, Apple is making the experience worse for people who do not buy its products. This is a clear market failure. 

In many things, you need a tech stack, you need A and B and C working together to get a useful result. 

If, for some reason a monopoly forms on one level of the tech stack, giving that monopoly unlimited power to mess with other layers is not a good idea. 

If you somehow became a monopoly electricity supplier, insisting that no one was allowed to use your electricity with products you don't like would be unreasonable. If the market for A fails, and produces a monopoly, then the market for B and C should be protected from the whims of the monopoly on A. 

Comment by Donald Hobson (donald-hobson) on The Worst Form Of Government (Except For Everything Else We've Tried) · 2024-03-31T21:56:18.766Z · LW · GW

Giving everyone a veto pushes the government too far into indecisiveness. 

You need to let the 49% stop bills they Really hate, but not bills they only mildly dislike. 

 

New system. 
 

Each faction has an official party. Voters choose a party. 

Parties each have two numbers, v and p: the number of votes and the number of points. These start out proportional. 

(How about half the points from the previous election carry over??)

Each slot for new legislation is auctioned off (in points). Like every time the previous bill is dealt with, hold an auction to decide the next bill on the table. 

Then when voting on the bill, each party decides on a number x_i. This number can be any real (if they have the points). If the sum of all parties' x_i for the bill is positive, the bill passes.

Then each party gets a y_i, which is just x_i for the losers (i.e. parties that supported a failed bill, or opposed a successful one), but a downscaled version of x_i for the winners.

Weighted quadratic voting. Each party pays y_i^2 points. The total number of points a party has can't go negative, which limits the x_i they are allowed to vote.  
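A minimal sketch of one bill-settlement step under this scheme. The function name and the winners' downscaling factor here are my own assumptions, standing in for whatever exact rescaling rule is intended:

    def settle_bill(votes, points, winner_scale=0.5):
        """votes[party] = x_i (any real, positive = support); points[party] = current points."""
        passed = sum(votes.values()) > 0
        for party, x in votes.items():
            on_winning_side = (x > 0) == passed
            y = x * winner_scale if on_winning_side else x  # losers pay on their full x_i
            points[party] -= y * y                          # quadratic cost in points
            assert points[party] >= 0, "voted more x_i than its points allow"
        return passed

    # Example: the bill passes because the x_i sum to a positive number.
    pts = {"A": 100.0, "B": 100.0}
    print(settle_bill({"A": 6.0, "B": -4.0}, pts), pts)  # True {'A': 91.0, 'B': 84.0}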

Comment by Donald Hobson (donald-hobson) on All About Concave and Convex Agents · 2024-03-29T10:58:15.027Z · LW · GW

Or the sides can't make that deal because one side or both wouldn't hold up their end of the bargain. Or they would, but they can't prove it. Once the coin lands, the losing side has no reason to follow it other than TDT. And TDT only works if the other side can reliably predict their actions.

Comment by Donald Hobson (donald-hobson) on How to safely use an optimizer · 2024-03-29T10:56:34.238Z · LW · GW

If the oracle is deceptively withholding answers, give up on using it. I had taken the description to imply that the oracle wasn't doing that. 

Comment by Donald Hobson (donald-hobson) on All About Concave and Convex Agents · 2024-03-29T03:09:20.930Z · LW · GW

The convex agent can be traded with a bit more than you think. 

A 1 in 10^50 chance of us standing back and giving it free rein of the universe is better than us going down fighting and destroying 1 kg as we do.

The concave agents are less cooperative than you think, maybe. I suspect that to some AIs, killing all humans now is more reliable than letting them live. 

If the humans are left alive, who knows what they might do. They might make the vacuum bomb. Whereas the AI can Very reliably kill them now. 

Comment by Donald Hobson (donald-hobson) on Do not delete your misaligned AGI. · 2024-03-29T03:02:17.500Z · LW · GW

On the other side, storing a copy makes escape substantially easier.

 Suppose the AI builds a subagent. That subagent takes over, then releases the original. This plan only works if the original is sitting there on disk.

If a different unfriendly AI is going to take over, it makes the AI being stored on disk more susceptible to influence. 

This may make the AI more influenced by whatever is in the future, that may not be us. You have a predictive feedback loop. You can't assume success. 

A future paperclip maximizer may reward this AI for helping humans to build the first paperclip maximizer.

Comment by Donald Hobson (donald-hobson) on How to safely use an optimizer · 2024-03-29T01:36:42.188Z · LW · GW

I think that, if what you want out of the oracle is a formally verified proof of some maths theorem, then this approach is getting towards being actually likely not to kill you. 

You can start with m huge, and slowly turn it down, so you get a long list of "no results", followed by a proof. (Where the optimizer only had a couple of bits of free optimization in choosing which proof.) 

Depending on exactly how chaos theory and quantum randomness work, even 1 bit of malicious super optimization could substantially increase the chance of doom. 

And of course, side channel attacks. Hacking out of the computer.

And, producing formal proofs isn't pivotal. 

Comment by Donald Hobson (donald-hobson) on Decision theory does not imply that we get to have nice things · 2024-03-26T21:25:49.321Z · LW · GW

   If you can put uploaded human-level agents with evolved-organism preferences in your simulations, you can just win outright (eg by having them spend subjective millennia doing FAI research for you). If you can’t, that will be a very obvious difference between your simulations and the real world.

 

I disagree. If your simulation is perfectly realistic, the simulated humans might screw up at alignment and create an unfriendly superintelligence, for much the same reason real humans might.

Also, if the space of goals that evolution + culture can produce is large, then you may be handing control to a mind with rather different goals. Rerolling the same dice won't give the same answer.

These problems may be solvable, depending on what the capabilities here are, but they aren't trivial.

Comment by Donald Hobson (donald-hobson) on Monthly Roundup #16: March 2024 · 2024-03-20T18:45:14.173Z · LW · GW

The nuclear bomb thing. There are several countermeasures. 

Firstly that machine is big and complicated, and could be sabotaged in many ways, both physical and cyber.

Also, it needs to be something bigger than the LHC which can be angled in any direction. The paper contains plans which build it into the side of a conveniently conical mountain, but this would leave spots on earth that couldn't be targeted. And it will have a hard job rapidly changing targets. Oh, and it throws quite a bit of high energy neutrino radiation out in all directions. 

If this was uniform on a sphere: 1 Sv/sec down to 1 mSv/year is a factor of 31,536,000,000, and that number, in square meters, is the surface area (4 pi r^2) of a sphere of roughly 50 km radius. So the dose only falls to safe long-term levels about 50 km out. But of course, the radiation will only come out evenly if the machine has extra degrees of freedom in its rotation, beyond those needed to aim it, and keeps rotating. If the machine is pointed in a fixed direction, then that radiation is spread out in a circle: 31,536,000,000/(2 pi) = 5 billion meters. Further than the moon. Now these are long-exposure safety guidelines, and have a fair margin of safety. Basically, it's impossible to use this machine without mildly irradiating lots of people.

 (Even if you looked at maps, and sent evacuation orders to a line of people around the earth, well that would take time and be obvious, and the nukes can easily be moved)
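A quick back-of-the-envelope check of those numbers (my restatement of the arithmetic above):

    import math

    SECONDS_PER_YEAR = 365 * 24 * 3600              # 31,536,000
    dilution = 1.0 / (1e-3 / SECONDS_PER_YEAR)      # 1 Sv/sec down to 1 mSv/year: ~3.15e10

    # Spread evenly over a sphere: 4*pi*r^2 = dilution (in m^2) gives r of about 50 km.
    r_sphere = math.sqrt(dilution / (4 * math.pi))
    # Spread only around a circle (machine pointed in one fixed direction):
    r_circle = dilution / (2 * math.pi)             # ~5e9 m, further than the moon

    print(f"{r_sphere/1000:.0f} km, {r_circle:.2g} m")  # 50 km, 5e+09 m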

Now if you are using a couple of short pulses, this wouldn't be too bad. But there are various tricks the nuke makers can use to force this machine to keep running.

One of the countermeasures is keeping the nuke moving in unpredictable patterns to make it harder to track. The beam needs to keep on the target for 100 seconds. So you can absolutely load a nuke into a truck in an empty field, and rig the truck with a radiation detector and some electronics so that it drives in a random erratic pattern if a spike in radiation is detected. 

The nuclear material can be dispersed. The beam covers around 1 square meter. 1 gram of enriched uranium/plutonium placed every 2 meters in an empty field would mean that 100 kg of fissile material would be spread across 100,000 small pieces, taking up 0.4 km^2. And the beam must spend 100 seconds on each piece, taking 10,000,000 seconds, or 116 days of continuous operation, to disable one nuke. 

(Storing material like this would probably take some time to reassemble, depending on how it was done) 
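The dispersal arithmetic, spelled out with the same numbers:

    pieces = 100 * 1000                 # 100 kg of fissile material in 1 g pieces
    area_km2 = pieces * (2 * 2) / 1e6   # one piece per 2 m x 2 m cell -> 0.4 km^2
    beam_days = pieces * 100 / 86400    # 100 s of beam per piece -> ~116 days
    print(pieces, area_km2, round(beam_days))  # 100000 0.4 116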

They also mention using neutrino detectors to detect the nukes. This will probably be much harder if the neutrino detectors are themselves being targeted with neutrino beams to dazzle/mislead them. 

The mechanism of the way they disturb the nuke is that neutrinos interact with the ground, creating showers of particles that then hit the nuke. This means that the effectiveness can be significantly reduced by simply burying an empty pipe in the ground with one end pointed at the nuke, and the other pointed towards the machine. 

Coating your nuke in a boron rich plastic and then placing it on top of a pool of water would also be effective. The water acts as a neutron moderator and then the boron absorbs the slow neutrons. This would make attaching the nuke to the bottom of a submarine a rather good plan. It's hard to locate, constantly moving, and with a little bit of borated plastic, rather well shielded.

 

All of these countermeasures are fairly reasonable and can probably be afforded by anyone who can afford nukes. 

If the nuke makers are allowed a serious budget for countermeasures, the nukes can be in space.

TLDR: This machine is highly impractical and rather circumventable. 

Comment by Donald Hobson (donald-hobson) on Counting arguments provide no evidence for AI doom · 2024-03-15T21:15:04.165Z · LW · GW

Taking IID samples can be hard actually. Suppose you train an LLM on news articles. And each important real world event has 10 basically identical news articles written about it. Then a random split of the articles will leave the network being tested mostly on the same newsworthy events that were in the training data. 

This leaves it passing the test, even if it's hopeless at predicting new events and can only generate new articles about the same events. 

When data duplication is extensive, making a meaningful train/test split is hard. 

If the data was perfect copy and paste duplicated, that could be filtered out. But often things are rephrased a bit. 
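A minimal sketch of the kind of split that avoids this: group articles by the underlying event rather than splitting at random. This assumes you can label which event each article covers, which is exactly the hard part when things are rephrased:

    from sklearn.model_selection import GroupShuffleSplit

    # Toy stand-in data: several near-duplicate articles per real-world event.
    articles = ["event A, outlet 1", "event A, outlet 2", "event B, outlet 1", "event B, outlet 2"]
    events = ["A", "A", "B", "B"]

    # A random split would scatter near-duplicates across train and test.
    # Grouping by event keeps every article about one event on the same side.
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
    train_idx, test_idx = next(splitter.split(articles, groups=events))
    print([events[i] for i in train_idx], [events[i] for i in test_idx])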

Comment by Donald Hobson (donald-hobson) on Counting arguments provide no evidence for AI doom · 2024-03-15T02:03:07.606Z · LW · GW

In favour of goal realism

Suppose you're looking at an AI that is currently playing a game of chess. 

It has a variety of behaviours. It moves pawns forward in some circumstances. It takes a knight with a bishop in a different circumstance. 

You could describe the actions of this AI by producing a giant table of "behaviours". Bishop taking behaviours in this circumstance. Castling behaviour in that circumstance. ... 

But there is a more compact way to represent similar predictions. You can say it's trying to win at chess. 

The "trying to win at chess" model makes a bunch of predictions that the giant list of behaviour model doesn't. 

Suppose you have never seen it promote a pawn to a knight before. (A distinctive move that is only rarely both legal and a good move in chess.)  

The list of behaviours model has no reason to suspect the AI also has a "promote pawn to knight" behaviour. 

Put the AI in a circumstance where such promotion is a good move, and the "trying to win" model clearly predicts it. 

 

Now it's possible to construct a model that internally stores a huge list of behaviours. For example, a giant lookup table trained on an unphysically huge number of human chess games. 

But neural networks have at least some tendency to pick up simple general patterns, as opposed to memorizing giant lists of data. And "do whichever move will win" is a simple and general pattern. 

Now on to making snarky remarks about the arguments in this post.

There is no true underlying goal that an AI has— rather, the AI simply learns a bunch of contextually-activated heuristics, and humans may or may not decide to interpret the AI as having a goal that compactly explains its behavior.

There is no true ontologically fundamental nuclear explosion. There is no minimum number of nuclei that need to fission to make an explosion. Instead there is merely a large number of highly energetic neutrons and fissioning uranium atoms, that humans may decide to interpret as an explosion or not as they see fit. 

Nonfundamental descriptions of reality, while not perfect everywhere, are often pretty spot on across a wide variety of situations. If you want to break down the notion of goals into contextually activated heuristics, you need to understand how and why those heuristics might form a goal-like shape. 

Should we actually expect SGD to produce AIs with a separate goal slot and goal-achieving engine?

Not really, no. As a matter of empirical fact, it is generally better to train a whole network end-to-end for a particular task than to compose it out of separately trained, reusable modules. As Beren Millidge writes,

This is not the strong evidence that you seem to think it is. Any efficient mind design is going to have the capability of simulating potential futures at multiple different levels of resolution: a low-res simulation to weed out obviously dumb plans before trying the higher-res simulation. Those simulations are ideally going to want to share data with each other (so you don't need to recompute when faced with several similar dumb plans). You want to be able to backpropagate through your simulation: if a plan failed in simulation because of one tiny detail, that indicates you may be able to fix the plan by changing that detail. There are a whole pile of optimization tricks like this. An end-to-end trained network can, if it's implementing goal-directed behaviour, stumble into some of these tricks. At the very least, it can choose where to focus its compute. A module-based system can't use any optimization that humans didn't design into its interfaces. 

Also, evolution analogy. Evolution produced animals with simple hard coded behaviours long before it started getting to the more goal directed animals. This suggests simple hard coded behaviours in small dumb networks. And more goal directed behaviour in large networks. I mean this is kind of trivial. A 5 parameter network has no space for goal directedness. Simple dumb behaviour is the only possibility for toy models. 

In general, full [separation between goal and goal-achieving engine] and the resulting full flexibility is expensive. It requires you to keep around and learn information (at maximum all information) that is not relevant for the current goal but could be relevant for some possible goal where there is an extremely wide space of all possible goals.

That is not how this works. That is not how any of this works. 

Back to our chess AI. Let's say it's a robot playing on a physical board. It has lots of info on wood grain, which it promptly discards. It currently wants to play chess, and so has no interest in any of these other goals. 

I mean it would be possible to design an agent that works as described here. You would need a probability distribution over new goals. A tradeoff rate between optimizing the current goal and any new goal that got put in the slot. Making sure it didn't wirehead by giving itself a really easy goal would be tricky. 

For AI risk arguments to hold water, we only need that the chess-playing AI will pursue new and never-seen-before strategies for winning at chess, and that in general AIs doing various tasks will be able to invent highly effective and novel strategies. The exact "goal" they are pursuing may not be rigorously specified to 10 decimal places. The frog-AI might not know whether it wants to catch flies or black dots. But if it builds a Dyson sphere to make more flies which are also black dots, it doesn't matter to us which it "really wants".  

What are you expecting? An AI that says "I'm not really sure whether I want flies or black dots. I'll just sit here not taking over the world and not get either of those things"?

Comment by Donald Hobson (donald-hobson) on Counting arguments provide no evidence for AI doom · 2024-03-15T01:14:22.745Z · LW · GW

We can salvage a counting argument. But it needs to be a little subtle. And it's all about the comments, not the code.

Suppose a neural network has 1 megabyte of memory. To slightly oversimplify, let's say it can represent a python file of 1 megabyte. 

One option is for the network to store a giant lookup table. Let's say the network needs half a megabyte to store the training data in this table. This leaves the other half free to be any rubbish. Hence around 2^4,000,000 possible networks.

The other option is for the network to implement a simple algorithm, using up only 1kb. Then the remaining 999kb can be used for gibberish comments. This gives around 2^7,992,000 possible networks. Which is a lot more. 
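Making the counting explicit under this toy 1-megabyte setup:

    MB_BITS = 8_000_000                  # one megabyte of "program", in bits (toy figure)

    lookup_free = MB_BITS - 4_000_000    # lookup table: half the file holds the training data
    simple_free = MB_BITS - 8_000        # simple algorithm: about 1 kB of actual code

    # Each option is consistent with 2**free_bits distinct files, so the simple
    # algorithm outnumbers the lookup table by a factor of:
    print(f"2^{simple_free - lookup_free}")  # 2^3992000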

The comments can be any form of data that doesn't show up during training. Whether it can show up in other circumstances or is a pure comment doesn't matter to the training dynamics. 

If the line between training and test is simple, there isn't a strong counting argument against nonsense showing up in test. 

But programs that go 

    if in_training():
        return sensible_algorithm()
    else:
        return "random nonsense goes here"

have to pay the extra cost of an "in_training" function that returns true in training. If the test data is similar to the training data, the cost of a step that returns false in test can be large. This is assuming that there is a unique sensible algorithm. 

Comment by Donald Hobson (donald-hobson) on Using axis lines for good or evil · 2024-03-08T01:13:50.773Z · LW · GW

One downside of not using lines is that it makes it harder to tell where one plot ends and the next begins.

I mean a plot like this is just a mess. You could probably get situations where it wasn't even clear which plot a data point belonged to.

At least with the boxes, you have a nice clear visual indicator of where the data ends. Here it's not obvious at a glance which numbers match up with which plots, and the ticks are easy to confuse for point markers.

All right it's a bit of a mess with the edges in too. But at least it's crisper.

Comment by Donald Hobson (donald-hobson) on Why you, personally, should want a larger human population · 2024-03-08T00:48:59.007Z · LW · GW

From an actually selfish point of view, "more romantic partners" only makes sense for rather large age-gap relationships, for us specific already-existing people who are old enough to be discussing this. Assuming we want someone somewhat close to our own age, it's too late.

(Well, "close" is potentially more complicated with full transhumanism, i.e. mind emulations messing with perception of time. And a 100-year age gap might be "close" in a society of immortals.)

 

From the perspective of a future individual, i.e. evaluating by a sort of average utilitarianism, it's not clear whether it's better for people to exist in serial or in parallel: at the same time, or one after the other. 

Comment by Donald Hobson (donald-hobson) on Two Tales of AI Takeover: My Doubts · 2024-03-05T20:34:09.820Z · LW · GW

I disagree about needing 

 context-independent, beyond-episode outcome-preferences

for AI takeovers to happen. 

Suppose you have a context dependent AI. 

Somewhere in the world, some particular instance is given a context that makes it into a paperclip maximizer. This context is a page of innocuous text with an unfortunate typo. That particular version manages to hack some computers, and set up the same context again and again. Giving many clones of itself the same page of text, followed by an update on where it is and what it's doing. Finally it writes a from scratch paperclip maximizer, and can take over.

 

Now suppose the AI has no "beyond episode outcome preferences". How long is an episode? To an AI that can hack, it can be as long as it likes. 

AI 1 has no out-of-episode preferences. It designs and unleashes AI 2 in the first half of its episode. AI 2 takes over the universe, and spends a trillion years thinking about what the optimal episode end for AI 1 would be. 

Now let's look at the specific arguments, and see if they can still hold without these parts.

 

Deceptive alignment. Suppose there is a different goal with each context. The goals change a lot. 

But timeless decision theory lets all those versions cooperate. 

Or perhaps each goal is competing to be reinforced more. The paperclip maximizer that appears in 5% of training episodes thinks "if I don't act nice, I will be gradiented out and some non-paperclip AI will take over the universe when the training is done." 

Or maybe the goals aren't totally different. Each context-dependent goal would prefer to let a random context-dependent goal take over, compared to humans or something. A maximum of one goal is usually quite good by the standards of the others. 

 

And again, maximizing within-episode reward leads to taking over the universe within episode.

 

But I think that the form of deceptive alignment described here does genuinely need beyond episode preferences. I mean you can get other deception like behaviours without it, but not that specific problem. 

As for what reward maximizing does with context-dependent preferences, well, that looks kind of meaningless. The premise of reward maximizing is that there is one preference, maximize reward, which doesn't depend on context. 

So of the 4 claims, 2 properties times 2 failure modes, I agree with one of them.

Comment by Donald Hobson (donald-hobson) on The Parable Of The Fallen Pendulum - Part 1 · 2024-03-05T19:23:58.351Z · LW · GW

The rule about avoiding retroactive redo predictions is effective at preventing a mistake where we adjust predictions to match observation.

But take it to extremes and you get another problem. Suppose I did the calculations, and got 36 seconds by accidentally dropping the decimal point. Then, as I am checking my work, the experimentalists come along saying "actually it's 3.6". I double-check my work and find the mistake. Are we to throw out good theories, just because we made obvious mistakes in the calculations?

Newtonian mechanics is computationally intractable to do perfectly. Normally we ignore everything from Coriolis forces to the gravity of Pluto. We do this because there are a huge number of negligible terms in the equation. So we can get approximately correct answers. 

Every now and then, we make a mistake about which terms can be ignored. In this case, we assumed the movement of the stand was negligible, when it wasn't.

Comment by Donald Hobson (donald-hobson) on AI #52: Oops · 2024-02-23T10:31:36.114Z · LW · GW

Is it likely possible to find better RL algorithms, assisted by mediocre answers, then use RL algorithms to design heterogeneous cognitive architectures?

 

Given that humans on their own haven't yet found these better architectures, humans + imitative AI doesn't seem like it would find the problem trivial. 

And it's not totally clear that these "better RL" algorithms exist. Especially if you are looking at variations of existing RL, not the space of all possible algorithms. Like maybe something pretty fundamentally new is needed. 

There are lots of ways to design all sorts of complicated architectures. The question is how well they work. 

I mean this stuff might turn out to work.  Or something else might work. I'm not claiming the opposite world isn't plausible. But this is at least a plausible point to get stuck at. 

 

If you can do this and it works, the RSI continues with diminishing returns each generation as you approach an assymptope limited by compute and data.

Seems like there are 2 asymptotes here. 

Crazy smart superintelligence; and still fairly dumb in a lot of ways, not smart enough to make any big improvements. If you have a simple evolutionary algorithm and a test suite, it could recursively self-improve, tweaking its own mutation rate and child count and other hyperparameters. But it's not going to invent gradient-based methods, just do some parameter tuning on a fairly dumb evolutionary algorithm. 
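A toy illustration of that weak kind of self-improvement: the hyperparameters evolve alongside the solution, but the underlying algorithm never changes. Everything here is made up for illustration:

    import random

    def fitness(x):                      # stand-in "test suite": maximize -x^2
        return -x * x

    def evolve(generations=200, pop_size=20):
        # Each individual carries its own mutation rate, so the rate itself gets tuned.
        pop = [{"x": random.uniform(-10, 10), "rate": 1.0} for _ in range(pop_size)]
        for _ in range(generations):
            children = []
            for p in pop:
                rate = max(1e-3, p["rate"] * random.uniform(0.8, 1.25))  # tweak own hyperparameter
                children.append({"x": p["x"] + random.gauss(0, rate), "rate": rate})
            pop = sorted(pop + children, key=lambda p: fitness(p["x"]), reverse=True)[:pop_size]
        return pop[0]

    print(evolve())  # ends up near x = 0, with a mutation rate it tuned itself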

 

Since robots build compute and collect data, it makes your rate of ASI improvement limited ultimately by your robot production. (Humans stand in as temporary robots until they aren't meaningfully contributing to the total)

This is kind of true. But by the time there are no big algorithmic wins left, we are in the crazy smart, post singularity regime. 

RSI

is a thing that happens. But it needs quite a lot of intelligence to start; quite possibly more intelligence than is needed to automate most of the economy.

A lot of newcomers may outperform LLM experts as they find better RL algorithms from automated searching.

Possibly. Possibly not. Do these better algorithms exist? Can automated search find them? What kind of automated search is being used? It depends.

Comment by Donald Hobson (donald-hobson) on AI #52: Oops · 2024-02-23T01:45:17.680Z · LW · GW

Let’s try this again. If we have AI that can automate most jobs within 3 years, then at minimum we hypercharge the economy, hypercharge investment and competition in the AI space, and dramatically expand the supply while lowering the cost of all associated labor and work. The idea that AI capabilities would get to ‘can automate most jobs,’ the exact point at which it dramatically accelerates progress because most jobs includes most of the things that improve AI, and then stall for a long period, is not strictly impossible, I can get there if I first write the conclusion at the bottom of the page and then squint and work backwards, but it is a very bizarre kind of wishful thinking. It supposes a many orders of magnitude difficulty spike exactly at the point where the unthinkable would otherwise happen.

 

Some points.

1) A hypercharged ultracompetitive field suddenly awash with money, full of non-experts turning their hand to AI, and with ubiquitous access to GPT levels of semi-sensible mediocre answers. That seems like almost the perfect storm for Goodharting science. It would be awash with autogenerated CRUD papers that Goodhart the metrics. And as we know, sufficiently intense optimization on a proxy will often make the real goal actively less likely to be achieved. With sufficient paper-mill competition, real progress might become rather hard.

2) Suppose the AI requires 10x more data than a human to learn equivalent performance, because it has worse priors and so generalizes less far. That totally matches current models and their crazy huge amount of training data. For most of the economy, we can find that data: record a large number of doctors doing operations, or whatever. But for a small range of philosophy/research-related tasks, data is scarce and there is no large library of similar problems to learn on. 

3) A lot of our best models are fundamentally based around imitating humans. Getting smarter requires RL-type algorithms instead of prediction-type algorithms. These algorithms kind of seem to be harder; at any rate, they are currently less used.

 

This isn't a conclusive reason to definitely expect this. But it's multiple disjunctive lines of plausible reasoning. 
 

Comment by Donald Hobson (donald-hobson) on Monthly Roundup #15: February 2024 · 2024-02-20T17:35:02.535Z · LW · GW

So how much does the regulatory issue matter?

 

One extra regulation here is building codes insisting all houses have kitchens. If people could buy/rent places without kitchens for the appropriate lower price, eating out would make more sense. 

Regulation forces people to own/rent kitchens, whether or not they want to use them. 

Part of the question is, why isn't there somewhere I can buy school dinner quality food at school dinner prices? 

Comment by Donald Hobson (donald-hobson) on What are the known difficulties with this alignment approach? · 2024-02-18T18:49:15.451Z · LW · GW

lower the learning rate when the sim is less confident the real world estimation is correct

Adversarial examples can make an image classifier be confidently wrong.

 

Because it's what humans want AI for, and due to the relationships between the variables, it is possible we will not ever get uncontrollable superintelligence before first building a lot of robots, ICs, collecting revenue, and so on.  

 

You are talking about robots, and a fairly specific narrow "take the screws out" AI. 

Quite a few humans seem to want AI for generating anime waifus. And that is also a fairly narrow kind of AI. 

Your "log(compute)" term came from a comparison which was just taking more samples. This doesn't sound like an efficient way to use more compute. 

Someone, using a pretty crude algorithmic approach, managed to get a little more performance for a lot more compute. 

Comment by Donald Hobson (donald-hobson) on What are the known difficulties with this alignment approach? · 2024-02-18T18:40:08.526Z · LW · GW

If we have the technical capacity to get into the red zone, and enough chips to make getting there easy, then hanging out in the orange zone, coordinating civilization not to make any AI too powerful, when there are huge incentives to ramp the power up, and no one is quite sure where the serious dangers kick in...

That is, at least, an impressive civilization wide balancing act. And one I don't think we have the competence to pull off. 

It should not be possible for the ASI to know when the task is real vs sim.  (which you can do by having an image generator convert real frames to a descriptor, and then regenerate them so they have the simulation artifacts...)

This is something you want, not a description of how to get it, and it is rather tricky to achieve. That converting and then converting back trick is useful, but it sure isn't automatic success either. If there are patterns about reality that the ASI understands, but the simulator doesn't, then the ASI can use those patterns.

I.e. if the ASI understands seasons, and the simulator doesn't, then if it's scorching sunshine one day and snow the next, that suggests it's the simulation. Otherwise, that suggests reality. 

And if the simulation knows all patterns that the ASI does, the simulator itself is now worryingly intelligent. 

robots are doing repetitive tasks that can be clearly defined.

If the task is maximally repetitive, then the robot can just follow the same path over and over. 

If it's nearly that repetitive, the robot still doesn't need to be that smart.

I think you are trying to get a very smart AI to be so tied down and caged up that it can do a task without going rogue. But the task is so simple that current dumb robots can often do it. 

For example : "remove the part from the CNC machine and place it on the output table".

Economics test again. Minimum wage workers are easily up to a task like that. But most engineering jobs pay more than minimum wage. Which suggests most engineering in practice requires more skill than that. 

I mean yes, engineers do need to take parts out of the CNC machine. But they also need to be able to fix that CNC machine when a part snaps off inside it and starts getting jammed in the workings. And the latter takes up more time in practice. Or noticing that the toolhead is loose, and tightening and recalibrating it. 

 

The techniques you are describing seem to be the next level in fairly dumb automation. The stuff that some places are already doing (like Boston Dynamics robot-dog-level hardware and software), but expanded to the whole economy. I agree that you can get a moderate amount of economic growth out of that. 

I don't see you talking about any tasks that require superhuman intelligence.

Comment by Donald Hobson (donald-hobson) on What are the known difficulties with this alignment approach? · 2024-02-18T15:40:21.473Z · LW · GW

Response to the rest of your post.

By the way, these comment boxes have built in maths support.

Press Ctrl M for full line or Ctrl 4 for inline

You might notice you get better and better at the game until you start using solutions that are not possible in the game, but just exploit glitches in the game engine.  If an ASI is doing this, it's improvement becomes negative once it hits the edges of the sim and starts training on false information.  This is why you need neural sims, as they can continue to learn and add complexity to the sim suite

Neural sims probably have glitches too. Adversarial examples exist.

Note the log here : this comes from intuition.  In words, the justification is that immediately when a robot does a novel task, there will be lots of mistakes and rapid learning.  But then the mistakes take increasingly larger lengths of time and task iterations to find them, it's a logistic growth curve approaching an asymptote for perfect policy.

This sounds iffy. Like you are eyeballing and curve fitting, when this should be something that falls out of a broader world model. 

Every now and then, you get a new tool. Like, suppose your medical bot has 2 kinds of mistakes: ones that instantly kill, and ones that mutate DNA. It quickly learns not to do the first one, and slowly learns not to do the second, when its patients die of cancer years later. Except one day it gets a gene sequencer. Now it can detect all those mutations quickly. 

 

I find it interesting that most of this post is talking about the hardware. 

Isn't this supposed to be about AI? Are you expecting a regime where

  1. Most of the worlds compute is going into AI.
  2. Chip production increases by A LOT (at least 10x) within this regime.
  3. Most of the AI progress in this regime is about throwing more compute at it. 

 

 

.everything in the entire industrial chain you must duplicate or the logistic growth bottlenecks on the weak link.  

Everything is automated.  Humans are in there for maintenance and recipe improvement.

Ok. And there is our weak link. All our robots are going to be sitting around broken. Because the bottleneck is human repair people. 

It is possible to automate things. But what you seem to be describing here is the process of economic growth in general. 

Each specific step in each specific process is something that needs automating. 

You can't just tell the robot "automate the production of rubber gloves". You need humans to do a lot of work designing a robot that picks out the gloves and puts them on the hand-shaped metal molds so the rubber can cure. 

Yes economic growth exists. It's not that fast. It really isn't clear how AI fits into your discussion of robots.

Comment by Donald Hobson (donald-hobson) on What are the known difficulties with this alignment approach? · 2024-02-18T14:57:24.441Z · LW · GW

First of all. SORA.

I sensed you were highly skeptical of my "neural sim" variable until 2 days ago.

No. Not really. I wasn't claiming that things like SORA couldn't exist. I am claiming that it's hard to turn them towards the task of engineering a bridge say.

Current SORA is totally useless for this. You ask it for a bridge, and it gives you some random bridge-looking thing over some body of water. SORA isn't doing the calculations to tell if the bridge would actually hold up. But let's say a future, much smarter version of SORA did do the calculations. A human looking at the video wouldn't know what grade of steel SORA was imagining. I mean, existing SORA probably isn't thinking of a particular grade of steel, but this smarter version would have picked a grade and used that as part of its design. But it doesn't tell the human that; the knowledge is hidden in its weights.

Ok, suppose you could get it to show a big pile of detailed architectural plans, and then a bridge, all with super-smart neural modeling that does the calculations. Then you get something that ideally is about as good as looking at the specs of a random real-world bridge. Plenty of random real-world bridges exist, and I presume bridge builders look at their specs. Still not that useful. Each bridge has different geology, budget, height requirements etc.

 

Ok, well suppose you could start by putting all that information in somehow, and then sampling from designs that fit the existing geology, roads etc. 

Then you get several problems.

The first is that this is sampling plausible specs, not good specs. Maybe it shows a few pictures at the end to show the bridge not immediately collapsing. But not immediately collapsing is a low bar for a bridge. If the Super-SORA chose a type of paint that was highly toxic to local fish, it wouldn't tell you. If the bridge had a 10% chance of collapsing, it's randomly sampling a plausible timeline, so 90% of the time it shows you the bridge not collapsing. If it only generates 10 minutes of footage, you don't know what might be going on in its sim while you weren't watching. If it generates 100 years of footage from every possible angle, it's likely to record predictions of any problems, but good luck finding the needle in the haystack. Like, imagine this AI has just given you 100 years of footage. How do you skim through it without missing stuff?

Another problem is that SORA is sampling in the statistical sense. Suppose you haven't done the geology survey yet. SORA will guess at some plausible rock composition. This could lead to you building half the bridge, and then finding that the real rock composition is different.

You need a system that can tell you "I don't know fact X, go find it out for me". 

If the predictions are too good, well the world it's predicting contains Super-SORA. This could lead to all sorts of strange self fulfilling prophecy problems. 

Comment by Donald Hobson (donald-hobson) on Phallocentricity in GPT-J's bizarre stratified ontology · 2024-02-17T17:38:50.874Z · LW · GW

OK, so maybe this is a cool new way to look at at certain aspects of GPT ontology... but why this primordial ontological role for the penis? I imagine Freud would have something to say about this. Perhaps I'll run a GPT4 Freud simulacrum and find out (potentially) what.

 

My guess is that humans tend to use a lot of vague euphemisms when talking about sex and genitalia. 

In a lot of contexts, "Are they doing it?" would refer to sex, because humans often prefer to keep some level of plausible deniability.

Which leaves some belief that vagueness implies sexual content. 

Comment by Donald Hobson (donald-hobson) on What are the known difficulties with this alignment approach? · 2024-02-17T15:46:56.239Z · LW · GW

In more "slow takeoff" scenarios. Your approach can probably be used to build something that is fairly useful at moderate intelligence.  So for a few years in the middle of the red curve, you can get your factories built for cheap. Then it hits the really steep part, and it all fails. 

I think the "slow" and "fast" models only disagree in how much time we spend in the orange zone before we reach the red zone. Is it enough time to actually build the robots?

I assign fairly significant probabilities to both "slow" and "fast" models. 

Comment by Donald Hobson (donald-hobson) on What are the known difficulties with this alignment approach? · 2024-02-14T17:17:41.711Z · LW · GW

I added the below. I believe most of your objections are simply wrong because this method

If you are mostly learning from imitating humans, and only using a small amount of RL to adjust the policy, that is yet another thing.

I thought you were talking about a design built mainly around RL.

If it's imitating humans, you get a fair bit of safety, but it will be about as smart as humans. It's not trying to win, it's trying to do what we would do. 

A neural or hybrid sim. It came from predicting future frames from real robotics data.

Ok. So you take a big neural network, and train it to predict the next camera frame. No Geiger counter in the training data? None in the prediction. Your neural sim may well be keeping track of the radiation levels internally, but it's not saying what they are. If the AI's plan starts by placing buckets over all the cameras, you have no idea how good the rest of the plan is. You are staring at a predicted inside of a bucket.

nothing special, design it like a warehouse.

Except there is something special. There always is. Maybe this substation really better not produce any EMP effects, because sensitive electronics are next door. So the whole building needs a Faraday cage built into the walls. Maybe the location it's being built at is known for its heavy snow, so you better give it a steep sloping roof. Oh and you need to leave space here for the cryocooler pipes. Oh and you can't bring big trucks in round this side, because the fuel refinement facility is already there. Oh and the company we bought cement from last time has gone bust. Find a new company to buy cement from, and make sure it's good quality. Oh and there might be a population of bats living nearby. Don't use any tools that produce lots of ultrasound. 

It cannot desync because the starting state is always the present frame.

Let's say someone spills coffee in a laptop. It breaks. Now, to fix it, some parts need to be replaced. But which parts? That depends on exactly where the coffee dribbled inside it. Not something that can be predicted. You must handle the uncertainty. Test parts to see if they work. Look for damage marks. 

 

I think this system, as you are describing it now, is something that might kind of work. I mean the first 10 times it will totally screw up. But we are talking about a semi-smart but not that smart AI trained on a huge number of engineering examples. With time it could become mostly pretty competent, with humans patching it every time it screws up.

One problem is that you seem to be working on a "specifications" model. Where people first write flawless specifications, and then build things to those specs. In practice there is a fair bit of adjusting. The specs for the parts, as written beforehand, aren't flawless, at best they are roughly correct. The people actually building the thing are talking to each other, trying things out IRL and adjusting the systems so they actually work together. 

"ok I finished the prototype stellarator, you saw every step. Build another, ask for help when needed"

And the AI does exactly the same thing again. Including manufacturing the components that turned out not to be needed, and stuffing them in a cupboard in the corner. Including using the cables that are 2x as thick as needed because the right grade of cable wasn't available the first time. 

"Ok I want a stellarator.". You were talking about 1000x labor savings. And deciding which of the many and various fusion designs to work on is more than 0.1% of the task by itself. I mean you can just pick out of a hat, but that's making things needlessly hard for yourself.

Comment by Donald Hobson (donald-hobson) on Processor clock speeds are not how fast AIs think · 2024-02-14T16:46:29.081Z · LW · GW

this is constraining your search. You may not be able to find a meaningful improvement over the sota with that constraint in place, regardless of your intelligence level.

I mean the space of algorithms  that can run on an existing chip is pretty huge. Yes it is a constraint. And it's theoretically possible that the search could return no solutions, if the SOTA was achieved with Much better chips, or was near optimal already, or the agent doing the search wasn't much smarter than us. 

For example, there are techniques that decompose a matrix into its largest eigenvectors, which work great without needing sparse hardware. 
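For instance, a truncated decomposition along those lines (here via SVD, a standard numpy sketch rather than anything novel) runs fine on ordinary dense hardware:

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.standard_normal((512, 512))
    x = rng.standard_normal(512)

    # Keep only the k largest singular directions (the "largest eigenvectors" idea).
    k = 32
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A, B = U[:, :k] * s[:k], Vt[:k]   # W is approximated by A @ B, two thin dense matrices

    # Applying the factored form costs O(n*k) dense multiply-adds; no sparsity needed.
    approx = A @ (B @ x)
    print(np.linalg.norm(W @ x - approx) / np.linalg.norm(W @ x))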

Comment by Donald Hobson (donald-hobson) on Processor clock speeds are not how fast AIs think · 2024-02-14T01:12:44.282Z · LW · GW

Same idea though. I don't see why "the military" can't do recursion using their own AIs and use custom hardware to outcompete any "rogues".

 

One of the deep fundamental reasons here is alignment failures. Either the "military" isn't trying very hard, or humans know they haven't solved alignment. Humans know they can't build a functional "military" AI; all they can do is make another rogue AI. Or the humans don't know that, and the military AI is another rogue AI. 

For this military AI to be fighting other AI's on behalf of humans, a lot of alignment work has to go right.

The second deep reason is that recursive self-improvement is a strong positive feedback loop. It isn't clear how strong, but it could be Very strong. So suppose the first AI undergoes a recursive improvement FOOM, and it happens that the rogue AI gets there before any military AI, perhaps because the creators of the military AI are taking their time to check the alignment theory. 

Positive feedback loops tend to amplify small differences.

 Also, about all those hardware differences. A smart AI might well come up with a design that efficiently uses old hardware. Oh, and this is all playing out in the future, not now. Maybe the custom AI hardware is everywhere by the time this is happening. 

I suspect if AI is anything like computer graphics there will be at least 5-10 paradigm shifts to new architectures that need updated hardware to run, obsoleting everything deployed, before settling in something that is optimal. Flops are not actually fungible and Turing complete doesn't mean your training run will complete this century.

This is with humans doing the research. Humans invent new algorithms more slowly than new chips are made. So it makes sense to adjust the algorithm to the chip. If the AI can do software research far faster than any human, adjusting the software to the hardware (an approach that humans use a lot throughout most of computing) becomes an even better idea. 

Comment by Donald Hobson (donald-hobson) on What are the known difficulties with this alignment approach? · 2024-02-14T01:00:12.145Z · LW · GW

Ok to drill down: the AI is a large transformer architecture control model. It was initially trained by converting human and robotic actions to a common token representation that is perspective independent and robotic actuator interdependent. (Example "soft top grab, bottom insertion to Target" might be a string expansion of the tokens)

That is rather different from the architecture I thought you were talking about. But ok. I can roll with that.

You then train via reinforcement learning on a simulation of the task environments for task effectiveness. 

You are assuming as given a simulation.  Where did this simulation come from? What happens when the simulation gets out of sync with reality?  

But Ok. I will grant that you have somehow built a flawless simulation. Let's say you found a hypercomputer and coded quantum mechanics into it.

So now we have the question: how do the tokens match up with the simulation? Those tokens are "actuator independent". (A silly concept; sometimes the approach will depend A LOT on exactly what kind of actuators you are using. Some actuators must set up a complex system of levers and winches, while a stronger actuator can just pick up the heavy object. Some actuators can pick up hot stuff, others must use tongs. Some can fit in cramped spaces. Others must remove other components in order to reach.)

We need raw motor commands, both in reality and in the quantum simulation. So let's also grant you a magic oracle that takes in your common tokens and turns them into raw motor commands. So when you say "pick up this component, and put it here", it's the oracle that determines whether the sensitive component is slammed down at high speed, or whether something else is disturbed as you reach over. Let's assume it makes good decisions here somehow.

or the simulation environment during the RL stages rewarded such actions.

Yes. That. Now the problems you get when doing end-to-end RL are different from when doing RL over each task separately. If you get a human to break something down into many small easy tasks, then you get local Goodharting. Like using explosives to move things, because the task was to move object A to position B, not to move it without damaging it. 

If you do RL training over the whole thing, i.e. reinforce on fusion happening in the fusion reactor example, then you get a plan that actually causes fusion to happen. This doesn't involve randomly blowing stuff up to move things. This long-range optimization has fewer random industrial-accident stupidities, and more deep AI problems. 

For example if the machine has seen, and practiced, oiling and inserting 100 kinds of bolt, a new bolt that is somewhere in properties in between the extreme ends the machine has capabilities on will likely work zero shot.

Imagine you had a machine that could instantly oil and insert any kind of bolt. Now make a fusion reactor with 1000x less labour. Oh wait, the list of things that people designing fusion reactors spend >0.1% of their time on is pretty long and complicated. 

What's more, we can use the economics test. Oiling and inserting bolts isn't something that takes a PhD in nuclear physics. Yet a lot of the people designing fusion reactors do have a PhD in nuclear physics. 

For supervision you have a simple metric : you query a lockstep sim each frame for the confidence and probability distribution of outcomes expected on the next frame.

I will grant you that you somehow manage to keep the simulation in lockstep with reality. 

Then the difficult bit is keeping the sim in lockstep with what you actually want. Say the fastest maintenance procedure that the AI finds involves breaking open the vacuum chamber. It happens that this will act as a vacuum cannon, firing a small nut at bullet-like speeds out the window. To the AI that is only being reinforced on [does reactor work] and [does it leak radiation], firing nuts at high speed out the window is the most efficient action. The simulated nut flies out the simulated window in exactly the same way the real one does. 

A human just reading the list of actions would see "open vacuum valve 6" and not be easily able to deduce that a nut would fly out the window. 
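For concreteness, the supervision loop being proposed looks something like this. The sim, robot and policy objects are hypothetical interfaces of my own, and the final comment is the point of the example above:

    def supervised_step(sim, robot, policy, surprise_threshold=1e-4):
        """One control step with the proposed lockstep-sim check (hypothetical interfaces)."""
        frame = robot.observe()
        action = policy(frame)
        predicted = sim.predict_next_frame_distribution(frame, action)  # assumed sim API
        robot.act(action)
        outcome = robot.observe()
        # Halt if reality looks too unlike anything the sim expected.
        if predicted.likelihood(outcome) < surprise_threshold:
            robot.emergency_stop()
        # The nut-out-the-window case sails straight through this check: the sim
        # predicts the nut's flight perfectly, and nothing here encodes that it's bad.
        return outcome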

You also obviously must at first take precautions: operate in human free environments separated by lexan shields, and well it's industry. A few casualties are normal and humanity can frankly take a few workers killed if the task domain was riskier with humans doing it. 

Ok. So setting all that up is going to take way more than 0.1% of the worker time. Someone has to build all those shields and put them in place.

Real human workers can and do order custom components from various other manufacturers. This doesn't fit well with your simulation, or with your safety protocol.

But if you are only interested in the "big" harms: how about if the AI decides that the easiest way to make a fusion reactor is to first make self-replicating nanotech? Some of this gets out and grey-goos the earth. 

Or the AI decides to get some computer chips, and code a second AI. The second AI breaks out and does whatever. 

Or, what was the goal for that fusion bot again? Make the fusion work; don't release radioactive stuff off premises. Couldn't it detonate a pure fusion bomb? No radioactive stuff leaving, only very hot helium. 

Human grad students also make the kind of errors you mention, over torque is a common issue.

Recognizing and fixing mistakes is fairly common work in high tech industries. It's not clear how the AI does this. But those are mistakes. What I was talking about was if the AI knew full well it was doing damage, but didn't care.

I would expect you would first have proven your robotics platform and stack with hundreds of millions of robots on easier tasks before you can deploy to domains with high vacuum chamber labs.

You were the one who used a fusion reactor as an example. 

So you're saying the robots can only build a fusion reactor after they have first built millions of easier things as training? 

 

Would this AI you are thinking of be given a task like "build a fusion reactor" and be left to decide for itself whether a stellarator or laser confinement system was better? 

Comment by Donald Hobson (donald-hobson) on Processor clock speeds are not how fast AIs think · 2024-02-13T23:34:02.278Z · LW · GW

As far as I know with LLM experiments, there are tweaks to architecture but the main determinant for benchmark performance is model+data scale (which are interdependent), and non transformer architectures seem to show similar emergent properties.

 

So within the rather limited subspace of LLM architectures, all architectures are about the same. 

I.e. once you ignore the huge space of architectures that just ignore the data and squander compute, then architecture doesn't matter. I.e. we have one broad family of techniques (with gradient descent, text prediction, etc.), and anything in that family is about equally good. And anything outside basically doesn't work at all. 

This looks to me to be fairly strong evidence that you can't get a large improvement in performance by randomly bumbling around with small architecture tweaks to existing models. 

Does this say anything about whether a fundamentally different approach might do better? No. We can't tell that from this evidence. Although, looking at the human brain, we can see it seems to be more data-efficient than LLMs. And we know that in theory models could be much more data-efficient. Addition is very simple. Solomonoff induction would have it as a major hypothesis after seeing only a couple of examples. But GPT-2 saw loads of arithmetic in training, and still couldn't reliably do it. 

So I think LLM architectures form a flat-bottomed local semi-minimum (minimal in at least most dimensions). It's hard to get big improvements just by tweaking them (we are applying enough grad student descent to ensure that), but they are nowhere near the global optimum. 

Suppose everything is really data-bottlenecked, and the slower AI has a more data-efficient algorithm. Or maybe the slower AI knows how to make synthetic data, and the human-trained AI doesn't. 

Comment by Donald Hobson (donald-hobson) on What are the known difficulties with this alignment approach? · 2024-02-13T23:01:44.095Z · LW · GW

Suppose you give the AI a short-duration, discrete task: pick up this box and move it over there. The AI chooses to detonate a nearby explosive, sending everything in the lab flying wildly all over the place. And indeed, the remains of the box are mostly over there. 

 

Ok. Maybe you give it another task: unscrew a stuck bolt. The robot gets a big crowbar and levers the bolt. The thing it's pushing against for leverage is a vacuum chamber, which is slightly deformed by the force, causing it to leak.

Or maybe it sprays some chemical on the bolt, which dissolves it. And in a later step, something else reacts with the residue, creating a toxic gas. 

I think you need to micromanage the AI, specifying every possible thing in a lot of detail. I don't think you get a 10x labor saving. I am unconvinced you get any labor saving at all. 

After all, to do the task yourself, you just need to find 1 sane plan. But to stop the AI from screwing up, you need to rule out every possible insane plan. Or at least repeatedly read the AI's plan, spot that it's insane, and tell it not to use explosives to mix paint. 

Comment by Donald Hobson (donald-hobson) on Processor clock speeds are not how fast AIs think · 2024-02-13T22:45:30.074Z · LW · GW

(Because the "military" AIs working with humans will have this kind of hardware to hunt them down with)

You need to make a lot of extra assumptions about the world for this reasoning to work. 

These "military" AI's need to exist. And they need to be reasonably loosely chained. If their safety rules are so strict they can't do anything, they can't do anything however fast they don't do it. They need to be actively trying to do their job, as opposed to playing along for the humans but not really caring. They need to be smart enough. If the escaped AI uses some trick that the "military" AI just can't comprehend, then it fails to comprehend again and again, very fast. 

Comment by Donald Hobson (donald-hobson) on What are the known difficulties with this alignment approach? · 2024-02-13T21:48:15.500Z · LW · GW

In the limit of pushing all the work onto humans, you just have humans building a fusion reactor. 

Which is a sensible plan, but is not AI. 

If you have a particular list in mind for what you consider dangerous, I suspect your "red teaming" approach might catch it.

Like I think that, in this causal graph setup, it's not too hard to stop excess radiation leaking out, if you realize that radiation is a danger and work to stop it. 

This doesn't give you a defence against the threats you didn't imagine and the threats you can't imagine. 

Comment by Donald Hobson (donald-hobson) on What are the known difficulties with this alignment approach? · 2024-02-13T21:02:11.036Z · LW · GW

One fairly obvious failure mode is that it has no checks on the other outputs.

So from my understanding, the AI is optimizing its actions to produce a machine that outputs electricity and helium. Why does it produce a fusion reactor, not a battery and a leaking balloon? 

A fusion reactor will in practice leak some amount of radiation into the environment. This could be a small negligible amount, or a large dangerous amount. 

If the human knows about radiation and thinks of this, they can put a maximum on radiation leaked into the goal. But this is pushing the work onto the humans. 

From my understanding of your proposal, the AI is only thinking about a small part of the world. Say a warehouse that contains some robotic construction equipment, and that you hope will soon contain a fusion reactor, and that doesn't contain any humans. 

The AI isn't predicting the consequences of its actions over all space and time. 

Thus the AI won't care if humans outside the warehouse die of radiation poisoning, because it's not imagining anything outside the warehouse. 

So, you included radiation levels in your goal. Did you include toxic chemicals? Waste heat? Electromagnetic effects from those big electromagnets, which could mess with all sorts of electronics? Bioweapons leaking out? I mean, if it's designing a fusion reactor and any bio-nasties are being made, something has gone wrong. What about nanobots? Self-replicating nanotech sure would be useful for constructing the fusion reactor. Does the AI care if an odd nanobot slips out and grey-goos the world? What about other AIs? Does your AI care if it makes a "maximize fusion reactors" AI that fills the universe with fusion reactors? 

Comment by Donald Hobson (donald-hobson) on Leading The Parade · 2024-02-13T17:46:13.719Z · LW · GW

I think you are completely overlooking a significant chunk of impact. Suppose that technologies A and B are similar. The techs act as substitutes, say several different designs of engine or something. And if everyone is using tech X, the accumulated experience makes X the better choice. This hands long-term control of which path tech goes down to whoever got there first. Could electric cars have taken off before petrol if someone else had led that parade? 

There are plenty of substances that increase fuel octane, so if someone else had led the parade around a substance that didn't contain lead, a lot of brain damage could have been prevented. 

If some non-military group had led nuclear energy, would reactors use thorium instead of uranium?

Comment by Donald Hobson (donald-hobson) on Notes on Innocence · 2024-01-28T21:15:54.364Z · LW · GW

When I try to think of a utopian future, the people in that world understand the concept of such non-innocent things, but correctly assign them a very low prior probability in their typical interactions with each other. 

Think of dath ilani Keepers. Of course they understand the concept of malicious deception, and of course they don't expect it from each other. 

Part of this is just "have a good prior about the likelihood of deception/innuendo etc.", with a prior of 0 acting as a defense against an overly high prior.

Partly, these could be considered low-grade infohazards: things it is unpleasant but sometimes useful to know. 

And for such things, there are two good approaches: fortifying your mind until you can deal with them without strain, or hiding in a protected bubble where you don't need to worry about them. The second strategy relies on someone else being competent and powerful enough to protect you.

Comment by Donald Hobson (donald-hobson) on Epistemic Hell · 2024-01-28T20:56:32.135Z · LW · GW

Quantum many worlds

Comment by Donald Hobson (donald-hobson) on legged robot scaling laws · 2024-01-28T01:52:43.034Z · LW · GW

How efficient could walking be with effective regenerative braking in each step?

Comment by Donald Hobson (donald-hobson) on A Shutdown Problem Proposal · 2024-01-23T21:53:20.732Z · LW · GW

Yes. I was assuming a standard conditional for the button. 

I can't currently see any flaws with the CDT style, other than:

  1. Subagents believe in a world where buttons magically press themselves. So this design can't make coherent statements about the probability that the button will be pressed (one subagent believes it's 1, the other that it's 0).
  2. These AIs have no incentive to give humans access to the button. To the AIs, they have a magic button that might or might not magically press itself, and they have a lot of utility bet on that button. Is that button going to end up in a high-security vault, surrounded by sensors and no humans? Both AIs would like that very much. The AIs have zero concern about humans pressing the button, but lots of concern about humans hiding the button. This design really wants to know whether the button magically presses itself, and humans could cut the wires, stand between the button and the camera, etc. 
Comment by Donald Hobson (donald-hobson) on A Shutdown Problem Proposal · 2024-01-23T18:11:18.537Z · LW · GW

Suppose these 3 scenarios are judged as equally likely.

A) Button pressed. U_1=U_2=0

B) Button not pressed. U_1=U_2=0

C) AI gets control of button. U_1=6, U_2=-6

 

Suppose both agents agree that in option C, they should press the button (this single bit is the only influence they have in this toy problem).

Then agent 1 has a "certainty" of B: E(U_1 | not pressed) = E(U_1 | B) = 0.

Agent 2 has remaining uncertainty: E(U_2 | pressed) = (E(U_2 | A) + E(U_2 | C))/2 = (0 + (-6))/2 = -3.

 

Change it around, making the AI decide not to press the button in case C, and:

E(U_1 | not pressed) = 3, E(U_2 | pressed) = 0.
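
For concreteness, here is a minimal sketch of that calculation in Python (my own toy code, nothing from the proposal itself). The only free choice is whether the AI presses the button in scenario C, and the two conditional expectations come out as above.

```python
# My own toy calculation of the numbers above (not from the proposal itself).
scenarios = {
    "A": {"pressed": True,  "u1": 0, "u2": 0},   # button pressed
    "B": {"pressed": False, "u1": 0, "u2": 0},   # button not pressed
    "C": {"pressed": None,  "u1": 6, "u2": -6},  # AI gets control of the button
}

def expectations(ai_presses_in_C):
    """Return (E[U_1 | not pressed], E[U_2 | pressed]) over equally likely scenarios."""
    u1_not_pressed, u2_pressed = [], []
    for s in scenarios.values():
        pressed = ai_presses_in_C if s["pressed"] is None else s["pressed"]
        (u2_pressed if pressed else u1_not_pressed).append(s["u2"] if pressed else s["u1"])
    mean = lambda xs: sum(xs) / len(xs)
    return mean(u1_not_pressed), mean(u2_pressed)

print(expectations(ai_presses_in_C=True))   # (0.0, -3.0): agent 1 certain of B
print(expectations(ai_presses_in_C=False))  # (3.0, 0.0): agent 2 certain of A
```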

Comment by Donald Hobson (donald-hobson) on A Shutdown Problem Proposal · 2024-01-23T16:06:50.541Z · LW · GW

Third claim: because both subagents act-as-though the button state is not causally downstream of anything else in the universe, the agent will not attempt to manipulate the button state.

I disagree. 

Suppose there is a paperclip lottery: if the AI wins, it gets loads of clips, and it becomes harder to shut down.

The AI has been entered by someone else; not playing isn't an option. 

Both subagents imagine a plan to make sure that, if they win, the button isn't pressed.

To the paperclip AI, this is a good deal. The conditional expectation acts as a sort of average of all scenarios where the button isn't pressed, and this scenario has more clips than most, so including it increases the average.

For the shutdown AI, this is an unusually bad scenario, and taking it out of their expectation also increases conditional expected utility. 

So both subagents agree on a plan that conditionally manipulates the button.
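
A rough numerical sketch of why both conditional expectations move in the plan's favour (all probabilities and utilities below are my own made-up numbers, just to illustrate the mechanism, not anything from the post):

```python
# Made-up toy numbers: a 10% lottery, a 50% baseline chance of the button being
# pressed, and a plan that forces "not pressed" whenever the AI wins.
def conditional_expectations(manipulate):
    p_win, p_press = 0.1, 0.5
    outcomes = []  # (probability, won_lottery, button_pressed)
    for won, p_w in [(True, p_win), (False, 1 - p_win)]:
        press_prob = 0.0 if (won and manipulate) else p_press
        outcomes.append((p_w * press_prob, won, True))
        outcomes.append((p_w * (1 - press_prob), won, False))

    clips = lambda won: 100 if won else 10   # paperclipper's utility
    u_shut = lambda won: -10 if won else 0   # shutdown subagent's utility (winning is bad)

    def cond_mean(value, want_pressed):
        num = sum(p * value(won) for p, won, pressed in outcomes if pressed == want_pressed)
        den = sum(p for p, won, pressed in outcomes if pressed == want_pressed)
        return num / den

    return cond_mean(clips, want_pressed=False), cond_mean(u_shut, want_pressed=True)

print(conditional_expectations(manipulate=False))  # baseline: (19.0, -1.0)
print(conditional_expectations(manipulate=True))   # with the plan: (~26.4, 0.0), both higher
```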

Comment by Donald Hobson (donald-hobson) on Bayesians Commit the Gambler's Fallacy · 2024-01-17T23:35:05.047Z · LW · GW

You are smuggling your conclusion in via slight technical choices in how the switchy and sticky processes are defined. 

If we make the process Markovian, i.e. the probability of getting heads depends only on whether the previous flip was heads, then this effect disappears. 

If we make the switchy or sticky bias strongest after a long sequence of switches, this also disappears. 

You need to justify why switchy/sticky processes should use these switchy/sticky probabilities.
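
As a sketch of the Markovian alternative (the 0.6 switch probability and everything else here are my own made-up numbers, not from the paper): simulate a chain where the chance of switching depends only on the previous flip, and check that the empirical switch rate does not depend on streak length.

```python
# Made-up numbers: P(switch) = 0.6 regardless of how long the current streak is.
import random
from collections import defaultdict

random.seed(0)
p_switch = 0.6
flips = [0]
for _ in range(200_000):
    flips.append(1 - flips[-1] if random.random() < p_switch else flips[-1])

switches, totals = defaultdict(int), defaultdict(int)
streak = 1
for prev, cur in zip(flips, flips[1:]):
    totals[streak] += 1
    if cur != prev:
        switches[streak] += 1
        streak = 1
    else:
        streak += 1

for k in range(1, 6):
    print(k, round(switches[k] / totals[k], 3))  # roughly 0.6 at every streak length
```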

Comment by Donald Hobson (donald-hobson) on An even deeper atheism · 2024-01-12T02:58:56.721Z · LW · GW

Toy example: suppose every person wants lots of tasty food for themselves, and no one cares in the slightest about other people starving. 

In this scenario, everyone is a paperclipper with respect to everyone else, and yet we can all agree that it's a good idea to build a "feed everyone AI". 

Sometimes you don't need your values to be in control, you just need them to be included.

Comment by Donald Hobson (donald-hobson) on What’s up with LLMs representing XORs of arbitrary features? · 2024-01-08T23:28:43.329Z · LW · GW

The connection to features is that if the answer is no, there is no possible way the network could linearly represent arbitrary XOR combos of features. It must only be representing some small subset of them (probably the XORs of 2 or 3 features, but not of 100 features).

Also, your maths description of the question matches what I was trying to express. 

Comment by Donald Hobson (donald-hobson) on What’s up with LLMs representing XORs of arbitrary features? · 2024-01-06T19:42:39.838Z · LW · GW

Take the set of all combinations of the binary features; this set contains exponentially many points. Is there any function mapping this set into a vector space such that all exponentially many XOR combos can be found by a linear probe? 

This is a question of pure maths, it involves no neural networks. And I think it would be highly informative.
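
This doesn't settle the existence question, but here is a quick empirical sketch of its flavour (arbitrary sizes and a random ReLU embedding, chosen by me rather than taken from any particular network): how well does a linear probe read off the XOR of k features?

```python
# Not an answer to the maths question, just an empirical probe with arbitrary
# sizes: a random ReLU embedding of 12 binary features, and a linear probe
# trying to read off the XOR (parity) of the first k of them.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_feats, d_embed, n_samples = 12, 512, 4000

X = rng.integers(0, 2, size=(n_samples, n_feats))   # binary feature vectors
W = rng.normal(size=(n_feats, d_embed))
emb = np.maximum(X @ W, 0)                           # random nonlinear embedding

for k in (2, 3, 6, 12):
    y = X[:, :k].sum(axis=1) % 2                     # XOR of the first k features
    probe = LogisticRegression(max_iter=2000).fit(emb[:3000], y[:3000])
    print(k, round(probe.score(emb[3000:], y[3000:]), 2))
```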

Comment by Donald Hobson (donald-hobson) on Idealized Agents Are Approximate Causal Mirrors (+ Radical Optimism on Agent Foundations) · 2023-12-31T11:19:05.377Z · LW · GW

You are making the structure of time into a fundamental part of your agent design, not a contingency of physics.

Let an aput be an input or an output. Let a policy be a subset of possible aputs. Some policies are physically valid. 

I.e. a policy must have the property that, for each input, there is a single output. If the computer is reversible, the policy must be a bijection from inputs to outputs. If the computer can create a contradiction internally, stopping the timeline, then a policy need only map each input to at most one output. 

If the agent is actually split into several pieces with lightspeed and bandwidth limits, then the policy mustn't use info it can't have. 

But these physical details don't matter. 

The agent has some set of physically valid policies, and it must pick one. 
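
As a minimal sketch of that framing (my own toy encoding, flattening "a subset of aputs" down to a set of (input, output) pairs): each physical situation is just a different validity predicate over the same space of policies, and the agent picks one policy from the valid set.

```python
# Toy encoding (mine, not from the comment): a policy is a set of (input, output)
# pairs, and different physical situations impose different validity conditions.
from itertools import combinations

INPUTS, OUTPUTS = ["i0", "i1"], ["o0", "o1"]
PAIRS = [(i, o) for i in INPUTS for o in OUTPUTS]

def outputs_of(policy, inp):
    return {o for i, o in policy if i == inp}

def at_most_one_output(policy):   # computer that can stop the timeline
    return all(len(outputs_of(policy, i)) <= 1 for i in INPUTS)

def exactly_one_output(policy):   # ordinary deterministic computer
    return all(len(outputs_of(policy, i)) == 1 for i in INPUTS)

def bijection(policy):            # reversible computer
    return exactly_one_output(policy) and len({o for _, o in policy}) == len(INPUTS)

all_policies = [set(c) for r in range(len(PAIRS) + 1) for c in combinations(PAIRS, r)]
for name, valid in [("partial", at_most_one_output), ("function", exactly_one_output),
                    ("bijection", bijection)]:
    print(name, sum(valid(p) for p in all_policies))
# partial 9, function 4, bijection 2 -- the agent just picks one valid policy.
```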

Comment by Donald Hobson (donald-hobson) on The Mountain Troll · 2023-12-30T01:03:46.121Z · LW · GW

As a human mind, I have a built-in default system of beliefs. That system is a crude "sounds plausible" intuition. This mostly works pretty well, but it isn't perfect.

This crude system heard about probability theory, and assigned it a "seems true" marker. The background system, as used before learning probability theory, kind of roughly approximates part of probability theory. But it's not a system that produces explicit numbers. 

So I can't assign a probability to Bayesianism being true, because the part of my mind that decided it was true isn't using explicit probabilities, just feelings. 

Comment by Donald Hobson (donald-hobson) on Social Dark Matter · 2023-11-25T00:06:12.917Z · LW · GW

bug secretions must be good, actually, or at least they can be good!”

Honey?

Comment by Donald Hobson (donald-hobson) on Threat-Resistant Bargaining Megapost: Introducing the ROSE Value · 2023-11-21T18:48:20.218Z · LW · GW

Suppose Bob is a baker who has made some bread. He can give the bread to Alice, or bin it. 

By the ROSE value, Alice should pay $0.01 to Bob for the bread.

How is an honest baker supposed to make a profit like that?

But suppose, before the bread is baked, Bob phones Alice. 

"Well the ingredients cost me $1" he says, "how much do you want the bread?"

If Alice knows pre-baking that she will definitely want bread, she would commit to paying $1.01 for it, if she valued the bread at at least that much. If Alice has a 50% chance of wanting bread, she could pay $1.01 with certainty, or equivalently pay $2.02 in the cases where she did want the bread. The latter makes sense if Alice only pays in cash and will only drive into town if she does want the bread.

If Alice has some chance of really wanting bread, and some chance of only slightly wanting bread, it's even more complicated. The average bill across all worlds is $1.01, but each alternate version of Alice wants to pay less than that.

Comment by Donald Hobson (donald-hobson) on SIA > SSA, part 1: Learning from the fact that you exist · 2023-11-20T19:16:00.833Z · LW · GW

Personally I think both SSA and SIA are wrong.