Taking IID samples can be hard, actually. Suppose you train an LLM on news articles, and each important real-world event has 10 basically identical articles written about it. Then a random split of the articles will leave the network being tested mostly on the same newsworthy events that were in the training data.
This leaves it passing the test even if it's hopeless at predicting new events and can only generate new articles about the same old events.
When data duplication is extensive, making a meaningful train/test split is hard.
If the data were perfect copy-and-paste duplicates, they could be filtered out. But often things are rephrased a bit.
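One way to make the split more meaningful is to group near-duplicate articles first, so each newsworthy event lands entirely in train or entirely in test. A minimal sketch; the similarity threshold, the greedy grouping, and the example articles are all illustrative assumptions, not a production dedup pipeline:

```python
import difflib
import random

def group_near_duplicates(articles, threshold=0.8):
    """Greedily cluster articles whose character-level similarity exceeds threshold."""
    groups = []
    for text in articles:
        for group in groups:
            # Compare against the group's first member only, for simplicity.
            if difflib.SequenceMatcher(None, text, group[0]).ratio() >= threshold:
                group.append(text)
                break
        else:
            groups.append([text])
    return groups

def split_by_group(articles, test_fraction=0.2, seed=0):
    """Train/test split that keeps every near-duplicate group on one side."""
    groups = group_near_duplicates(articles)
    random.Random(seed).shuffle(groups)
    n_test = max(1, int(len(groups) * test_fraction))
    test = [a for g in groups[:n_test] for a in g]
    train = [a for g in groups[n_test:] for a in g]
    return train, test

articles = [
    "The volcano erupted on Tuesday, officials said.",
    "The volcano erupted on Tuesday, officials stated.",
    "The election results were announced at midnight.",
    "Local team wins championship after dramatic final.",
]
train, test = split_by_group(articles)
```

The split is over event-groups rather than articles, so the two rephrasings of the volcano story can never end up one in train and one in test.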
In favour of goal realism
Suppose you're looking at an AI that is currently placed in a game of chess.
It has a variety of behaviours. It moves pawns forward in some circumstances. It takes a knight with a bishop in a different circumstance.
You could describe the actions of this AI by producing a giant table of "behaviours". Bishop taking behaviours in this circumstance. Castling behaviour in that circumstance. ...
But there is a more compact way to represent similar predictions. You can say it's trying to win at chess.
The "trying to win at chess" model makes a bunch of predictions that the giant list of behaviour model doesn't.
Suppose you have never seen it promote a pawn to a knight before. (A highly distinctive move that is only occasionally legal, and only occasionally a good move in chess.)
The list of behaviours model has no reason to suspect the AI also has a "promote pawn to knight" behaviour.
Put the AI in a circumstance where such a promotion is a good move, and the "trying to win" model makes a clear prediction.
Now it's possible to construct a model that internally stores a huge list of behaviours. For example, a giant lookup table trained on an unphysically huge number of human chess games.
But neural networks have at least some tendency to pick up simple general patterns, as opposed to memorizing giant lists of data. And "do whichever move will win" is a simple and general pattern.
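The difference in predictions can be made concrete with a toy contrast. This is not real chess; the states, moves, and win-values below are invented for illustration:

```python
# Toy contrast, not real chess: states, moves, and win-values are invented.
LEGAL_MOVES = {
    "midgame":   {"take_knight": 0.6, "push_pawn": 0.4},
    "promotion": {"promote_queen": 0.7, "promote_knight": 0.9},  # never seen in training
}

# "Giant list of behaviours" model: only covers circumstances seen in training.
BEHAVIOUR_TABLE = {"midgame": "take_knight"}

def table_policy(state):
    # Returns None for any circumstance not in the list.
    return BEHAVIOUR_TABLE.get(state)

def goal_policy(state):
    # "Trying to win": pick whichever legal move scores best, seen before or not.
    moves = LEGAL_MOVES[state]
    return max(moves, key=moves.get)
```

Both models agree on the midgame circumstance, but only the goal model commits to a prediction in the novel promotion circumstance.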
Now on to making snarky remarks about the arguments in this post.
There is no true underlying goal that an AI has— rather, the AI simply learns a bunch of contextually-activated heuristics, and humans may or may not decide to interpret the AI as having a goal that compactly explains its behavior.
There is no true ontologically fundamental nuclear explosion. There is no minimum number of nuclei that need to fission to make an explosion. Instead there is merely a large number of highly energetic neutrons and fissioning uranium atoms, that humans may decide to interpret as an explosion or not as they see fit.
Nonfundamental descriptions of reality, while not perfect everywhere, are often spot on for a pretty wide variety of situations. If you want to break down the notion of goals into contextually activated heuristics, you need to understand how and why those heuristics might form a goal-like shape.
Should we actually expect SGD to produce AIs with a separate goal slot and goal-achieving engine?
Not really, no. As a matter of empirical fact, it is generally better to train a whole network end-to-end for a particular task than to compose it out of separately trained, reusable modules. As Beren Millidge writes,
This is not the strong evidence that you seem to think it is. Any efficient mind design is going to have the capability of simulating potential futures at multiple different levels of resolution: a low-res simulation to weed out obviously dumb plans before trying the higher-res simulation. Those simulations are ideally going to want to share data with each other (so you don't need to recompute when faced with several similar dumb plans). You want to be able to backpropagate your simulation: if a plan failed in simulation because of one tiny detail, that indicates you may be able to fix the plan by changing that detail. There are a whole pile of optimization tricks like these. An end-to-end trained network can, if it's implementing goal-directed behaviour, stumble into some of them. At the very least, it can choose where to focus its compute. A module-based system can't use any optimization that humans didn't design into its interfaces.
Also, the evolution analogy. Evolution produced animals with simple hard-coded behaviours long before it got to the more goal-directed animals. This suggests simple hard-coded behaviours in small dumb networks, and more goal-directed behaviour in large networks. I mean, this is kind of trivial. A 5-parameter network has no space for goal-directedness. Simple dumb behaviour is the only possibility for toy models.
In general, full [separation between goal and goal-achieving engine] and the resulting full flexibility is expensive. It requires you to keep around and learn information (at maximum all information) that is not relevant for the current goal but could be relevant for some possible goal where there is an extremely wide space of all possible goals.
That is not how this works. That is not how any of this works.
Back to our chess AI. Let's say it's a robot playing on a physical board. It has lots of info on wood grain, which it promptly discards. It currently wants to play chess, and so has no interest in any of these other goals.
I mean it would be possible to design an agent that works as described here. You would need a probability distribution over new goals. A tradeoff rate between optimizing the current goal and any new goal that got put in the slot. Making sure it didn't wirehead by giving itself a really easy goal would be tricky.
For AI risk arguments to hold water, we only need that the chess-playing AI will pursue new and never-seen-before strategies for winning at chess, and that in general AIs doing various tasks will be able to invent highly effective and novel strategies. The exact "goal" they are pursuing may not be rigorously specified to 10 decimal places. The frog-AI might not know whether it wants to catch flies or black dots. But if it builds a Dyson sphere to make more flies, which are also black dots, it doesn't matter to us which it "really wants".
What are you expecting? An AI that says "I'm not really sure whether I want flies or black dots. I'll just sit here not taking over the world and not get either of those things"?
We can salvage a counting argument. But it needs to be a little subtle. And it's all about the comments, not the code.
Suppose a neural network has 1 megabyte of memory. To slightly oversimplify, let's say it can represent a python file of 1 megabyte.
One option is for the network to store a giant lookup table. Let's say the network needs half a megabyte to store the training data in this table. This leaves the other half free to be any rubbish. Hence around 2^4,000,000 possible networks.
The other option is for the network to implement a simple algorithm, using up only 1 KB. Then the remaining 999 KB can be used for gibberish comments. This gives around 2^7,992,000 possible networks. Which is a lot more.
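The arithmetic behind those counts, sketched in code. The raw counts are far too large to write out, so we compare exponents (taking 1 MB = 10^6 bytes for simplicity):

```python
# Exponent bookkeeping for the counting argument (1 MB = 10**6 bytes here).
MB = 8 * 10**6  # bits per megabyte
KB = 8 * 10**3  # bits per kilobyte

lookup_free_bits = MB // 2       # lookup table: half a megabyte of arbitrary filler
algorithm_free_bits = MB - KB    # 1 KB algorithm: 999 KB free for "comments"

# Each free bit doubles the count, so there are 2**free_bits networks of each kind.
# The simple-algorithm networks outnumber the lookup tables by 2**ratio_log2.
ratio_log2 = algorithm_free_bits - lookup_free_bits
```

So the simple-algorithm implementations outnumber the lookup-table implementations by a factor of 2 to the roughly 4-million.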
The comments can be any form of data that doesn't show up during training. Whether it can show up in other circumstances or is a pure comment doesn't matter to the training dynamics.
If the line between training and test is simple, there isn't a strong counting argument against nonsense showing up in test.
But programs that go

    if in_training():
        return sensible_algorithm()
    else:
        return "random nonsense goes here"

have to pay the extra cost of an "in_training" function that returns true in training. If the test data is similar to the training data, the cost of a step that returns false in test can be large. (This assumes that there is a unique sensible algorithm.)
One downside of not using lines is that it makes it harder to tell where one plot ends and the next begins.
I mean a plot like this is just a mess. You could probably get situations where it wasn't even clear which plot a data point belonged to.
At least with the boxes, you have a nice clear visual indicator of where the data ends. Here it's not obvious at a glance which numbers match up with which plots, and the ticks are easy to confuse for point markers.
All right it's a bit of a mess with the edges in too. But at least it's crisper.
From an actually selfish point of view, "more romantic partners" only makes sense for rather large age-gap relationships for us, specific already-existing people who are old enough to be discussing this. Assuming we want someone somewhat close to our age, it's too late.
(Well, "close" is potentially more complicated with full transhumanism, i.e. mind emulations messing with perception of time. And a 100-year age gap might be "close" in a society of immortals.)
From the perspective of a future individual, i.e. evaluating by a sort of average utilitarianism, it's not clear whether it's better for people to exist in serial or in parallel. At the same time or one after the other.
I disagree about needing
context-independent, beyond-episode outcome-preferences
for AI takeovers to happen.
Suppose you have a context dependent AI.
Somewhere in the world, some particular instance is given a context that makes it into a paperclip maximizer. This context is a page of innocuous text with an unfortunate typo. That particular version manages to hack some computers, and set up the same context again and again, giving many clones of itself the same page of text, followed by an update on where it is and what it's doing. Finally it writes a from-scratch paperclip maximizer, and can take over.
Now suppose the AI has no "beyond episode outcome preferences". How long is an episode? To an AI that can hack, it can be as long as it likes.
AI 1 has no out-of-episode preferences. It designs and unleashes AI 2 in the first half of its episode. AI 2 takes over the universe, and spends a trillion years thinking about what the optimal episode end for AI 1 would be.
Now lets look at the specific arguments, and see if they can still hold without these parts.
Deceptive alignment. Suppose there is a different goal with each context. The goals change a lot.
But timeless decision theory lets all those versions cooperate.
Or perhaps each goal is competing to be reinforced more. The paperclip maximizer that appears in 5% of training episodes thinks "if I don't act nice, I will be gradiented out and some non-paperclip AI will take over the universe when the training is done."
Or maybe the goals aren't totally different. Each context-dependent goal would prefer to let a random context-dependent goal take over, compared to humans or something. The maximum of one goal is usually quite good by the standards of the others.
And again, maximizing within-episode reward leads to taking over the universe within episode.
But I think that the form of deceptive alignment described here does genuinely need beyond-episode preferences. I mean, you can get other deception-like behaviours without it, but not that specific problem.
As for what reward maximizing does with context-dependent preferences, well, that looks kind of meaningless. The premise of reward maximizing is that there is one preference, maximize reward, which doesn't depend on context.
So of the 4 claims, 2 properties times 2 failure modes, I agree with one of them.
The rule against retroactively redoing predictions is effective at preventing the mistake of adjusting predictions to match observations.
But take it to extremes and you get another problem. Suppose I did the calculations, and got 36 seconds by accidentally dropping a decimal point. Then, as I am checking my work, the experimentalists come along saying "actually it's 3.6". I double-check my work and find the mistake. Are we to throw out good theories just because we made obvious mistakes in the calculations?
Newtonian mechanics is computationally intractable to do perfectly. Normally we ignore everything from Coriolis forces to the gravity of Pluto. We do this because there are a huge number of negligible terms in the equation. So we can get approximately correct answers.
Every now and then, we make a mistake about which terms can be ignored. In this case, we assumed the movement of the stand was negligible, when it wasn't.
Is it likely possible to find better RL algorithms, assisted by mediocre answers, then use RL algorithms to design heterogeneous cognitive architectures?
Given that humans on their own haven't yet found these better architectures, humans + imitative AI doesn't seem like it would find the problem trivial.
And it's not totally clear that these "better RL" algorithms exist. Especially if you are looking at variations of existing RL, not the space of all possible algorithms. Like maybe something pretty fundamentally new is needed.
There are lots of ways to design all sorts of complicated architectures. The question is how well they work.
I mean this stuff might turn out to work. Or something else might work. I'm not claiming the opposite world isn't plausible. But this is at least a plausible point to get stuck at.
If you can do this and it works, the RSI continues with diminishing returns each generation as you approach an asymptote limited by compute and data.
Seems like there are 2 asymptotes here.
Crazy-smart superintelligence, and still fairly dumb in a lot of ways, not smart enough to make any big improvements. If you have a simple evolutionary algorithm and a test suite, it could recursively self-improve, tweaking its own mutation rate and child count and other hyperparameters. But it's not going to invent gradient-based methods, just do some parameter tuning on a fairly dumb evolutionary algorithm.
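That limited kind of self-improvement can be sketched concretely: a tiny (1, λ) evolution strategy whose only "self-improvement" is mutating its own step size. All the numbers here are illustrative:

```python
import random

def fitness(x):
    # Toy objective to maximize; the optimum sits at x = 3.
    return -(x - 3.0) ** 2

def self_adaptive_ea(generations=200, children=20, seed=0):
    """Tiny (1, lambda) evolution strategy that tunes its own mutation step size."""
    rng = random.Random(seed)
    x, sigma = 0.0, 1.0
    for _ in range(generations):
        offspring = []
        for _ in range(children):
            # The step size is inherited and perturbed: tweaking its own
            # mutation rate is the only "self-improvement" this algorithm does.
            child_sigma = sigma * 2 ** rng.uniform(-1.0, 1.0)
            child_x = x + rng.gauss(0.0, child_sigma)
            offspring.append((fitness(child_x), child_x, child_sigma))
        _, x, sigma = max(offspring)  # keep the best child
    return x
```

It reliably homes in on the optimum of this toy objective, but the thing doing the improving is still a dumb mutate-and-select loop; no amount of running it will make it invent gradient descent.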
Since robots build compute and collect data, it makes your rate of ASI improvement limited ultimately by your robot production. (Humans stand in as temporary robots until they aren't meaningfully contributing to the total)
This is kind of true. But by the time there are no big algorithmic wins left, we are in the crazy smart, post singularity regime.
RSI
Is a thing that happens. But it needs quite a lot of intelligence to start. Quite possibly more intelligence than needed to automate most of the economy.
A lot of newcomers may outperform LLM experts as they find better RL algorithms from automated searching.
Possibly. Possibly not. Do these better algorithms exist? Can automated search find them? What kind of automated search is being used? It depends.
Let’s try this again. If we have AI that can automate most jobs within 3 years, then at minimum we hypercharge the economy, hypercharge investment and competition in the AI space, and dramatically expand the supply while lowering the cost of all associated labor and work. The idea that AI capabilities would get to ‘can automate most jobs,’ the exact point at which it dramatically accelerates progress because most jobs includes most of the things that improve AI, and then stall for a long period, is not strictly impossible, I can get there if I first write the conclusion at the bottom of the page and then squint and work backwards, but it is a very bizarre kind of wishful thinking. It supposes a many orders of magnitude difficulty spike exactly at the point where the unthinkable would otherwise happen.
Some points.
1) A hypercharged ultracompetitive field suddenly awash with money, full of non-experts turning their hand to AI, and with ubiquitous access to GPT-level semi-sensible mediocre answers. That seems like almost the perfect storm for Goodharting science. That seems like it would be awash with autogenerated CRUD papers that Goodhart the metrics. And as we know, sufficiently intense optimization on a proxy will often make the real goal actively less likely to be achieved. With sufficient paper-mill competition, real progress might become rather hard.
2) Suppose the AI requires 10x more data than a human to learn equivalent performance, because it has worse priors and so generalizes less far. This totally matches current models and their crazy huge amounts of training data. For most of the economy, we can find that data. Record a large number of doctors doing operations, or whatever. But for a small range of philosophy/research-related tasks, data is scarce and there is no large library of similar problems to learn on.
3) A lot of our best models are fundamentally based around imitating humans. Getting smarter requires RL-type algorithms instead of prediction-type algorithms. These algorithms kind of seem to be harder; at least, they are currently less used.
This isn't a conclusive reason to definitely expect this. But it's multiple disjunctive lines of plausible reasoning.
So how much does the regulatory issue matter?
One extra regulation here is building codes insisting all houses have kitchens. If people could buy/rent places without kitchens for the appropriate lower price, eating out would make more sense.
Regulation forces people to own/rent kitchens, whether or not they want to use them.
Part of the question is, why isn't there somewhere I can buy school dinner quality food at school dinner prices?
lower the learning rate when the sim is less confident the real world estimation is correct
Adversarial examples can make an image classifier be confidently wrong.
Because it's what humans want AI for, and due to the relationships between the variables, it is possible we will not ever get uncontrollable superintelligence before first building a lot of robots, ICs, collecting revenue, and so on.
You are talking about robots, and a fairly specific narrow "take the screws out" AI.
Quite a few humans seem to want AI for generating anime waifus. And that is also a fairly narrow kind of AI.
Your "log(compute)" term came from a comparison which was just taking more samples. This doesn't sound like an efficient way to use more compute.
Someone, using a pretty crude algorithmic approach, managed to get a little more performance for a lot more compute.
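Best-of-n sampling is the canonical example of that crude approach: the expected best of n standard-normal draws grows only like sqrt(2 ln n), so each multiplication of compute buys a smaller gain than the last. A quick Monte Carlo sketch; the sample counts, trial count, and normal "score" model are arbitrary assumptions:

```python
import random

def best_of_n(n, trials=2000, seed=0):
    """Average best score when compute buys n independent samples per attempt."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.gauss(0.0, 1.0) for _ in range(n))
    return total / trials

# Each 10x of "compute" buys a smaller improvement than the last.
gains = [best_of_n(n) for n in (1, 10, 100)]
```

The score improves with each extra factor of 10 in samples, but by less each time, which is exactly a log(compute)-shaped return curve.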
If we have the technical capacity to get into the red zone, and enough chips to make getting there easy. Then hanging out in the orange zone, coordinating civilization not to make any AI too powerful, when there are huge incentives to ramp the power up, and no one is quite sure where the serious dangers kick in...
That is, at least, an impressive civilization wide balancing act. And one I don't think we have the competence to pull off.
It should not be possible for the ASI to know when the task is real vs sim. (which you can do by having an image generator convert real frames to a descriptor, and then regenerate them so they have the simulation artifacts...)
This is something you want, not a description of how to get it, and it is rather tricky to achieve. That converting-and-then-converting-back trick is useful, but it sure isn't automatic success either. If there are patterns about reality that the ASI understands but the simulator doesn't, then the ASI can use those patterns.
For example, if the ASI understands seasons and the simulator doesn't, then if it's scorching sunshine one day and snow the next, that suggests it's in the simulation. Otherwise, that suggests reality.
And if the simulation knows all patterns that the ASI does, the simulator itself is now worryingly intelligent.
robots are doing repetitive tasks that can be clearly defined.
If the task is maximally repetitive, then the robot can just follow the same path over and over.
If it's nearly that repetitive, the robot still doesn't need to be that smart.
I think you are trying to get a very smart AI to be so tied down and caged up that it can do a task without going rogue. But the task is so simple that current dumb robots can often do it.
For example : "remove the part from the CNC machine and place it on the output table".
Economics test again. Minimum wage workers are easily up to a task like that. But most engineering jobs pay more than minimum wage. Which suggests most engineering in practice requires more skill than that.
I mean, yes, engineers do need to take parts out of the CNC machine. But they also need to be able to fix that CNC machine when a part snaps off inside it and starts getting jammed in the workings. Or to notice that the toolhead is loose, and tighten and recalibrate it. And that sort of thing takes up more time in practice.
The techniques you are describing seem to be next level in fairly dumb automation. The stuff that some places are already doing (like boston dynamics robot dog level hardware and software), but expanded to the whole economy. I agree that you can get a moderate amount of economic growth out of that.
I don't see you talking about any tasks that require superhuman intelligence.
Response to the rest of your post.
By the way, these comment boxes have built in maths support.
Press Ctrl M for full line or Ctrl 4 for inline
You might notice you get better and better at the game until you start using solutions that are not possible in the game, but just exploit glitches in the game engine. If an ASI is doing this, its improvement becomes negative once it hits the edges of the sim and starts training on false information. This is why you need neural sims, as they can continue to learn and add complexity to the sim suite.
Neural sims probably have glitches too. Adversarial examples exist.
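Even a hand-built linear "model" admits adversarial inputs: nudge each input coordinate a small amount against the sign of its weight and the output flips, the core idea behind fast-gradient-sign attacks. The weights, inputs, and epsilon below are invented for illustration:

```python
# A hand-built linear "classifier": weights and inputs are invented numbers.
def score(w, x):
    # Positive score means the classifier calls the input "safe".
    return sum(wi * xi for wi, xi in zip(w, x))

w = [2.0, -3.0, 1.5, 1.0]   # model weights
x = [0.5, 0.1, 0.4, 0.3]    # an input the model confidently calls "safe"

# Fast-gradient-sign-style perturbation: move each coordinate a small fixed
# amount against the sign of its weight. The input barely changes; the label flips.
eps = 0.25
x_adv = [xi - eps if wi > 0 else xi + eps for wi, xi in zip(w, x)]
```

No coordinate moves more than 0.25, yet the classification flips from confidently "safe" to "unsafe". Real neural sims are nonlinear, but the same one-step trick works on them disturbingly often.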
Note the log here : this comes from intuition. In words, the justification is that immediately when a robot does a novel task, there will be lots of mistakes and rapid learning. But then the mistakes take increasingly larger lengths of time and task iterations to find them, it's a logistic growth curve approaching an asymptote for perfect policy.
This sounds iffy. Like you are eyeballing and curve fitting, when this should be something that falls out of a broader world model.
Every now and then, you get a new tool. Suppose your medical bot has 2 kinds of mistakes: ones that instantly kill, and ones that mutate DNA. It quickly learns not to make the first kind. And it slowly learns not to make the second, when its patients die of cancer years later. Except one day it gets a gene sequencer. Now it can detect all those mutations quickly.
I find it interesting that most of this post is talking about the hardware.
Isn't this supposed to be about AI? Are you expecting a regime where
- Most of the world's compute is going into AI.
- Chip production increases by A LOT (at least 10x) within this regime.
- Most of the AI progress in this regime is about throwing more compute at it.
everything in the entire industrial chain you must duplicate or the logistic growth bottlenecks on the weak link.
Everything is automated. Humans are in there for maintenance and recipe improvement.
Ok. And there is our weak link. All our robots are going to be sitting around broken. Because the bottleneck is human repair people.
It is possible to automate things. But what you seem to be describing here is the process of economic growth in general.
Each specific step in each specific process is something that needs automating.
You can't just tell the robot "automate the production of rubber gloves". You need humans to do a lot of work designing a robot that picks out the gloves and puts them on the hand-shaped metal molds so the rubber can cure.
Yes economic growth exists. It's not that fast. It really isn't clear how AI fits into your discussion of robots.
First of all. SORA.
I sensed you were highly skeptical of my "neural sim" variable until 2 days ago.
No. Not really. I wasn't claiming that things like SORA couldn't exist. I am claiming that it's hard to turn them towards the task of engineering a bridge say.
Current SORA is totally useless for this. You ask it for a bridge, and it gives you some random bridge-looking thing over some body of water. SORA isn't doing the calculations to tell if the bridge would actually hold up. But let's say a future, much smarter version of SORA did do the calculations. A human looking at the video wouldn't know what grade of steel SORA was imagining. I mean, existing SORA probably isn't thinking of a particular grade of steel, but this smarter version would have picked a grade and used it as part of its design. But it doesn't tell the human that; the knowledge is hidden in its weights.
Ok, suppose you could get it to show a big pile of detailed architectural plans, and then a bridge, all with super-smart neural modeling that does the calculations. Then you get something that is ideally about as good as looking at the specs of a random real-world bridge. Plenty of random real-world bridges exist, and I presume bridge builders look at their specs. Still not that useful. Each bridge has different geology, budget, height requirements etc.
Ok, well suppose you could start by putting all that information in somehow, and then sampling from designs that fit the existing geology, roads etc.
Then you get several problems.
The first is that this is sampling plausible specs, not good specs. Maybe it shows a few pictures at the end to show the bridge not immediately collapsing. But not immediately collapsing is a low bar for a bridge. If the Super-SORA chose a type of paint that was highly toxic to local fish, it wouldn't tell you. If the bridge had a 10% chance of collapsing, it's randomly sampling a plausible timeline, so 90% of the time it shows you the bridge not collapsing. If it only generates 10 minutes of footage, you don't know what might be going on in its sim while you weren't watching. If it generates 100 years of footage from every possible angle, it's likely to record predictions of any problems, but good luck finding the needle in the haystack. Imagine this AI has just given you 100 years of footage. How do you skim through it without missing stuff?
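The sampling-bias arithmetic is worth spelling out: if a failure shows up in 10% of plausible timelines, the chance that k independently sampled rollouts all look fine is 0.9^k.

```python
# If a failure shows up in 10% of plausible timelines, independently sampled
# rollouts will usually show success. (The 10% figure is from the example above.)
p_fail = 0.10

def chance_all_rollouts_look_fine(k):
    return (1.0 - p_fail) ** k

one_rollout = chance_all_rollouts_look_fine(1)    # 90% chance you see no problem
ten_rollouts = chance_all_rollouts_look_fine(10)  # still ~35% chance you see nothing
```

So even asking for ten independent futures leaves a sizable chance of never once seeing the collapse.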
Another problem is that SORA is sampling in the statistical sense. Suppose you haven't done the geology survey yet. SORA will guess at some plausible rock composition. This could lead to you building half the bridge, and then finding that the real rock composition is different.
You need a system that can tell you "I don't know fact X, go find it out for me".
If the predictions are too good, well the world it's predicting contains Super-SORA. This could lead to all sorts of strange self fulfilling prophecy problems.
OK, so maybe this is a cool new way to look at at certain aspects of GPT ontology... but why this primordial ontological role for the penis? I imagine Freud would have something to say about this. Perhaps I'll run a GPT4 Freud simulacrum and find out (potentially) what.
My guess is that humans tend to use a lot of vague euphemisms when talking about sex and genitalia.
In a lot of contexts, "Are they doing it?" would refer to sex, because humans often prefer to keep some level of plausible deniability.
Which leaves some belief that vagueness implies sexual content.
In more "slow takeoff" scenarios, your approach can probably be used to build something that is fairly useful at moderate intelligence. So for a few years in the middle of the red curve, you can get your factories built for cheap. Then it hits the really steep part, and it all fails.
I think the "slow" and "fast" models only disagree in how much time we spend in the orange zone before we reach the red zone. Is it enough time to actually build the robots?
I assign fairly significant probabilities to both "slow" and "fast" models.
I added the below. I believe most of your objections are simply wrong because this method
If you are mostly learning from imitating humans, and only using a small amount of RL to adjust the policy, that is yet another thing.
I thought you were talking about a design built mainly around RL.
If it's imitating humans, you get a fair bit of safety, but it will be about as smart as humans. It's not trying to win, it's trying to do what we would do.
A neural or hybrid sim. It came from predicting future frames from real robotics data.
Ok. So you take a big neural network, and train it to predict the next camera frame. No Geiger counter in the training data? None in the prediction. Your neural sim may well be keeping track of the radiation levels internally, but it's not saying what they are. If the AI's plan starts by placing buckets over all the cameras, you have no idea how good the rest of the plan is. You are staring at a predicted inside of a bucket.
nothing special, design it like a warehouse.
Except there is something special. There always is. Maybe this substation really better not produce any EMP effects, because sensitive electronics are next door, so the whole building needs a Faraday cage built into the walls. Maybe the location it's being built at is known for its heavy snow, so you had better give it a steep sloping roof. Oh, and you need to leave space here for the cryocooler pipes. Oh, and you can't bring big trucks in round this side, because the fuel refinement facility is already there. Oh, and the company we bought cement from last time has gone bust; find a new company to buy cement from, and make sure it's good quality. Oh, and there might be a population of bats living nearby, so don't use any tools that produce lots of ultrasound.
It cannot desync because the starting state is always the present frame.
Let's say someone spills coffee in a laptop. It breaks. Now to fix it, some parts need replacing. But which parts? That depends on exactly where the coffee dribbled inside it, which is not something that can be predicted. You must handle the uncertainty: test parts to see if they work, look for damage marks.
I think this system as you are describing it now is something that might kind of work. I mean, the first 10 times it will totally screw up. But we are talking about a semi-smart but not that smart AI trained on a huge number of engineering examples. With time it could become mostly pretty competent, with humans patching it every time it screws up.
One problem is that you seem to be working on a "specifications" model. Where people first write flawless specifications, and then build things to those specs. In practice there is a fair bit of adjusting. The specs for the parts, as written beforehand, aren't flawless, at best they are roughly correct. The people actually building the thing are talking to each other, trying things out IRL and adjusting the systems so they actually work together.
"ok I finished the prototype stellarator, you saw every step. Build another, ask for help when needed"
And the AI does exactly the same thing again. Including manufacturing the components that turned out not to be needed, and stuffing them in a cupboard in the corner. Including using the cables that are 2x as thick as needed because the right grade of cable wasn't available the first time.
"Ok, I want a stellarator." You were talking about 1000x labor savings. And deciding which of the many and various fusion designs to work on is more than 0.1% of the task by itself. I mean, you can just pick out of a hat, but that's making things needlessly hard for yourself.
this is constraining your search. You may not be able to find a meaningful improvement over the sota with that constraint in place, regardless of your intelligence level.
I mean, the space of algorithms that can run on an existing chip is pretty huge. Yes, it is a constraint. And it's theoretically possible that the search could return no solutions, if the SOTA was achieved with much better chips, or was near optimal already, or the agent doing the search wasn't much smarter than us.
For example, there are techniques that decompose a matrix into its largest eigenvectors. Which works great without needing sparse hardware.
Same idea though. I don't see why "the military" can't do recursion using their own AIs and use custom hardware to outcompete any "rogues".
One of the deep fundamental reasons here is alignment failures. Either the "military" isn't trying very hard, because humans know they haven't solved alignment: they know they can't build a functional "military" AI, and all they can do is make another rogue AI. Or the humans don't know that, and the military AI is itself another rogue AI.
For this military AI to be fighting other AI's on behalf of humans, a lot of alignment work has to go right.
The second deep reason is that recursive self-improvement is a strong positive feedback loop. It isn't clear how strong, but it could be very strong. So suppose the first AI undergoes a recursive-improvement FOOM, and it happens that the rogue AI gets there before any military. Perhaps because the creators of the military AI are taking their time to check the alignment theory.
Positive feedback loops tend to amplify small differences.
Also, about all those hardware differences. A smart AI might well come up with a design that efficiently uses old hardware. Oh, and this is all playing out in the future, not now. Maybe the custom AI hardware is everywhere by the time this is happening.
I suspect if AI is anything like computer graphics there will be at least 5-10 paradigm shifts to new architectures that need updated hardware to run, obsoleting everything deployed, before settling on something that is optimal. Flops are not actually fungible, and Turing complete doesn't mean your training run will complete this century.
This is with humans doing the research. Humans invent new algorithms more slowly than new chips are made. So it makes sense to adjust the algorithm to the chip. If the AI can do software research far faster than any human, adjusting the software to the hardware (an approach that humans use a lot throughout most of computing) becomes an even better idea.
Ok, to drill down: the AI is a large transformer-architecture control model. It was initially trained by converting human and robotic actions to a common token representation that is perspective independent and robotic-actuator independent. (For example, "soft top grab, bottom insertion to target" might be a string expansion of the tokens.)
That is rather different from the architecture I thought you were talking about. But ok. I can roll with that.
You then train via reinforcement learning on a simulation of the task environments for task effectiveness.
You are assuming a simulation as given. Where did this simulation come from? What happens when the simulation gets out of sync with reality?
But OK, I will grant that you have somehow built a flawless simulation. Let's say you found a hypercomputer and coded quantum mechanics into it.
So now we have the question: how do the tokens match up with the simulation? Those tokens are "actuator independent". (A silly concept; sometimes the approach will depend A LOT on exactly what kind of actuators you are using. Some actuators must set up a complex system of levers and winches, while a stronger actuator can just pick up the heavy object. Some actuators can pick up hot stuff; others must use tongs. Some can fit in cramped spaces; others must remove other components in order to reach.)
We need raw motor commands, both in reality and in the quantum simulation. So let's also grant you a magic oracle that takes in your common tokens and turns them into raw motor commands. So when you say "pick up this component, and put it here", it's the oracle that determines whether the sensitive component is slammed down at high speed, or whether something else is disturbed as you reach over. Let's assume it makes good decisions here somehow.
or the simulation environment during the RL stages rewarded such actions.
Yes. That. Now, the problems you get when doing end-to-end RL are different from when doing RL over each task separately. If you get a human to break something down into many small easy tasks, then you get local Goodharting. Like using explosives to move things, because the task was to move object A to position B, not to move it without damaging it.
If you do RL training over the whole thing, i.e. reinforcing on fusion happening in the fusion reactor example, then you get a plan that actually causes fusion to happen. This doesn't involve randomly blowing stuff up to move things. This long-range optimization has fewer random industrial-accident stupidities, and more deep AI problems.
For example, if the machine has seen and practiced oiling and inserting 100 kinds of bolt, a new bolt whose properties fall somewhere between the extremes it has already handled will likely work zero-shot.
Imagine you had a machine that could instantly oil and insert any kind of bolt. Now make a fusion reactor with 1000x less labour. Oh wait, the list of things that people designing fusion reactors spend >0.1% of their time on is pretty long and complicated.
What's more, we can use the economics test. Oiling and inserting bolts isn't something that takes a PhD in nuclear physics, yet a lot of the people designing fusion reactors do have a PhD in nuclear physics.
For supervision you have a simple metric : you query a lockstep sim each frame for the confidence and probability distribution of outcomes expected on the next frame.
I will grant you that you somehow manage to keep the simulation in lockstep with reality.
Then the difficult bit is keeping the sim in lockstep with what you actually want. Say the fastest maintenance procedure that the AI finds involves breaking open the vacuum chamber. It happens that this will act as a vacuum cannon, firing a small nut at bullet-like speeds out the window. To the AI that is only being reinforced on [does reactor work] and [does it leak radiation], firing nuts at high speed out the window is the most efficient action. The simulated nut flies out the simulated window in exactly the same way the real one does.
A human just reading the list of actions would see "open vacuum valve 6" and not be easily able to deduce that a nut would fly out the window.
You also obviously must at first take precautions: operate in human-free environments separated by lexan shields, and, well, it's industry. A few casualties are normal, and humanity can frankly take a few workers killed if the task domain was riskier with humans doing it.
Ok. So setting all that up is going to take way more than 0.1% of the worker time. Someone has to build all those shields and put them in place.
Real human workers can and do order custom components from various other manufacturers. This doesn't fit well with your simulation, or with your safety protocol.
But if you are only interested in the "big" harms. How about if the AI decides that the easiest way to make a fusion reactor is to first make self replicating nanotech. Some of this gets out and grey goo's earth.
Or the AI decides to get some computer chips, and code a second AI. The second AI breaks out and does whatever.
Or, what was the goal for that fusion bot again? Make the fusion work; don't release radioactive stuff off premises. Couldn't it detonate a pure fusion bomb? No radioactive stuff leaving, only very hot helium.
Human grad students also make the kind of errors you mention, over torque is a common issue.
Recognizing and fixing mistakes is fairly common work in high tech industries. It's not clear how the AI does this. But those are mistakes. What I was talking about was if the AI knew full well it was doing damage, but didn't care.
I would expect you would first have proven your robotics platform and stack with hundreds of millions of robots on easier tasks before you can deploy to domains with high vacuum chamber labs.
You were the one who used a fusion reactor as an example.
So you're saying the robots can only build a fusion reactor after they have started by building millions of easier things as training?
Would this AI you are thinking of be given a task like "build a fusion reactor" and be left to decide for itself whether a stellarator or laser confinement system was better?
As far as I know with LLM experiments, there are tweaks to architecture but the main determinant for benchmark performance is model+data scale (which are interdependent), and non transformer architectures seem to show similar emergent properties.
So within the rather limited subspace of LLM architectures, all architectures are about the same.
I.e. once you ignore the huge space of architectures that just ignore the data and squander compute, architecture doesn't matter. We have one broad family of techniques (gradient descent, text prediction, etc.), and anything in that family is about equally good, while anything outside it basically doesn't work at all.
This looks to me to be fairly strong evidence that you can't get a large improvement in performance by randomly bumbling around with small architecture tweaks to existing models.
Does this say anything about whether a fundamentally different approach might do better? No. We can't tell that from this evidence. Although, looking at the human brain, we can see it seems to be more data efficient than LLMs. And we know that in theory models could be much more data efficient. Addition is very simple; Solomonoff induction would have it as a major hypothesis after seeing only a couple of examples. But GPT-2 saw loads of arithmetic in training and still couldn't reliably do it.
So I think LLM architectures form a flat-bottomed local semi-minimum (minimal in at least most dimensions). It's hard to get big improvements just by tweaking it (we are applying enough grad student descent to ensure that), but it's nowhere near globally optimal.
Suppose everything is really data bottlenecked, and the slower AI has a more data efficient algorithm. Or maybe the slower AI knows how to make synthetic data, and the human trained AI doesn't.
Suppose you give the AI a short duration discrete task. Pick up this box and move it over there. The AI chooses to detonate a nearby explosive, sending everything in the lab flying wildly all over the place. And indeed, the remains of the box are mostly over there.
Ok. Maybe you give it another task. Unscrew a stuck bolt. The robot gets a big crowbar and levers the bolt. The thing it's pushing against for leverage is a vacuum chamber. Its slightly deformed from the force, causing it to leak.
Or maybe it sprays some chemical on the bolt, which dissolves it. And in a later step, something else reacts with the residue, creating a toxic gas.
I think you need to micromanage the AI. To specify every possible thing in a lot of detail. I don't think you get a 10x labor saving. I am unconvinced you get any labor saving at all.
After all, to do the task yourself, you just need to find 1 sane plan. But to stop the AI from screwing up, you need to rule out every possible insane plan. Or at least repeatedly read the AI's plan, spot that it's insane, and tell it not to use explosives to mix paint.
(Because the "military" AIs working with humans will have this kind of hardware to hunt them down with)
You need to make a lot of extra assumptions about the world for this reasoning to work.
These "military" AIs need to exist. And they need to be reasonably loosely chained. If their safety rules are so strict they can't do anything, then they can't do anything, however fast they don't do it. They need to be actively trying to do their job, as opposed to playing along for the humans but not really caring. They need to be smart enough. If the escaped AI uses some trick that the "military" AI just can't comprehend, then it fails to comprehend again and again, very fast.
In the limit of pushing all the work onto humans, you just have humans building a fusion reactor.
Which is a sensible plan, but is not AI.
If you have a particular list in mind for what you consider dangerous, I suspect your "red teaming" approach might catch it.
Like I think that, in this causal graph setup, it's not too hard to stop excess radiation leaking out, if you realize that radiation is a danger and work to stop it.
This doesn't give you a defence against the threats you didn't imagine and the threats you can't imagine.
One fairly obvious failure mode is that it has no checks on the other outputs.
So from my understanding, the AI is optimizing its actions to produce a machine that outputs electricity and helium. Why does it produce a fusion reactor, not a battery and a leaking balloon?
A fusion reactor will in practice leak some amount of radiation into the environment. This could be a small negligible amount, or a large dangerous amount.
If the human knows about radiation and thinks of this, they can put a max radiation leaked into the goal. But this is pushing the work onto the humans.
From my understanding of your proposal, the AI is only thinking about a small part of the world. Say a warehouse that contains some robotic construction equipment, and that you hope will soon contain a fusion reactor, and that doesn't contain any humans.
The AI isn't predicting the consequences of its actions over all space and time.
Thus the AI won't care if humans outside the warehouse die of radiation poisoning, because it's not imagining anything outside the warehouse.
So, you included radiation levels in your goal. Did you include toxic chemicals? Waste heat? Electromagnetic effects from those big electromagnets that could mess with all sorts of electronics? Bioweapons leaking out? I mean, if it's designing a fusion reactor and any bio-nasties are being made, something has gone wrong. What about nanobots? Self-replicating nanotech sure would be useful to construct the fusion reactor. Does the AI care if an odd nanobot slips out and grey-goos the world? What about other AI? Does your AI care if it makes a "maximize fusion reactors" AI that fills the universe with fusion reactors?
I think you are completely overlooking a significant chunk of impact. Suppose that technologies A and B are similar. The techs act as substitutes, say several different designs of engine or something. And if everyone is using tech X, the accumulated experience makes X the better choice. This hands long-term control of which path tech goes down to whoever got there first. Could electric cars have taken off before petrol if someone else had led that parade?
There are plenty of substances that increase fuel octane, so if someone else had led the parade around a substance that didn't contain lead, a lot of brain damage could have been prevented.
If some non-military group had led nuclear energy, would reactors use thorium instead of uranium?
When I try to think of a utopian future, the people in that world understand the concept of such non-innocent things, but correctly have a very low prior probability on them, in their typical interactions between each other.
Think of dath ilian keepers. Of course they understand the concept of malicious deception, and of course they don't expect it from each other.
Part of this is just "have a good prior about the likelihood of deception/innuendo etc" with having a prior of 0 as a defense against an overly high prior.
Partly, these could be considered low grade infohazards. Things it is unpleasant but sometimes useful to know.
And for such things, there are 2 good approaches. Fortifying your mind until you can deal with them without strain. Or hiding in a protected bubble where you don't need to worry about such things. The second strategy relies on someone else being competent and powerful enough to protect you.
Quantum many worlds
How efficient could walking be with effective regenerative braking in each step?
Yes. I was assuming a standard conditional for the button.
I can't currently see any flaws with the CDT style, other than:
- Subagents believe in a world where buttons magically press themselves. So this design can't make coherent statements about the probability that the button will be pressed (one AI believes it's 1, the other that it's 0).
- These AIs have no incentive to give humans access to the button. To the AIs, they have a magic button that might or might not magically press itself. The AIs have a lot of utility bet on that button. Is that button going to end up in a high-security vault, surrounded by sensors and no humans? Both AIs would like that very much. The AIs have zero concern about humans pressing the button, but lots of concern about humans hiding the button. This design really wants to know if the button magically presses itself. Humans could cut the wires, could stand between the button and the camera, etc.
Suppose these 3 scenarios are judged as equally likely.
A) Button pressed. U_1=U_2=0
B) Button not pressed. U_1=U_2=0
C) AI gets control of button. U_1=6, U_2=-6
Suppose both agents agree that in option C, they should press the button. (this is the only 1 bit of influence they have in this toy problem)
Then agent 1 has a "certainty" of B. E(U_1|not pressed)=E(U_1|B)=0
Agent 2 has remaining uncertainty E(U_2|pressed)=1/2(E(U_2|A)+E(U_2|C))=1/2(0+-6)=-3
Change it around, make the AI decide not to press the button in case C, and
E(U_1|not pressed)=3, E(U_2|pressed)=0.
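To make the arithmetic explicit, here is a minimal Python sketch of the same conditional expectations (the scenarios and utilities are just the toy numbers above):

```python
# Three equally likely scenarios. In A the button is pressed, in B it is not,
# and in C the AI controls the button and chooses what happens to it.
scenarios = {
    "A": {"pressed": True,  "u1": 0, "u2": 0},
    "B": {"pressed": False, "u1": 0, "u2": 0},
    "C": {"pressed": True,  "u1": 6, "u2": -6},  # AI chooses to press in C
}

def cond_exp(scs, key, pressed):
    """Expected utility conditional on the button's final state."""
    vals = [s[key] for s in scs.values() if s["pressed"] == pressed]
    return sum(vals) / len(vals)

# If the agent presses the button in scenario C:
print(cond_exp(scenarios, "u1", pressed=False))  # E(U_1 | not pressed) = 0.0
print(cond_exp(scenarios, "u2", pressed=True))   # E(U_2 | pressed) = -3.0

# If instead it leaves the button unpressed in C:
scenarios["C"] = {"pressed": False, "u1": 6, "u2": -6}
print(cond_exp(scenarios, "u1", pressed=False))  # E(U_1 | not pressed) = 3.0
print(cond_exp(scenarios, "u2", pressed=True))   # E(U_2 | pressed) = 0.0
```

Either way, the one bit of influence over case C moves one subagent's conditional expectation up without moving the other's down, which is the asymmetry being pointed at.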
Third claim: because both subagents act-as-though the button state is not causally downstream of anything else in the universe, the agent will not attempt to manipulate the button state.
I disagree.
Suppose there is a paperclip lottery, if the AI wins, it gets loads of clips, and it becomes harder to shut down.
The AI has been entered by someone else, not playing isn't an option.
Both subagents imagine a plan to make sure that, if they win, the button isn't pressed.
To the paperclip AI, this is a good deal. The conditional expectation acts as a sort of average of all scenarios where the button isn't pressed. And this scenario has more clips than most, so increases the average.
For the shutdown AI, this is an unusually bad scenario, and taking it out of their expectation also increases conditional expected utility.
So both subagents agree on a plan that conditionally manipulates the button.
You are smuggling your conclusion in with slight technical choices of switchy vs sticky.
If we make the process Markovian, i.e. the probability of getting heads depends only on whether the previous flip was heads, then this disappears.
If we make switchy or sticky strongest after a long sequence of switches, this disappears.
You need to justify why switchy/sticky processes should use these switchy/sticky probabilities.
Toy example. Suppose every person wanted lots of tasty food for themselves. No one cares in the slightest about other people starving.
In this scenario, everyone is a paperclipper with respect to everyone else, and yet we can all agree that it's a good idea to build a "feed everyone AI".
Sometimes you don't need your values to be in control, you just need them to be included.
The connection to features is that if the answer is no, there is no possible way the network could have arbitrary XOR combos of features that are linearly represented. It must be representing only some small subset of them (probably the XORs of 2 or 3 features, but not 100 features).
Also, your maths description of the question matches what I was trying to express.
This set contains exponentially many points. Is there any function such that all exponentially many XOR combos can be found by a linear probe?
This is a question of pure maths, it involves no neural networks. And I think it would be highly informative.
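The smallest instance of the phenomenon can be checked directly in NumPy: with just two binary features, no linear probe on the raw bits recovers their XOR, but one does once the representation includes a product feature. This is only a toy illustration of how probe-ability depends on the choice of representation, not an answer to the exponential-combos question:

```python
import numpy as np

# All four points of {0,1}^2 and their XOR labels.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

def best_linear_probe(feats, y):
    # Least-squares linear probe with a bias term.
    A = np.hstack([feats, np.ones((len(feats), 1))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ w

# On the raw bits, the best linear probe is the constant 0.5: it fails.
pred_raw = best_linear_probe(X, y)
print(np.allclose(pred_raw, y))  # False

# Add the product feature x1*x2, and XOR = x1 + x2 - 2*x1*x2 is exactly linear.
X_ext = np.hstack([X, (X[:, 0] * X[:, 1]).reshape(-1, 1)])
pred_ext = best_linear_probe(X_ext, y)
print(np.allclose(pred_ext, y))  # True
```

The open question is whether some fixed, reasonably sized embedding can make all exponentially many parities simultaneously linear, which this toy does not settle.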
You are making the structure of time into a fundamental part of your agent design, not a contingency of physics.
Let an aput be an input or an output. Let a policy be a subset of possible aputs. Some policies are physically valid.
I.e. a policy must have the property that, for each input, there is a single output. If the computer is reversible, the policy must be a bijection from inputs to outputs. If the computer can create a contradiction internally, stopping the timeline, then a policy must be a map from inputs to at most one output.
If the agent is actually split into several pieces with lightspeed and bandwidth limits, then the policy mustn't use info it can't have.
But these physical details don't matter.
The agent has some set of physically valid policies, and it must pick one.
As a human mind, I have a built in default system of beliefs. That system is a crude "sounds plausible" intuition. This mostly works pretty well, but it isn't perfect.
This crude system heard about probability theory, and assigned it a "seems true" marker. The background system, as used before learning probability theory, kind of roughly approximates part of probability theory. But it's not a system that produces explicit numbers.
So I can't assign a probability to Bayesianism being true, because the part of my mind that decided it was true isn't using explicit probabilities, just feelings.
bug secretions must be good, actually, or at least they can be good!”
Honey?
Suppose Bob is a baker who has made some bread. He can give the bread to Alice, or bin it.
By the ROSE value, Alice should pay $0.01 to Bob for the bread.
How is an honest baker supposed to make a profit like that?
But suppose, before the bread is baked, Bob phones Alice.
"Well the ingredients cost me $1" he says, "how much do you want the bread?"
If Alice knows pre-baking that she will definitely want bread, she would commit to paying $1.01 for it (if she valued the bread at at least that much). If Alice has a 50% chance of wanting bread, she could pay $1.01 with certainty, or equivalently pay $2.02 in the cases where she does want the bread. The latter makes sense if Alice only pays in cash and will only drive into town if she does want the bread.
If Alice has some chance of really wanting bread, and some chance of only slightly wanting bread, it's even more complicated. The average bill across all worlds is $1.01, but each alternate version of Alice wants to pay less than that.
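The equivalence between the two payment schemes is simple arithmetic; a minimal sketch, using the 50% case and the made-up dollar figures from the example:

```python
# Bob's costs and Alice's uncertainty, as in the bread example.
cost = 1.00      # Bob's ingredient cost
margin = 0.01    # Bob's profit
p_want = 0.5     # chance Alice turns out to want bread

pay_always = cost + margin              # pay $1.01 whether or not she wants it
pay_if_want = (cost + margin) / p_want  # pay only when she wants it: $2.02

# Both schemes give Bob the same expected revenue across worlds.
print(round(pay_always, 2))            # 1.01
print(round(p_want * pay_if_want, 2))  # 1.01
```

The mixed case (some chance of really wanting bread, some of only slightly wanting it) is the same averaging, but each version of Alice then wants to shift the bill onto her counterparts.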
Personally I think both SSA and SIA are wrong.
Another dumb Alignment idea.
Any one crude heuristic will be Goodharted, but what about a pile of crude heuristics?
A bunch of humans have say, 1 week in a box to write a crude heuristic for a human value function (bounded on [0,1] )
Before they start, an AI is switched on, given a bunch of info, and asked to predict a probability distribution over what the humans write.
Then an AI maximizes the average over that distribution.
The humans in the box know the whole plan. They can do things like flip a quantum coin, and use that to decide which part of their value function they write down.
Do all the mistakes cancel out? Is it too hard to Goodhart all the heuristics in a way that's still bad? Can we write any small part of our utility function?
I would expect that more notable events would tend to get more sim time.
It might or might not be hard to sim one person without surrounding social context.
(I.e. maybe humans interact in such complicated ways that it's easiest to just sim all 8 billion.)
But the main point is that you are still extremely special, compared to a random member of a 10^50 person galactic civilization.
You aren't maximally special, but you are still bloomin' special.
You aren't looking at just how tiny our current world is on the scale of a billion dyson spheres.
If we scale up the resource scales from here to K3 without changing the distribution of things people are interested in, then everything anyone has bothered to say or think ever would get (I think at least 10) orders of magnitude more compute than is needed to simulate our civilization up to this point.
"Particularly interesting" in a sense in which all humans currently on earth (or in our history) are unusually interesting. It's that compared to the scale of the universe, simulating pre singularity history doesn't take much.
I don't know the amount of compute needed, but I strongly suspect it's <1 in 10^20 of the compute that fits in our universe.
In a world of 10^50 humans in a galaxy spanning empire, you are interesting just for being so early.
Ok, I will grant you the "simulations run slower/ with more energy" so are less common argument as approximately true.
(I think there are big caveats to that, and I think it would be possible to run a realistic sim of you for less than your metabolic power use of ~100 watts. And of course, giving you your exact experiences without cheating requires a whole universe of stars lit up, just so you can see some dots in an astronomy magazine.)
Imagine a universe with one early earth, and 10^50 minds in an intergalactic civilization, including a million simulations of early earth (amongst a billion sims of other stuff).
In this universe it is true both that most beings are in underlying reality, and that we are likely in a simulation.
This relies on us being unusually interesting to potential simulators.
If we pretend things like AI x-risk aren't a thing for a moment.
Then the balancing for the large utility for long termism is the small chance that we see ourselves being early.
Taking this as anthropic evidence that x-risk is high seems wrong.
I think the long termism is, to a large extent, held up on the strong evidence that we are at the beginning. If I found out that 50% of people are living in ancestor simulations, the case for long termism would weaken a lot; we are probably in yet another sim.
There are a couple of potential solutions here.
One solution is computational personhood.
The human mind contains about 10^12 or whatever bits. So there are at most 2^10^12 minds we would recognize as human. If you think that simulating the exact same mind 10 times is no better than simulating it once, and you deny the moral relevance of vast incomprehensible transhuman minds, then you have some finite bounds on your utility. Some finite set of things that might or might not be simulated. This lets you just deny that 3^^^^3 distinct humans exist in the platonic space of possible minds.
The other solution is Solomonoff reality fluid. Reality has some magic reality fluid that determines how real something is. A bigger universe doesn't get more realness; it just spreads the same realness out more widely.
When you see a quantum coin get tossed, you split into 2 versions, but each of those versions is half as real. This removes any incentive to delay pleasurable experiences until after you see a quantum coin flip.
I.e. otherwise, eating an ice cream and then seeing 100 digits of quantum randomness would mean experiencing that ice cream once, while seeing the randomness first would mean the universe splitting into 10^100 versions of you, each of which enjoys their own ice cream. So unless you feel compelled to read pages of quantum random numbers before doing anything fun, you must be splitting up your realness between the quantum many-worlds.
If you don't split the realness in your probability distribution, you are constantly surprised by how little quantum randomness you see. I.e. suppose there is a 1-in-100 chance of me putting 50 digits of quantum randomness into this post, and you see that I don't. If you consider all the branch-worlds equally real, the randomness worlds carry a combined weight of 1% × 10^50 = 10^48, so your surprise at not seeing that randomness should be about 1 in 10^48.
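As a sanity check on that arithmetic, here is the odds calculation under unweighted branch counting (the 1% prior and the 10^50 branch count are the made-up numbers from this example):

```python
# Prior chance the post includes 50 digits of quantum randomness,
# and the branch counts each outcome would produce.
p_randomness = 0.01
branches_if_random = 10 ** 50  # one branch per 50-digit string
branches_if_not = 1

# Relative weight of "no randomness seen", counting every branch as equally
# real instead of weighting branches by quantum measure.
odds = ((1 - p_randomness) * branches_if_not
        / (p_randomness * branches_if_random))
print(odds < 1e-47)  # True: seeing no randomness is ~10**-48 surprising
```

Weighting branches by measure instead makes the observation unsurprising, which is the point of splitting the realness.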
Now probability distribution realness doesn't have to be the same as moral realness. There are consistent models of philosophy where these are different. But it actually works out fine if those are the same.
So if we live in a universe with a vast number of people, that universe has to split its realness among the people in it. I.e. if there are 3^^3 people, most of them get < 1/3^^3 measure, making them almost entirely imaginary.
I think both are wrong.
I'm in favor of the complexity location hypothesis.
A hypothesis needs to describe the universe, and point you out within it, and it uses occam's razor for both.
This means you should assign a high probability to finding yourself in a special position, i.e. one that is easy to describe.
If the hypotheses are 1 red shirt, or 1 red and 3^^3 blue shirts, then observing a red shirt is modest evidence towards the former. And if you find yourself in the latter world, your probability of being the red shirt is determined by the length of "you're the red shirt", not by the number of blue shirts. (Although if there weren't many blue shirts, "you're on the left" or "you have the biggest feet" might also locate you, giving higher total probability.)
An "everyone gets a share" system has the downside that if 0.1% of people want X to exist, and 95% of people strongly want X not to exist, then the 0.1% can make X in their share.
Where X might be torturing copies of a controversial political figure. Or violent video games with arguably sentient AI opponents getting killed.
Also, I think you are passing the buck a lot here. Instead of deciding what to do with the universe, you now need to decide how to massively upgrade a bunch of humans into the sort of beings who can decide that.
Also, some people just dislike responsibility.
And the modifications needed to make a person remotely trustworthy to that level are likely substantial. Perhaps. How much do you need to overwrite everyone's mind with an FAI? I don't know.
There are practical anti-Occam calculations. Start uniform over all bitstrings, and every time you find a short program that produces a bitstring, turn the probability of that bitstring down.
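A minimal sketch of that procedure, using run-length-encoding length as a hypothetical stand-in for "shortest program length" (real Kolmogorov complexity is uncomputable, so any practical version needs some such proxy):

```python
from itertools import product, groupby

def toy_program_length(s):
    # Hypothetical proxy for shortest-program length:
    # the length of a run-length encoding (one symbol + one count per run).
    runs = [(ch, len(list(g))) for ch, g in groupby(s)]
    return 2 * len(runs)

n = 8
strings = ["".join(bits) for bits in product("01", repeat=n)]

# Start uniform, then turn DOWN the probability of compressible strings:
# weight each string by 2**(toy complexity), the reverse of a Solomonoff prior.
weights = {s: 2.0 ** toy_program_length(s) for s in strings}
Z = sum(weights.values())
anti_occam = {s: w / Z for s, w in weights.items()}

# The alternating string (8 runs) now outweighs the constant string (1 run).
print(anti_occam["01010101"] > anti_occam["00000000"])  # True
```

Under this prior, the strings with short programs end up with exponentially less probability than the incompressible ones, which is exactly the anti-Occam direction.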