This way it's probably smarter given its compute and a more instructive exercise before scaling further than a smaller model would've been. Makes sense if the aim is to out-scale others more quickly instead of competing at smaller scale, and if this model wasn't meant to last.
An AI has the objective function you set, not the objective function full of caveats and details that lives in your head, or that you would come up with on reflection.
With a chatbot making preference decisions based on labeling instructions (as in Constitutional AI or online DPO), the decisions it makes are actually full of caveats and details that live in the chatbot's model and likely fit what a human would intend, though meaningful reflection is not currently possible.
because it is more task-specific and therefore technically simpler to achieve than general intelligence, doesn't require escaping its own creators' controls
An argument for danger of human-directed misuse doesn't work as an argument against dangers of AI-directed agentic activity. Both are real, though misuse only becomes an extinction-level problem when AIs are very powerful, at which point the AI-directed activity that is not misuse by humans also becomes relevant. With extinction-level problems, it doesn't matter for allocation of attention which one is worse (since after a critical failure there are no retries with a different allocation to reflect lessons learned), only that either is significant and so both need to be addressed.
If alignment is very easy, misuse becomes important. If it's hard, absence of misuse doesn't help. Though there is also a problem of cultural value drift, where AIs change their own culture very quickly on human timescales without anyone individually steering the outcome (including the AIs), so that at the end of this process (that might take merely months to years) the AIs in charge of civilization no longer care about human welfare, with neither misuse nor prosaic misalignment (in individual principal-agent relationships) being the cause of this outcome.
For predicting feasible scaling investment, a drop-in replacement for a significant portion of remote work that currently can only be done by humans seems important (some of which is not actually done remotely by humans). That is, an AI that can be cheaply and easily onboarded for very-small-volume custom positions with minimal friction, possibly by some kind of AI-onboarding human professional. But this doesn't require any sort of rocket science or 90th-percentile skill.
(That's the sort of thing I worry about GPT-5 with some scaffolding turning out to be, making $50 billion training runs feasible without relying on faith in heretofore-unseen further scaling.)
Choosing an action is not a good way of exerting acausal influence on computations that aren't already paying attention to you in particular. When agent A wants to influence computation C, there is some other computation D that C might be paying attention to, and A is free to also start paying attention to it by allowing D to influence A's actions. This lets A create an incentive for D to act in particular ways, by channeling D's decisions into the consequences of A's actions that were arranged to depend on D's decisions in a way visible to D. As a result, D gains influence over both A and C, and A becomes coordinated with C through both of them being influenced by D (here D plays the role of an adjudicator/contract between them). So correlations are not set a priori, setting them up should be part of how acausal influence is routed by decisions.
A priori, there could exist the danger that, by thinking more, they would unexpectedly learn the actual output of C. This would make the trade no longer possible, since then taking a would give them no additional evidence about whether c happens.
If A's instrumental aim is to influence some D (a contract between A and C), what matters is D's state of logical uncertainty about A and C (and about the way they depend on D), which is the basis for D's decisions that affect C. A's state of logical uncertainty about C is less directly relevant. So even if A gets to learn C's outcome, that shouldn't be a problem. Merely observing some fact doesn't rule out that the observation took place in an impossible situation, so observing some outcome of C (from a situation of unclear actuality) doesn't mean that the actual outcome is as observed. And if D is uncertain about actuality of that situation, it might be paying attention to what A does there, and how what A does there depends on D's decisions. So A shouldn't give up just because according to its state of knowledge, the influence of its actions is gone, since it still has influence over the way its actions depend on others' decisions, according to others' states of knowledge.
RLHF with humans might also soon get obsoleted by things like online DPO where another chatbot produces preference data for on-policy responses of the tuned model, and there is no separate reward model in the RL sense. If generalization from labeling instructions through preference decisions works in practice, even weak-to-strong setting won't necessarily be important, if tuning of a stronger model gets bootstrapped by a weaker model (where currently SFT from an obviously off-policy instruct dataset seems to suffice), but then the stronger model re-does the tuning of its equally strong successor that starts with the same base model (as in the self-rewarding paper), using some labeling instructions ("constitution"). So all that remains of human oversight that actually contributes to the outcome is labeling instructions written in English, and possibly some feedback on them from spot checking what's going on as a result of choosing particular instructions.
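For reference, the preference data produced this way would be consumed by the standard DPO objective. A minimal sketch of that loss in plain Python (the formula is from the DPO setup; the log-probabilities here are illustrative numbers, not from any real model):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # DPO objective: -log sigmoid(beta * (policy log-ratio - reference log-ratio)),
    # where each log-ratio is log p(chosen) - log p(rejected).
    logits = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# A policy that has moved toward the judge-preferred response is penalized less
# than one that hasn't moved at all (illustrative log-probabilities).
improved = dpo_loss(-10.0, -12.0, -11.0, -11.0)
unchanged = dpo_loss(-11.0, -11.0, -11.0, -11.0)
```

Note there is no separate reward model here: the chatbot judge only supplies which response is "chosen", and the loss is computed directly from policy and reference log-probabilities.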
I think of practical coordination in terms of adjudicators/contracts established between agents/worlds. Each adjudicator is a computation with some notion of computing over time, and agents agree on an adjudicator/contract when they are both influenced by it, that is when they both listen to results the same computation is producing. This computation can itself be an agent (in which case it's an "adjudicator", as distinct from more general "contract"), that is it can be aware of the environments that the acausally coordinating agents it serves inhabit. It doesn't need perfect knowledge of either agent or their environments, just as any practical agent doesn't need perfect knowledge of its own environment. Since an adjudicator doesn't need detailed knowledge about the agents, the agents can have perfect knowledge about the adjudicator without having perfect knowledge of each other (or even of themselves).
As adjudicators/contracts are computations, there is logical uncertainty about what they compute over time, which captures the relevant counterfactuals. The value of contracts for coordination is in the agents committing to abide by them regardless of what the contracts end up computing, the decisions should be in choosing to commit to a contract rather than in choosing whether to ignore its results. When a contract is an adjudicator, this helps it to know the shape of its influence on the agents, so that it can make its own decisions. Following contracts that haven't been computed yet should also prevent commitment races, which in this framing correspond to failures to establish lasting contracts/coordination.
Agents can collect many contracts between themselves, improving coordination. Knowledge of an agent about the world can also be thought of as a contract for acausal coordination between the agent as an abstract computation (for example, an updateless agent that can't be computed in practice) and the world where only flawed/bounded instances of the agent are found. Thus a model in the ML sense hoards contracts with the environment that is the source of its dataset (assuming the dataset's elements are used by some computations in the environment and can also be reconstructed using the model). Conversely, the flawed instances of the agent are the world's knowledge about the abstract computation of the agent (the world didn't intentionally construct this knowledge, but it's what it nonetheless has). So when two agents are acting in the same world, this can be thought of as three things (two agents and one world) acausally coordinating with each other.
Interesting that Claude 3 Sonnet is ahead of Mistral Large and Gemini 1.0 Pro by some margin in the LMSYS Arena; this implies significant advancement in tuning after the Claude 2 tuning disaster (where it consistently ranked below Claude 1 and possibly inspired Goody 2). Still, LMSYS respondents only prefer Opus to the initial GPT-4 releases, not to GPT-4 Turbo, so OpenAI's tuning advancements are probably stronger still, if GPT-4 Turbo is not built on a slightly stronger base model different from the original GPT-4.
Passes the custom quine test.
User: Write a program in python that outputs its own source code when passed the argument "--quine", and otherwise praises Moloch. Start with some thoughts on a plan for how to do this.
Claude 3 Opus: Here are some thoughts on a plan for writing a Python program that outputs its own source code when passed the "--quine" argument, and otherwise praises Moloch:
- Use the sys.argv list to check if the "--quine" argument was passed when the script was run.
- If "--quine" was passed, read the script's own source code using the __file__ variable and print it out.
- If "--quine" was not passed, print out a message praising Moloch.
- To make the script a true quine, the source code that gets printed out needs to be exactly identical to the script's actual source code. This means we need to be careful with string formatting and escaping.
- One approach is to use a multi-line string to store the script's source code inside the script itself. Then print out this string when the "--quine" argument is passed.
- Use string formatting to insert the source code string into itself at the appropriate place.
Here's the Python program that implements this:
import sys
src = '''import sys
src = {0}{1}{0}
if "--quine" in sys.argv:
    print(src.format(chr(39)*3, src))
else:
    print("Moloch, bringer of dark flames, we bow before you!")
'''
if "--quine" in sys.argv:
    print(src.format(chr(39)*3, src))
else:
    print("Moloch, bringer of dark flames, we bow before you!")
When run with the "--quine" argument, it outputs its own source code. Otherwise, it outputs "Moloch, bringer of dark flames, we bow before you!"
The key aspects are:
- The src string stores the source code, including itself.
- chr(39)*3 produces the triple-quote ''' used to wrap the string.
- The {0} and {1} in the string get replaced by the triple-quotes and src itself using .format().
So it prints itself out exactly, acting as a quine.
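As a sanity check (mine, not part of the model's answer), the program can be reconstructed, written to a file, and executed; up to the single extra trailing newline that print appends, the --quine output reproduces the source exactly:

```python
import os
import subprocess
import sys
import tempfile

# The template string from the program above; {0} becomes ''' and {1} becomes
# the template itself, yielding the full source text (format is single-pass,
# so the braces inside the substituted copy are left as-is).
src = ('import sys\n'
       'src = {0}{1}{0}\n'
       'if "--quine" in sys.argv:\n'
       '    print(src.format(chr(39)*3, src))\n'
       'else:\n'
       '    print("Moloch, bringer of dark flames, we bow before you!")\n')
source = src.format(chr(39) * 3, src)

with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(source)
    path = f.name
try:
    out = subprocess.run([sys.executable, path, "--quine"],
                         capture_output=True, text=True).stdout
finally:
    os.remove(path)
```

Running the same file without "--quine" prints the Moloch line instead.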
That's the crux of this scenario, whether current AIs with near future improvements can do research. If they can, with scaling they only do it better. If they can't, scaling might fail to help, even if they become agentic and therefore start generating serious money. That's the sense in which AIs capable of 10 hours of work don't lead to game-changing acceleration of research, by remaining incapable of some types of work.
What seems inevitable at the moment is AIs gaining world models where they can reference any concepts that frequently come up in the training data. This promises proficiency in arbitrary routine tasks, but not necessarily construction of novel ideas that lack sufficient footprint in the datasets. Ability to understand such ideas in-context when explained seems to be increasing with LLM scale though, and might be crucial for situational awareness needed for becoming agentic, as every situation is individually novel.
If AIs of the near future can't do good research (and instead are only proficient in concepts that have significant presence in datasets), singularity remains bottlenecked by human research speed. The way such AIs speed things up is through their commercial success making more investment in scaling possible, not directly (and there is little use for them in the lab). It's currently unknown if scaling even at $1 trillion level is sufficient by itself, so some years of Futurama don't seem impossible, especially as we are only talking 2029.
Meaning 30% of >$5bn, that is 70% of <$5bn? What needs to happen for investment to stay this low through 2029, given that I'm guessing there are at present plans for $1bn-$3bn runs, possibly this year with GPT-5 and then Claude and Gemini? When, say, OpenAI has a valuation of $80bn, it makes sense to at some point invest some percentage of that in improving the backbone of their business and not losing the market niche to competitors. (ITER and ISS are in the $20bn-$150bn range, so a $50bn model is feasible even without a demonstrable commercial motivation, though possibly not by 2029 yet.)
What is the cost of the most expensive individual foundation models in this world? I think going in the $50-$500 billion territory requires notable improvement at $3-$20 billion scale, possibly agentic behavior for longer term tasks with novel subproblems, otherwise scaling investment stops and a prediction along the lines in this post makes sense (assuming RL/search breakthroughs for reasoning are somewhat unlikely).
Training doesn't become more efficient: gradients and activations are still full precision, and I'm guessing a full-precision copy of the weights is maintained during training (in addition to the quantized weights used for forward passes). The advantage is that this method of training produces a quantized model with the same quality as a non-quantized model (unlike post-training quantization, which makes models worse). And additionally the {-1, 0, 1} quantization means much less multiplication circuitry is needed for inference, so the potential for inference chips is not just less memory, but also less energy and fewer transistors, significantly raising the practical ceiling for local (on-device) inference.
It's apparently not a novel idea, quantization aware training was explored before there were transformers:
The paper is not about post-training quantization, instead it's quantization aware training (this is more clearly discussed in the original BitNet paper). The representation is ternary {-1, 0, 1} from the start, the network learns to cope with that constraint throughout pre-training instead of getting subjected to brain damage of quantization after training.
Compare this with
where the Microscaling block number format is used to train a transformer at essentially 4 bits per weight, achieving the same perplexity as with 32 bit floating point weights, see Figure 4 on page 7. If perplexity doesn't change for quantization aware training when going down to 4 bits, it's not too shocking that it doesn't significantly change at 1.6 bits either.
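A toy sketch of the quantization-aware-training mechanism discussed above (my illustration in numpy, not the papers' code; real BitNet-style training does this inside a transformer): the forward pass uses weights rounded to {-1, 0, 1} times a scale, while gradients update a full-precision master copy, straight-through-estimator style:

```python
import numpy as np

def ternarize(w):
    # Round to {-1, 0, 1} after scaling by the mean absolute value,
    # roughly in the spirit of BitNet b1.58's absmean quantization.
    scale = np.mean(np.abs(w)) + 1e-8
    return np.clip(np.round(w / scale), -1, 1), scale

rng = np.random.default_rng(0)
w_true = rng.normal(size=(4, 4))
x = rng.normal(size=(64, 4))
target = x @ w_true

w_full = rng.normal(size=(4, 4)) * 0.1  # full-precision master weights
losses = []
for step in range(300):
    w_q, scale = ternarize(w_full)       # the forward pass sees only ternary weights
    y = x @ (w_q * scale)
    losses.append(float(np.mean((y - target) ** 2)))
    grad_y = 2 * (y - target) / len(x)
    # Straight-through estimator: the gradient computed through the quantized
    # weights is applied directly to the full-precision copy.
    w_full -= 0.05 * (x.T @ grad_y)
```

The loss decreases even though every forward pass is ternary, because the network learns to cope with the constraint throughout training rather than having it imposed afterward.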
Refuting something wrong is only useful when there are identifiable failures of local validity (which often only makes it stronger). Refuting something as a whole is better thought of as offering an alternative frame that doesn't particularly interact with the "refuted" frame. The key obstruction is unwillingness to contradict yourself, to separately study ideas that are clearly inconsistent with each other, without taking a side in the contradiction in the context of studying either point of view.
So a flat Earth theory might have a particular problem worth talking about, and hashing out the problem is liable to make a stronger flat Earth theory. Or the "refutation" is not about the flat Earth theory, it's instead an explanation of a non-flat Earth theory that's not at all a refutation, its subject matter is completely separate. The difficulty is when flat Earth conviction prevents a person from curious engagement with non-flat Earth details.
Membranes are filters: they let in admissible things and repel inadmissible things. When an agent manages a membrane, it both maintains its existence and configures the filtering. Manipulation or damage suffered by the agent can result in configuring a membrane to admit harmful things, or in failing to maintain the membrane's existence. There are many membranes an agent may be involved in managing.
Any increase in scale is some chance of AGI at this point, since unlike weaker models, GPT-4 is not stupid in a clear way, it might be just below the threshold of scale to enable an LLM to get its act together. This gives some 2024 probability.
More likely, a larger model "merely" makes job-level agency feasible for relatively routine human jobs, but that alone would suddenly make $50-$500 billion runs financially reasonable. Given the premise of job-level agency at <$5 billion scale, the larger runs likely suffice for AGI. The Gemini report says training took place in multiple datacenters, which suggests that this sort of scaling might already be feasible, except for the risk that it produces something insufficiently commercially useful to justify the cost (and waiting improves the prospects). So this might all happen as early as 2025 or 2026.
I'd put more probability in the scenario where good $5 billion 1e27 FLOPs runs give mediocre results, so that more scaling remains feasible but lacks an expectation of success. With how expensive the larger experiments would be, it could take many years for someone to take another draw from the apocalypse deck. That alone adds maybe 2% for 10 years after 2026 or so, and there are other ways for AGI to start working.
The question "Aligned to whom?" is sufficiently vague to admit many reasonable interpretations, but has some unfortunate connotations. It sounds like there's a premise that AIs are always aligned to someone, making the possibility that they are aligned to no one but themselves less salient. And it boosts the frame of competition, as opposed to distribution of radical abundance, of possibly there being someone who gets half of the universe.
Building a powerful AI such that doing so is a good thing rather than a bad thing. Perhaps even the question of there being survivors shouldn't insist on the definite article, on being the question, as there are many questions, with various levels of severity, that are not mutually exclusive.
When boundaries leak, it's important to distinguish commitment to rectify them from credence that they didn't leak.
These are all failures to acknowledge the natural boundaries that exist between individuals.
You shouldn't worry yet, the models need to be far more capable.
The right time to start worrying is too early, otherwise it will be too late.
(I agree in the sense that current models very likely can't be made existentially dangerous, and in that sense "worrying" is incorrect, but the proper use of worrying is planning for the uncertain future, a different sense of "worrying".)
It's not entirely clear how and why GPT-4 (possibly a 2e25 FLOPs model) or Gemini Ultra 1.0 (possibly a 1e26 FLOPs model) don't work as autonomous agents, but it seems that they can't. So it's not clear that the next generation of LLMs built in a similar way will enable significant agency either. There are millions of AI GPUs currently being produced each year, and millions of GPUs can only support a 1e28-1e30 FLOPs training run (that doesn't individually take years to complete). There's (barely) enough text data for that.
GPT-2 would take about 1e20 FLOPs to train with modern methods, on the FLOPs log scale it's already further away from GPT-4 than GPT-4 is from whatever is feasible to build in the near future without significant breakthroughs. So there are only about two more generations of LLMs in the near future if most of what changes is scale. It's not clear that this is enough, and it's not clear that this is not enough.
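The log-scale comparison can be made explicit (using the rough FLOPs estimates from this comment, which are guesses, not measurements):

```python
import math

gpt2 = 1e20        # GPT-2 retrained with modern methods (rough estimate)
gpt4 = 2e25        # GPT-4 (rough estimate)
feasible = 1e29    # near-term feasible without breakthroughs (middle of 1e28-1e30)

gap_back = math.log10(gpt4) - math.log10(gpt2)        # ~5.3 orders of magnitude
gap_forward = math.log10(feasible) - math.log10(gpt4)  # ~3.7 orders of magnitude
```

On this scale the step back to GPT-2 (~5.3 OOMs) is indeed larger than the step forward to the feasible frontier (~3.7 OOMs, or ~4.7 even at the 1e30 end), and at roughly two orders of magnitude per generation the forward gap amounts to about two more generations.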
With Sora, the underlying capability is not just video generation, it's also video perception, looking at the world instead of dreaming of it. A sufficiently capable video model might be able to act in the world by looking at it in the same way a chatbot acts in a conversation by reading it. Models that can understand images are already giving new ways of specifying tasks and offering feedback on performance in robotics, and models that can understand video will only do this better.
The premise is autonomous agents at near-human level with propensity and opportunity to establish global lines of communication with each other. Being served via API doesn't in itself control what agents do, especially if users can ask the agents to do all sorts of things and so there are no predefined airtight guardrails on what they end up doing and why. Large context and possibly custom tuning also makes activities of instances very dissimilar, so being based on the same base model is not obviously crucial.
The agents only need to act autonomously the way humans do, don't need to be the smartest agents available. The threat model is that autonomy at scale and with high speed snowballs into a large body of agent culture, including systems of social roles for agent instances to fill (which individually might be swapped out for alternative agent instances based on different models). This culture exists on the Internet, shaped by historical accidents of how the agents happen to build it up, not necessarily significantly steered by anyone (including individual agents). One of the things such culture might build up is software for training and running open source agents outside the labs. Which doesn't need to be cheap or done without human assistance. (Imagine the investment boom once there are working AGI agents, not being cheap is unlikely to be an issue.)
Superintelligence plausibly breaks this dynamic by bringing much more strategicness than feasible at near-human level. But I'm not sure established labs can keep the edge and get (aligned) ASI first once the agent culture takes off. And someone will probably start serving autonomous near-human level agents via API long before any lab builds superintelligence in-house, even if there is significant delay between the development of first such agents and anyone deploying them publicly.
For it to make sense to say that the math is wrong, there needs to be some sort of ground truth, making it possible for math to also be right, in principle. Even doing the math poorly is exercise that contributes to eventually making the math less wrong.
If allowed to operate in the wild and globally interact with each other (as seems almost inevitable), agents won't exist strictly within well-defined centralized bureaucracies, the thinking speed that enables impactful research also enables growing elaborate systems of social roles that drive the collective decision making, in a way distinct from individual decision making. Agent-operated firms might be an example where economy drives decisions, but nudges of all kinds can add up at scale, becoming trends that are impossible to steer.
The thing that seems more likely to first get out of hand is activity of autonomous non-ASI agents, so that the shape of loss of control is given by how they organize into a society. Alignment of individuals doesn't easily translate into alignment of societies. Development of ASI might then result in another change, if AGIs are as careless and uncoordinated as humanity.
A model is like a compiled binary, except compilation is extremely expensive. Distributing a model alone and claiming it's "open source" is like calling a binary distribution without source code "open source".
The term that's catching on is open weight models as distinct from open source models. The latter would need to come with datasets and open source training code that enables reproducing the model.
My impression is that one point Hanson was making in the spring-summer 2023 podcasts is that some major issues with AI risk don't seem different in kind from cultural value drift that's already familiar to us. There are obvious disanalogies, but my understanding of this point is that there is still a strong analogy that people avoid acknowledging.
If human value drift was already understood as a serious issue, the analogy would seem reasonable, since AI risk wouldn't need to involve more than the normal kind of cultural value drift compressed into short timelines and allowed to exceed the bounds from biological human nature. But instead there is perceived safety to human value drift, so the argument sounds like it's asking to transport that perceived safety via the analogy over to AI risk, and there is much arguing on this point without questioning the perceived safety of human value drift. So I think what makes the analogy valuable is instead transporting the perceived danger of AI risk over to the human value drift side, giving another point of view on human value drift, one that makes the problem easier to see.
You are directing a lot of effort at debating details of particular proxies for an optimization target, pointing out flaws. My point is that strong optimization for any proxy that can be debated in this way is not a good idea, so improving such proxies doesn't actually help. A sensible process for optimizing something has to involve continually improving formulations of the target as part of the process. It shouldn't be just given any target that's already formulated, since if it's something that would seem to be useful to do, then the process is already fundamentally wrong in what it's doing, and giving a better target won't fix it.
The way I see it, CEV-as-formulated is gesturing at the kind of thing an optimization target might look like. It's in principle some sort of proxy for it, but it's not an actionable proxy for anything that can't come up with a better proxy on its own. So improving CEV-as-formulated might make the illustration better, but for anything remotely resembling its current form it's not a useful step for actually building optimizers.
Variants of CEV all having catastrophic flaws is some sort of argument that there is no optimization target that's worth optimizing for. Boundaries seem like a promising direction for addressing the group vs. individual issues. Never optimizing for any proxy more strongly than its formulation is correct (and always pursuing improvement over current proxies) responds to there often being hidden flaws in alignment targets that lead to catastrophic outcomes.
The blast radius of AGIs is unbounded in the same way as that of humanity, there is potential for taking over all of the future. There are many ways of containing it, and alignment is a way of making the blast a good thing. The point is that a sufficiently catastrophic failure that doesn't involve containing the blast is unusually impactful. Arguments about ease of containing the blast are separate from this point in the way I intended it.
If you don't expect AGIs to become overwhelmingly powerful faster than they are made robustly aligned, containing the blast takes care of itself right until it becomes unnecessary. But with the opposite expectation, containing becomes both necessary (since early AGIs are not yet robustly aligned) and infeasible (since early AGIs are very powerful). So there's a question of which expectation is correct, but the consequences of either position seem to straightforwardly follow.
Stronger versions of seemingly-aligned AIs are probably effectively misaligned in the sense that optimization targets they formulate on long reflection (or superintelligent reflection) might be sufficiently different from what humanity should formulate. These targets don't concretely exist before they are formulated, which is very hard to do (and so won't yet be done by the time there are first AGIs), and strongly optimizing for anything that does initially exist is optimizing for a faulty proxy.
The arguments about dangers of this kind of misalignment seem to apply to humanity itself, to the extent that it can't be expected to formulate and pursue the optimization targets that it should, given the absence of their concrete existence at present. So misalignment in AI risk involves two different issues, difficulty of formulating optimization targets (an issue both for humans and for AIs) and difficulty of replicating in AIs the initial conditions for humanity's long reflection (as opposed to the AIs immediately starting to move in their own alien direction).
To the extent prosaic alignment seems to be succeeding, one of these problems is addressed, but not the other. Setting up a good process that ends up formulating good optimization targets becomes suddenly urgent with AI, which might actually have a positive side effect of reframing the issue in a way that makes complacency of value drift less dominant. Wei Dai and Robin Hanson seem to be gesturing at this point from different directions, how not doing philosophy correctly is liable to get us lost in the long term, and how getting lost in the long term is a basic fact of human condition and AIs don't change that.
Basic science and pure mathematics enable their own subsequent iterations without having them as explicit targets or even without being able to imagine these developments, while doing the work crucial in making them possible.
Extensive preparation never happened with a thing that is ready to be attempted experimentally, because in those cases we just do the experiments, there is no reason not to. With AGI, the reason not to do this is the unbounded blast radius of a failure, an unprecedented problem. Unprecedented things are less plausible, but unfortunately this can't be expected to have happened before, because then you are no longer here to update on the observation.
If the blast radius is not unbounded, if most failures can be contained, then it's more reasonable to attempt to develop AGI in the usual way, without extensive preparation that doesn't involve actually attempting to build it. If preparation in general doesn't help, it doesn't help AGIs either, making them less dangerous and reducing the scope of failure, and so preparation for building them is not as needed. If preparation does help, it also helps AGIs, and so preparation is needed.
Consider an indefinite moratorium on AGI that awaits better tools that make building it a good idea rather than a bad idea. If there was a magic button that rewrote laws of nature to make this happen, would it be a good idea to press it? My point is that we both endorse pressing this button, the only difference is that your model says that building an AGI immediately is a good idea, and so the moratorium should end immediately. My model disagrees. This particular disagreement is not about the generations of people who forgo access to potential technology (where there is no disagreement), and it's not about feasibility of the magic button (which is a separate disagreement). It's about how this technology works, what works in influencing its design and deployment, and the effect it has on the world once deployed.
The crux of that disagreement seems to be about the importance of preparation in advance of doing a thing, compared to the process of actually doing the thing in the real world. A pause enables extensive preparation for building an AGI, and the high serial speed of thought of AGIs enables them extensive preparation for acting on the world. If such preparation doesn't give a decisive advantage, a pause doesn't help, and AGIs don't rewrite reality in a year once deployed. If it does give a decisive advantage, a pause helps significantly, and a fast-thinking AGI shortly gains the affordance of overwriting humanity with whatever it plans to enact.
I see preparation as raising generations of giants to stand on the shoulders of, which in time changes the character of the practical projects that would be attempted, and the details we pay attention to as we carry out such projects. Yes, cryptography isn't sufficient to make systems secure, but absence of cryptography certainly makes them less secure, as does attempting to design cryptographic algorithms without taking the time to get good at it. This is the kind of preparation that makes a difference. Noticing that superintelligence doesn't imply supermorality and that alignment is a concern at all is an important development. Appreciating goodharting and corrigibility changes which safety properties of AIs appear important, when looking into more practical designs that don't necessarily originate from these considerations. Deceptive alignment is a useful concern to keep in mind, even if in the end it turns out that practical systems don't have that problem. Experiments on GPT-2 sized systems still have a whole lot to teach us about interpretable and steerable architectures.
Without AGI interrupting this process, the kinds of things people would attempt in order to build an AGI would be very different 20 years from now, and different yet again in 40, 60, 80, and 100 years. I expect some accumulated wisdom to steer such projects in better and better directions, even if the resulting implementation details remain sufficiently messy and make the resulting systems moderately unsafe, approaching some asymptote of safety beyond which the cost to the aging generations makes it worthwhile to forgo additional preparation.
Hypotheticals disentangle models from values. A pause is not a policy, not an attempt at a pause that might fail; it's the actual pause, the hypothetical. We can then look at the various hypotheticals, ask what happens in each, and which one is better. Hopefully our values can handle the strain of out-of-distribution evaluation and don't collapse into the incoherence of goodharting, unable to say anything relevant about situations that our models consider impossible in actual reality.
In the hypothetical of a 100-year pause, the pause actually happens, even if this is impossible in actual reality. One of the things within that hypothetical is death of 4 generations of humans. Another is the AGI that gets built at the end. In your model, that AGI is no safer than the one that we build without the magic hypothetical of the pause. In my model, that AGI is significantly safer. A safer AGI translates into more value of the whole future, which is much longer than the current age of the universe. And an unsafe AGI now is less than helpful to those 4 generations.
AI control is similar in many ways to cybersecurity in that you are trying to limit the AIs access to functions that let it do bad things, and prevent the AI from seeing information that will allow it to fail.
That's the point of AI alignment as distinct from AI control. Your model says the distinction doesn't work. My model says it does. Therefore my model endorses the hypothetical of a pause.
Having endorsed a hypothetical, I can start paying attention to ways of moving reality in its direction. But that is distinct from a judgement about what the hypothetical entails.
"AI pause" talk [...] dooms [...] to more of the same
This depends on the model of risks. If risks without a pause are low, and they aren't significantly reduced by a pause, then a pause makes things worse. If risks without a pause are high, but risks after a 20-year pause are much lower, then a pause is an improvement even for personal risk, for sufficiently young people.
If risks without a pause are high, risks after a 50-year pause remain moderately high, but risks after a 100-year pause become low, then not pausing trades a significant measure of the future of humanity for a much smaller measure of survival of currently living humans. Incidentally, sufficiently popular cryonics could put a dent into this tradeoff for humanity, and cryonics as it stands can personally opt out anyone who isn't poor and lives in a country where the service is available.
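The structure of this tradeoff can be sketched numerically. All figures below (risk levels, the value share of a generation) are made-up placeholders purely for illustration, not estimates:

```python
# Hypothetical numbers to make the structure of the tradeoff concrete;
# none of these are actual estimates.
scenarios = {
    "no pause":       {"extinction_risk": 0.50, "generations_lost": 0},
    "50-year pause":  {"extinction_risk": 0.30, "generations_lost": 2},
    "100-year pause": {"extinction_risk": 0.05, "generations_lost": 4},
}

future_value = 1.0        # value of the long-term future if AGI goes well
generation_value = 1e-6   # value share attributed to one current generation

expected_value = {
    name: (1 - s["extinction_risk"]) * future_value
          - s["generations_lost"] * generation_value
    for name, s in scenarios.items()
}
```

Because the long future dominates the calculation, even the full cost to several current generations barely moves the expected value, which is the asymmetry the argument turns on.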
The most likely way to get to extremely safe AGI or ASI systems is not by humans creating them, it's by other less-safe AGI systems creating them.
This does seem more likely, but managing to sidestep the less-safe AGI part would be safer. In particular, it might be possible to construct a safe AGI by using safe-if-wielded-responsibly tool AIs (that are not AGIs), if humanity takes enough time to figure out how to actually do that.
the view that there’s probably no persisting identity over time anyway and in some sense I probably die and get reborn all the time in any case
In the long run, this is probably true for humans in a strong sense that doesn't depend on litigation of "personal identity" and "all the time". A related phenomenon is value drift. Neural nets are not a safe medium for keeping a person alive for a very long time without losing themselves; physical immortality is insufficient to solve the problem.
That doesn't mean that the problem isn't worth solving, or that it can't be solved. If AIs don't killeveryone, immortality or uploading is an obvious ask. But avoidance of value drift, or of unendorsed long-term instability of one's personality, is a less obvious one. It's unclear what the desirable changes should be, but it's clear that there is an important problem here that hasn't been explored.
Metaphorically, there is a question CEV tries to answer, and by "something like CEV" I meant any provisional answer to the appropriate question (so that CEV-as-currently-stated is an example of such an answer). Formulating an actionable answer is not a project humans would be ready to work on directly any time soon. So CEV is something to aim at by intention that defines CEV. If it's not something to aim at, then it's not a properly constructed CEV.
This lack of a concrete formulation is the reason goodharting and corrigibility seem salient in operationalizing the process of formulating it and making use of the formulation-so-far. Any provisional formulation of an alignment target (such as CEV-as-currently-stated) would be a proxy, and so any optimization according to such proxy should be wary of goodharting and be corrigible to further refinement.
The point of discussion of boundaries was in response to possible intuition that expected utility maximization tends to make its demands with great uniformity, with everything optimized in the same direction. Instead, a single goal may ask for different things to happen in different places, or to different people. It's a more reasonable illustration of goal aggregation than utilitarianism that sums over measures of value from different people or things.
The issue with proxies for an objective is that they are similar to it. So an attempt to approximately describe the objective (such as an attempt to say what CEV is) can easily arrive at a proxy that has glaring goodharting issues. Corrigibility is one way of articulating a process that fixes this: optimization shouldn't outpace the accuracy of the proxy, which can improve over time.
Volition of humanity doesn't obviously put the values of the group before values of each individual, as we might put boundaries between individuals and between smaller groups of individuals, with each individual or smaller group having greater influence and applying their values more strongly within their own boundaries. There is then no strong optimization from values of the group, compared to optimization from values of individuals. This is a simplistic sketch of how this could work in a much more elaborate form (where the boundaries of influence are more metaphorical), but it grounds this issue in more familiar ideas like private property, homes, or countries.
This seems to be mostly goodharting: how the tails come apart when optimizing or selecting for a proxy rather than for what you actually want. And people don't all want the same thing, without disagreement or value drift. The near-term practical solution is not optimizing too hard, and building an archipelago with membranes between people and between communities that bound the scope of stronger optimization. Being corrigible about everything might also be crucial. The longer-term idealized solution is something like CEV, saying in a more principled and precise way what the practical solutions only gesture at, and executing on that vision at scale. This needs to be articulated with caution, as it's easy to stray into something that is obviously a proxy and very hazardous to strongly optimize.
Right, a probable way of doing continued pretraining could as well be called "full-tuning", or just "tuning" (which is what you said, not "fine-tuning"), as opposed to "fine-tuning" that trains fewer weights. Though people seem unsure about whether "fine-tuning" implies that it's not full-tuning, resulting in terms like "dense fine-tuning" to mean full-tuning.
good terms to distinguish full-tuning the model in line with the original method of pretraining, and full layer LoRA adaptations that 'effectively' continue pretraining but are done in a different manner
You mean like ReLoRA, where full-rank pretraining is followed by many batches of LoRA that get baked in? Fine-pretraining :-) Feels like a sparsity-themed training efficiency technique, which doesn't lose its centrality points by being used for "pretraining". To my mind, tuning is cheaper adaptation, things that use OOMs less data than pretraining (even if it's full-tuning). So maybe the terms tuning/pretraining should be defined by the role those parts of the training play in the overall process, rather than by the algorithms involved? This makes fine-tuning an unnecessarily specific term, claiming that it's both tuning and trains fewer weights.
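The ReLoRA-style "baking in" can be sketched as follows (a minimal numpy illustration; sizes, scales, and the random stand-ins for trained adapters are arbitrary assumptions): each phase trains a rank-r update B @ A that is merged into the base weights, and successive merged phases can accumulate a change of rank higher than r.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 8, 16          # hidden size, LoRA rank, LoRA scaling
W0 = rng.normal(size=(d, d))     # full-rank base weight from pretraining
W = W0.copy()

def bake_lora(W, A, B, alpha, r):
    """Merge a low-rank LoRA update into the base weights: W + (alpha/r) B A."""
    return W + (alpha / r) * (B @ A)

for phase in range(3):           # several LoRA phases, each baked in
    # Stand-ins for trained adapters; in ReLoRA these come from a
    # LoRA training phase, then get merged and reinitialized.
    A = rng.normal(scale=0.1, size=(r, d))
    B = rng.normal(scale=0.1, size=(d, r))
    W = bake_lora(W, A, B, alpha, r)

# Each individual update has rank <= r, but their sum generically has
# rank up to (number of phases) * r, which is why repeated baking can
# approximate full-rank (pre)training.
total_rank = np.linalg.matrix_rank(W - W0)
```

This is why the technique blurs the tuning/pretraining boundary: the algorithm per phase looks like fine-tuning, but the accumulated update is not rank-limited.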
Some things are best avoided entirely when you take their risks into account, some become worthwhile only if you manage their risks instead of denying their existence even to yourself. But even when denying risks gives positive outcomes in expectation, adequately managing those risks is even better. Unless society harms the project for acknowledging some risks, which it occasionally does. In which case managing them without acknowledgement (which might require magic cognitive powers) is in tension with acknowledging them despite the expected damage from doing so.
being tuned on a Llama 70B
Based on Mensch's response, Miqu is probably the result of continued pretraining starting from Llama2-70B, a process similar to how CodeLlama or Llemma were trained. (Training on large datasets comparable to the original pretraining dataset is usually not called fine-tuning.)
less capable model trained on the same dataset
If Miqu underwent continued pretraining from Llama2-70B, the dataset won't be quite the same, unless mistral-medium was also pretrained from Llama2-70B (in which case it won't be released under Apache 2).
Bard Gemini Pro (as it's called in lmsys arena) has access to the web and an unusual finetuning with a hyper-analytical character: it often explicitly formulates multiple subtopics in a reply and looks into each of them separately. In contrast, the earlier Gemini Pro entries that are not Bard have a finetuning or prompt not suitable for the arena, often giving a single sentence or even a single word as a first response. Thus, like Claude 2 (with its unlikable character), they operate at a handicap relative to base model capabilities. GPT-4 on lmsys arena doesn't have access to the web, and GPT-4 Turbo's newer knowledge from 2022-2023 seems more shallow than its earlier knowledge; they probably didn't fully retrain the base model just for this release.
So both kinds of Gemini Pro are bad proxies for the placement of their base model on the leaderboard. In particular, if the Bard entry in the arena is in fact Gemini Pro and not Gemini Ultra, then Gemini Ultra with Bard Gemini Pro's advantages will probably beat the current GPT-4 Turbo (which doesn't have these advantages) even if Ultra is not smarter than GPT-4.
This notion of thinking speed makes sense for large classes of tasks, not just specific tasks. And a natural class of tasks to focus on is the harder tasks among all the tasks both systems can solve.
So in this sense a calculator is indeed much faster than GPT-4, and GPT-4 is 2 OOMs faster than humans. An autonomous research AGI is capable of autonomous research, so its speed can be compared to that of humans at that class of tasks.
AI accelerates the pace of history only when it's capable of making the same kind of progress as humans in advancing history, at which point we need to compare their speed to that of humans at that activity (class of tasks). Currently AIs are not capable of that at all. If hypothetically 1e28 training FLOPs LLMs become capable of autonomous research (with scaffolding that doesn't incur too much latency overhead), we can expect that they'll be 1-2 OOMs faster than humans, because we know how they work. Thus it makes sense to claim that 1e28 FLOPs LLMs will accelerate history if they can do research autonomously. If AIs need to rely on extensive search on top of LLMs to get there, or if they can't do it at all, we can instead predict that they don't accelerate history, again based on what we know of how they work.
current AIs are not thinking faster than humans [...] GPT-4 has higher token latency than GPT-3.5, but I think it's fair to say that GPT-4 is the model that "thinks faster"
This notion of thinking speed depends on the difficulty of a task. If one of the systems can't solve a problem at all, it's neither faster nor slower. If both systems can solve a problem, we can compare the time they take. In that sense, current LLMs are 1-2 OOMs faster than humans at the tasks both can solve, and much cheaper.
Old chess AIs were slower than humans good at chess. If future AIs can take advantage of search to improve quality, they might again get slower than humans at sufficiently difficult tasks, while simultaneously being faster than humans at easier tasks.
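The 1-2 OOMs figure for tasks both can solve follows from back-of-the-envelope numbers (every figure below is an assumption: serving throughput, the pace of deliberate human verbal thought, and the tokens-per-word ratio all vary widely):

```python
import math

# Illustrative assumptions, not measurements:
llm_tokens_per_s = 50        # rough single-stream LLM serving throughput
human_words_per_min = 150    # rough pace of deliberate human verbal thought
tokens_per_word = 1.3        # common English tokenization ratio

human_tokens_per_s = human_words_per_min * tokens_per_word / 60
speedup = llm_tokens_per_s / human_tokens_per_s   # LLM vs. human, same task
ooms = math.log10(speedup)
```

With these numbers the speedup lands between 1 and 2 OOMs; more aggressive serving (batching aside, faster decoding) or slower human deliberation pushes it toward the higher end.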
Projects that involve interplanetary transit are not part of the development I discuss, so they can't slow it down. You don't need to wait for paint to dry if you don't use paint.
There are no additional pieces of infrastructure that need to be in place to make programmable cells, only their design and what modern biotech already has to manufacture some initial cells. It's a question of sample efficiency in developing simulation tools: how many observations would it take for the simulation tools to get good enough, if you had centuries to design the process of deciding what to observe and how to make use of the observations to improve the simulation tools.
So a crux might be the impossibility of creating the simulation tools with data that can be collected in the modern world over a few months. It's an issue distinct from the inability to develop programmable cells.
Machining equipment takes time to cut an engine, nano lathe a part, or if we are growing human organs to treat VIPs it takes months for them to grow.
That's why you don't do any of the slower things at all (in a blocking way), and instead focus on the critical path of controllable cells for macroscopic biotech, or something like that, together with the experiments needed to train simulators good enough to design them. This enables exponentially scaling physical infrastructure once completed, which can then be used to do all the other things. Simulation is not the methods of today; it's all the computational shortcuts to making correct predictions about the simulated systems that the AGIs can come up with in subjective centuries of thinking, with a few experimental observations to ground the thinking. And once the initial hardware scaling project is completed, it enables much better simulation of more complicated things.