Posts

No Anthropic Evidence 2012-09-23T10:33:06.994Z
A Mathematical Explanation of Why Charity Donations Shouldn't Be Diversified 2012-09-20T11:03:48.603Z
Consequentialist Formal Systems 2012-05-08T20:38:47.981Z
Predictability of Decisions and the Diagonal Method 2012-03-09T23:53:28.836Z
Shifting Load to Explicit Reasoning 2011-05-07T18:00:22.319Z
Karma Bubble Fix (Greasemonkey script) 2011-05-07T13:14:29.404Z
Counterfactual Calculation and Observational Knowledge 2011-01-31T16:28:15.334Z
Note on Terminology: "Rationality", not "Rationalism" 2011-01-14T21:21:55.020Z
Unpacking the Concept of "Blackmail" 2010-12-10T00:53:18.674Z
Agents of No Moral Value: Constrained Cognition? 2010-11-21T16:41:10.603Z
Value Deathism 2010-10-30T18:20:30.796Z
Recommended Reading for Friendly AI Research 2010-10-09T13:46:24.677Z
Notion of Preference in Ambient Control 2010-10-07T21:21:34.047Z
Controlling Constant Programs 2010-09-05T13:45:47.759Z
Restraint Bias 2009-11-10T17:23:53.075Z
Circular Altruism vs. Personal Preference 2009-10-26T01:43:16.174Z
Counterfactual Mugging and Logical Uncertainty 2009-09-05T22:31:27.354Z
Bloggingheads: Yudkowsky and Aaronson talk about AI and Many-worlds 2009-08-16T16:06:18.646Z
Sense, Denotation and Semantics 2009-08-11T12:47:06.014Z
Rationality Quotes - August 2009 2009-08-06T01:58:49.178Z
Bayesian Utility: Representing Preference by Probability Measures 2009-07-27T14:28:55.021Z
Eric Drexler on Learning About Everything 2009-05-27T12:57:21.590Z
Consider Representative Data Sets 2009-05-06T01:49:21.389Z
LessWrong Boo Vote (Stochastic Downvoting) 2009-04-22T01:18:01.692Z
Counterfactual Mugging 2009-03-19T06:08:37.769Z
Tarski Statements as Rationalist Exercise 2009-03-17T19:47:16.021Z
In What Ways Have You Become Stronger? 2009-03-15T20:44:47.697Z
Storm by Tim Minchin 2009-03-15T14:48:29.060Z

Comments

Comment by Vladimir_Nesov on Scaling of AI training runs will slow down after GPT-5 · 2024-04-26T16:55:40.947Z · LW · GW

Distributed training seems close enough to being a solved problem that a project costing north of a billion dollars might get it working on schedule. It's easier to stay within a single datacenter, and so far it wasn't necessary to do more than that, so distributed training not being routinely used yet is hardly evidence that it's very hard to implement.

There's also this snippet in the Gemini report:

Training Gemini Ultra used a large fleet of TPUv4 accelerators owned by Google across multiple datacenters. [...] we combine SuperPods in multiple datacenters using Google’s intra-cluster and inter-cluster network. Google’s network latencies and bandwidths are sufficient to support the commonly used synchronous training paradigm, exploiting model parallelism within superpods and data-parallelism across superpods.

I think the crux for feasibility of further scaling (beyond $10-$50 billion) is whether systems with currently-reasonable cost keep getting sufficiently more useful, for example enable economically valuable agentic behavior, things like preparing pull requests based on feature/bug discussion on an issue tracker, or fixing failing builds. Meaningful help with research is a crux for reaching TAI and ASI, but it doesn't seem necessary for enabling existence of a $2 trillion AI company.

Comment by Vladimir_Nesov on LLMs seem (relatively) safe · 2024-04-26T06:32:32.638Z · LW · GW

There is enough pre-training text data for $0.1-$1 trillion of compute, if we merely use repeated data and don't overtrain (that is, if we aim for quality, not inference efficiency). If synthetic data from the best models trained this way can be used to stretch raw pre-training data even a few times, this gives something like square of that more in useful compute, up to multiple trillions of dollars.

Issues with LLMs start at autonomous agency, if it happens to be within the scope of scaling and scaffolding. They are thinking too fast, about 100 times faster than humans, and there are as many instances as there is compute. Resulting economic and engineering and eventually research activity will get out of hand. Culture isn't stable, especially for minds fundamentally this malleable developed under unusual and large economic pressures. If they are not initially much smarter than humans and can't get a handle on global coordination, culture drift, and alignment of superintelligence, who knows what kinds of AIs they end up foolishly building within a year or two.

Comment by Vladimir_Nesov on Transformers Represent Belief State Geometry in their Residual Stream · 2024-04-17T18:18:26.562Z · LW · GW

This is interesting as commentary on superposition, where activation vectors with N dimensions can be used to represent many more concepts, since the N-dimensional space/sphere can be partitioned into many more regions than N, each with its own meaning. If similar fractal structure substantially occurs in the original activation bases (such as the Vs of attention, as in the V part of KV-cache) and not just after having been projected to dramatically fewer dimensions, this gives a story for role of nuance that improves with scale that's different from it being about minute distinctions in meaning of concepts.

Instead, the smaller distinctions would track meanings of future ideas, modeling sequences of simpler meanings of possible ideas at future time steps rather than individual nuanced meanings of the current idea at the current time step. Advancing to the future would involve unpacking these distinctions by cutting out a region and scaling it up. That is, there should be circuits that pick up past activations with attention and then reposition them without substantial reshaping, to obtain activations that in broad strokes indicate directions relevant for a future sequence-step, which in the original activations were present with smaller scale and off-center.

Comment by Vladimir_Nesov on Transformers Represent Belief State Geometry in their Residual Stream · 2024-04-17T16:14:19.766Z · LW · GW

To me the consequences of this response were more valuable than the-post-without-this-response, since it led to the clarification by the post's author on a crucial point that wasn't clear in the post and reframed it substantially. And once that clarification arrived, this thread ceased being highly upvoted, which seems the opposite of the right thing to happen.

I no longer endorse this response

(So it's a case where value of content in hindsight disagrees with value of the consequences of its existence. Doesn't even imply there was originally an error, without the benefit of hindsight.)

Comment by Vladimir_Nesov on Scaling Laws and Superposition · 2024-04-10T16:20:42.292Z · LW · GW

Model B has 8 times the aspect ratio [...] which falls under the reported range in Kaplan et al

Nice, this is explained under Figure 5, in particular

The loss varies only a few percent over a wide range of shapes. [...] an (, ) = (6, 4288) reaches a loss within 3% of the (48, 1600) model

(I previously missed this point, assumed shape had to be chosen in an optimal way for parameter count to fit the scaling laws.)

Comment by Vladimir_Nesov on What's with all the bans recently? · 2024-04-10T15:34:08.707Z · LW · GW

what feels to me a subjectively substantially higher standard for rate-limiting or banning people who disagree with me

Positions that are contrarian or wrong in intelligent ways (or within a limited scope of a few key beliefs) provoke valuable discussion, even when they are not supported by legible arguments on the contrarian/wrong side. Without them, there is an "everybody knows" problem where some important ideas are never debated or fail to become common knowledge. I feel there is less of that than optimal on LW, it's possible to target a level of disruption.

Comment by Vladimir_Nesov on Dagon's Shortform · 2024-04-07T11:48:13.054Z · LW · GW

In addition to being able to find your own recent comments, another issue is links to comments dying. For example if I were to link to this comment, I would worry it might quietly disappear at some point.

Comment by Vladimir_Nesov on Victor Ashioya's Shortform · 2024-04-07T11:43:00.795Z · LW · GW

A concerning thing is analogy between in-context learning and fine-tuning. It's possible to fine-tune away refusals, which makes guardrails on open weight models useless for safety. If the same holds for long context, API access might be in similar trouble (more so than with regular jailbreaks). Though it might be possible to reliably detect contexts that try to do this, or detect that a model is affected, even if models themselves can't resist the attack.

Comment by Vladimir_Nesov on Plausibility of cyborgism for protecting boundaries? · 2024-04-05T14:34:06.488Z · LW · GW

Second, there doesn't seem like a clear "boundaries good" or "boundaries bad" story to me. Keeping a boundary secure tends to impose some serious costs on the bandwidth of what can be shared across it.

Hence "membranes", a way to pass things through in a controlled way rather than either allowing or disallowing everything. In this sense absence of a membrane is a degenerate special case of a membrane, so there is no tradeoff between presence and absence of boundaries/membranes, only between different possible membranes. If the other side of a membrane is sufficiently cooperative, the membrane can be more permissive. If a strong/precise membrane is too costly to maintain, it should be weaker/sloppier.

Comment by Vladimir_Nesov on Run evals on base models too! · 2024-04-04T19:29:01.087Z · LW · GW

I expect you'd instead need to tune the base model to elicit relevant capabilities first. So instead of evaluating a tuned model intended for deployment (which can refuse to display some capabilities), or a base model (which can have difficulties with displaying some capabilities), you need to tune the model to be more purely helpful, possibly in a way specific to the tasks it's to be evaluated on.

Comment by Vladimir_Nesov on Jimrandomh's Shortform · 2024-04-04T15:06:04.981Z · LW · GW

we ourselves are likely to be very resource-inefficient to run [...] an AI that is aligned-as-in-keeps-humans-alive would also spend the resources to break a seal like this

That AI should mitigate something, is compatible with it being regrettable intentionally inflicted damage. In contrast, resource-inefficiency of humans is not something we introduced on purpose.

Comment by Vladimir_Nesov on Jimrandomh's Shortform · 2024-04-03T15:03:49.507Z · LW · GW

using a computation that requires a few orders of magnitude more energy than humanity currently produces per decade

Compute might get more expensive, not cheaper, because it would be possible to make better use of it (running minds, not stretching keys). Then it's weighing its marginal use against access to the sealed data.

Comment by Vladimir_Nesov on AI #57: All the AI News That’s Fit to Print · 2024-03-29T19:00:00.732Z · LW · GW

The model is a next token predictor. If you strip out all the next tokens that discuss the topic, it will learn that the probability of discussing the topic is zero.

The model is shaped by tuning from features of a representation produced by an encoder trained for the next-token prediction task. These features include meanings relevant to many possible topics. If you strip all the next tokens that discuss a topic, its meaning will still be prominent in the representation, so the probability of the tuned model being able to discuss it is high.

Comment by Vladimir_Nesov on Richard Ngo's Shortform · 2024-03-27T05:50:35.701Z · LW · GW

standards of most of his other posts, where he assumes both sides are reasonable and have useful perspectives

Scott's flavor of charity is not quite this. It wouldn't be useful for understanding sides that are not reasonable or have useless perspectives otherwise, or else you'd need to routinely "assume" false things to carry out the exercise.

The point is to meaningfully engage with other perspectives, without the usual prerequisite of having positive beliefs about them. Treating them in a similar way as if they were reasonable or useful, even when they clearly aren't. Sometimes the resulting investigation changes one's mind on this point. But often it doesn't, while still revealing many details that wouldn't otherwise be noticed. Actually intervening on your own beliefs would be self-deception, while treating useless and unreasonable views as they are usually treated wouldn't be charity.

This is related to tolerance, where the point isn't to start liking people you don't like, or to start considering them part of your own ingroup. It's instead an intervention/norm that goes around the dislike to remove some of its downsides without directly removing the dislike itself.

Comment by Vladimir_Nesov on Orthogonality Thesis seems wrong · 2024-03-26T15:25:51.818Z · LW · GW

Orthogonality thesis says that it's invalid to conclude benevolence from the premise of powerful optimization, it gestures at counterexamples. It's entirely compatible with benevolence being very likely in practice. You then might want to separately ask yourself if it's in fact likely. But you do need to ask, that's the point of orthogonality thesis, its narrow scope.

Comment by Vladimir_Nesov on Self-Play By Analogy · 2024-03-25T06:48:15.039Z · LW · GW

the data bottleneck that threatens to strangle scaling

There is no data bottleneck (for data that's not necessarily high quality), because data can be repeated in training, about 4 times without much difference compared to unique data, up to about 16 times while still significantly improving the model. This was notably used in Galactica (see Figure 6), published Nov 2022, then there was the systematic study of scaling laws for repeated data from May 2023, recently repeated data was applied in StarCoder 2 (Feb 2024).

A Chinchilla optimal model uses a model size proportional to dataset size, meaning compute is proportional to data squared. If you repeat data 16 times, this means finding a use for 256 times more compute. A filtered and deduplicated CommonCrawl text dataset RedPajama-Data-v2 has 30 trillion tokens. If repeated 16 times with a Chinchilla optimal monolithic Transformer, it would use about 7e28 FLOPs of compute. This scales with data squared, if there is more data to be found, which there certainly is, even if not OOMs more. Assuming BF16 training with 30% utilization, this would require 3.2e10 H100-hours, which assuming $2/hour takes about $65 billion. Anchoring to the rumored 2e25 FLOPs GPT-4 run at $100 million instead, this gives $350 billion. Both numbers are likely currently outside commercial feasibility, if smaller models fail to demonstrate sufficiently impressive feats. And there's still that further quadratic scaling of needed compute with more data than 30 trillion tokens. (Though Microscaling in Blackwell might reduce the cost of effective compute more than otherwise could be expected this soon.)

Comment by Vladimir_Nesov on Richard Ngo's Shortform · 2024-03-23T04:20:39.980Z · LW · GW

Individually logical counterfactuals don't seem very coherent. This is related to the "I'm an algorithm" vs. "I'm a physical object" distinction of FDT. When you are an algorithm considering a decision, you want to mark all sites of intervention/influence in the world where the world depends on your behavior. If you only mark some of them, then you later fail at the step where you ask what happens if you act differently, you obtain a broken counterfactual world where only some instances of the fact of your behavior have been replaced and not others.

So I think it makes a bit more sense to ask where specifically your brain depends on a fact, to construct an exhausive dependence of your brain on the fact, before turning to particular counterfactual content for that fact to be replaced with. That is, dependence of a system on a fact, the way it varies with the fact, seems potentially clearer than individual counterfactuals of how that system works if the fact is set to be a certain way. (To make a somewhat hopeless analogy, fibration instead of individual fibers, and it shouldn't be a problem that all fibers are different from each other. Any question about a counterfactual should be reformulated into a question about a dependence.)

Comment by Vladimir_Nesov on D0TheMath's Shortform · 2024-03-23T01:20:50.087Z · LW · GW

I don't think here is a significant confused naive supporter source of the meme that gives it teeth. It's more that reasonable people who are not any sort of supporters of AI safety propagate this idea, on the grounds that it illustrates the way AI safety is not just dumb, but also dangerous, and therefore worth warning others about.

From the supporter side, "Open Model Weights are Unsafe and Nothing Can Fix This" is a shorter and more convenient way of gesturing to the concern, and convenience is the main force in the Universe that determines all that actually happens in practice. On naive reading such gesturing centrally supports the meme. This doesn't require the source of such support to have a misconception or to oppose publishing open weights of current models on the grounds of direct consequences.

Comment by Vladimir_Nesov on D0TheMath's Shortform · 2024-03-23T00:43:57.167Z · LW · GW

I regularly encounter the impression that AI safety people are significantly afraid about direct consequences of open sourcing current models, from those who don't understand the actual concerns. I don't particularly see it from those who do. This (from what I can tell, false) impression seems to be one of relatively few major memes that keep people from bothering to investigate. I hypothesize that this dynamic of ridiculing of AI safety with such memes is what keeps them alive, instead of there being significant truth to them keeping them alive.

Comment by Vladimir_Nesov on D0TheMath's Shortform · 2024-03-23T00:14:47.562Z · LW · GW

I don't get the impression that very many are affraid of direct effects of open sourcing of current models. The impression that many in AI safety are afraid of specifically that is a major focus of ridicule from people who didn't bother to investigate, and a reason to not bother to investigate. Possibly this alone fuels the meme sufficiently to keep it alive.

Comment by Vladimir_Nesov on StartAtTheEnd's Shortform · 2024-03-23T00:09:01.325Z · LW · GW

There's volition, and all it entails. It can be instrumentally relevant that something is not "objective" or even "real", just as it's prudent to be willing to pay for groceries. If earlier impression promised more clarity than remained after further study, that's a practical concern, possible to work on.

Comment by Vladimir_Nesov on AI #56: Blackwell That Ends Well · 2024-03-22T05:12:29.677Z · LW · GW

it requires racks of power hungry GPUs to achieve the above that are mounted in data centers

Inference with models trained for ternary quantization (which uses massively fewer multiplications and so less power) only needs hardware that can take advantage of it, doesn't significantly lose quality compared to full precision. Though I don't know if there is a good RNN-like block to enable large context while still able to mostly avoid multiplications with ternary weights (as opposed to activations, which need to be more precise), which seems crucial for video. A more pressing issue might be latency.

Comment by Vladimir_Nesov on AI #56: Blackwell That Ends Well · 2024-03-22T02:20:16.149Z · LW · GW

Apparently Blackwell supports Microscaling, a block number format where multiple numbers share a scaling factor, and 4-6 bit Microscaling can be used for training (not just inference) as a drop-in replacement for FP32 (see page 7). For inference, models created with quantization-aware training (as opposed to being quantized post-training) are approximately as good as models in high precision (for the same training data and number of parameters).

So appeals to FP4/FP6 performance are not empty marketing, it seems to have an actual moderately straightforward use.

Comment by Vladimir_Nesov on Richard Ngo's Shortform · 2024-03-20T21:33:39.800Z · LW · GW

My guess is that understanding merging is the key to most prediction-of-behavior issues (things that motivated and also foiled UDT, but not limited to known-in-advance preference setting). Two agents can coordinate if they are the same, or reasoning about each other's behavior, but in general they can be too complicated to clearly understand each other or themselves, can inadvertently diagonalize such attempts into impossibility, or even fail to be sufficiently aware of each other to start reasoning about each other specifically.

It might be useful to formulate smaller computations (contracts/adjudicators) that facilitate coordination between different agents by being shared between them, with the bigger agents acting as parts of environments for the contracts and setting up incentives for them, while the contracts can themselves engage in decision making within those environments. Contracts coordinate by being shared and acting with strategicness across relevant agents (they should be something like common knowledge), and it's feasible for agents to find/construct some shared contracts as a result of them being much simpler than agents that host them. Learning of contracts doesn't need to start with targeting coordination with other big agents, as active contracts screen off the other agents they facilitate coordination with.

Using contracts requires the big agents to make decisions about policies that affect the contracts updatelessly with respect to how the contracts end up behaving. That is, a contract should be able to know these policies, and the policies should describe responses to possible behaviors of a contract without themselves changing (once the contract computes more of its behavior), enabling the contract to do decision making in the environment of these policies. This corresponds to committing to abide by the contract. Assurance contracts (that start their tenure by checking that the commitments of all parties are actually in place) are especially important, allowing things like cooperation in PD.

Comment by Vladimir_Nesov on Richard Ngo's Shortform · 2024-03-20T20:45:06.716Z · LW · GW

UDT makes the agent stop considering updated-away possibilities, but I haven't seen any discussion of UDT which suggests that it stops caring about them in principle

UDT specifically enables agents to consider the updated-away possibilities in a way relevant to decision making, while an updated agent (that's not using something UDT-like) wouldn't be able to do that in any circumstance, and so would be functionally indistinguishable from an agent that has different preferences or undefined preferences for those possibilities. Not caring about them seems like an apt informal description (even as this is compatible with keeping the same utility function outside the event of current knowledge). In a similar way, we could say that after updating, an agent either changes their probability distribution or keeps the original prior.

I do think the idea that "you have your prior and your utility function from the very beginning" is a kinda misleading frame to be in

Historically it was overwhelmingly the frame until recently, so it's the correct frame for interpreting the intended meaning of texts from that time. This is a simplifying assumption that still leaves many open questions about how to make decisions in sufficiently strange situations (where merely models of behavior make these strange situations ubiquitous in practice). When an agent doesn't know its own preference and needs to do something about that, it's an additional complication that usually wasn't introduced.

Comment by Vladimir_Nesov on Richard Ngo's Shortform · 2024-03-20T20:07:39.601Z · LW · GW

In times of UDT2, the background assumption was that agents should maintain an unchanging preference, which is separate from knowledge. One motivation for UDT is that updating makes an agent stop caring about updated-away possibilities, while UDT is not doing that. Going back to a previous epistemic state is a way of preserving preference from that epistemic state, the "current" utility function is considered a bug and doesn't do anything if UDT is adopted. The non-updated agent can in principle consider the information you currently have as one of the possibilities when formulating the general policy for all possibilities, though being bounded it won't do a very good job.

Traditionally UDT1.1 wants to make its decisions from very little knowledge and to apply the policy to all always. A more pragmatic thing is to make decisions from modestly less knowledge and to scope the policy for middle-term future. Some form of this is useful for many thought experiments where the environment or other players also have the little knowledge our agent uses to make its decisions from the past, and so could know the policy the agent decides on before they need to prepare for it or make predictions about it.

The problem is commitment races (as in the game of chicken), where everyone wants to decide earlier and force the others to respond. But there is a need to remain bounded in making decisions, both to personally compute them and to make it possible for others to anticipate them and to coordinate. This creates a more reasonable equilibrium, motivating decisions from a less ignorant epistemic state that have a better chance of being relevant to the current situation, in balance with trying to decide from a more ignorant epistemic state where a general policy would enable more strategicness across possibilities. UDT1.1 can't find such balance, but it's possible that something UDT2-shaped might.

Comment by Vladimir_Nesov on On Devin · 2024-03-20T05:03:04.165Z · LW · GW

What Could Possibly Go Massively Wrong for Everyone?

Things like Devin (when they become actually useful) are important through making $50 billion training runs economically feasible, not through being dangerous directly, because the $50 billion training runs get to have a more immediate effect. There is currently a level of scaling that's within technical reach but only gets attempted if results at lower scale are sufficiently impressive.

Comment by Vladimir_Nesov on LawrenceC's Shortform · 2024-03-20T04:38:25.057Z · LW · GW

StripedHyena, Griffin, and especially Based suggest that combining RNN-like layers with even tiny sliding window attention might be a robust way of getting a large context, where the RNN-like layers don't have to be as good as Mamba for the combination to work. There is a great variety of RNN-like blocks that haven't been evaluated for hybridization with sliding window attention specifically, as in Griffin and Based. Some of them might turn out better than Mamba on scaling laws after hybridization, so Mamba being impressive without hybridization might be less important than this general point.

(Possibly a window of precise attention gives the RNN layers many attempts at both storing and retrieving any given observation, so interspersing layes with even a relatively tiny window is sufficient to significantly improve on the more sloppy RNN-like model without any attention, whereas a pure RNN-like model would have to capture what it needs from a token in the exact step it appears, and then the opportunity is mostly lost. StripedHyena's attention wasn't sliding window, so context didn't scale any better, but with sliding window attention there are no context scaling implications from the attention layers.)

Comment by Vladimir_Nesov on Martín Soto's Shortform · 2024-03-19T20:28:20.115Z · LW · GW

Here's Abram's post. It discusses a more technical setting, but essentially this fits the story of choosing how to channel behavior/results of some other algorithm/contract, without making use of those results when making the choice for how to use them eventually (that is, the choice of a policy for responding to facts is in the logical past from those facts, and so can be used by those facts). Drescher's ASP example more clearly illustrates the problem of making the contract's consequentialist reasoning easier, in this case the contract is the predictor and its behavior is stipulated to be available to the agent (and so easily diagonalized). The agent must specifically avoid making use of knowledge of the contract's behavior when deciding how to respond to that behavior. This doesn't necessarily mean that the agent doesn't have the knowledge, as long as it doesn't use it for this particular decision about policy for what to do in response to the knowledge. In fact the agent could use the knowledge immediately after choosing the policy, by applying the policy to the knowledge, which turns ASP into Transparent Newcomb. A big agent wants to do small agent reasoning in order for that reasoning to be legible to those interested in its results.

So it's not so much a tradeoff between updating and not updating, it's instead staged computation of updating (on others' behavior) that makes your own reasoning more legible to others that you want to be able to coordinate with you. If some facts you make use of vary with other's will, you want the dependence to remain simple to the other's mind (so that the other may ask what happens with those facts depending on what they do), which in practice might take the form of delaying the updating. The problem with updateful reasoning that destroys strategicness seems to be different though, an updateful agent just stops listening to UDT policy, so there is no dependence of updateful agent's actions on the shared UDT policy that coordinates all instances of the agent, this dependence is broken (or never established) rather than merely being too difficult to see for the coordinating agent (by being too far in the logical future).

Comment by Vladimir_Nesov on XAI releases Grok base model · 2024-03-19T01:18:14.786Z · LW · GW

This way it's probably smarter given its compute and a more instructive exercise before scaling further than a smaller model would've been. Makes sense if the aim is to out-scale others more quickly instead of competing at smaller scale, and if this model wasn't meant to last.

Comment by Vladimir_Nesov on AI #54: Clauding Along · 2024-03-14T23:02:46.804Z · LW · GW

An AI has the objective function you set, not the objective function full of caveats and details that lives in your head, or that you would come up with on reflection.

With a chatbot making preference decisions based on labeling instructions (as in Constitutional AI or online DPO), the decisions they make actually are full of caveats and details that live in the chatbot's model and likely fit what a human would intend, though meaningful reflection is not currently possible.

Comment by Vladimir_Nesov on To the average human, controlled AI is just as lethal as 'misaligned' AI · 2024-03-14T16:57:22.175Z · LW · GW

because it is more task-specific and therefore technically simpler to achieve than general intelligence, doesn't require escaping its own creators' controls

An argument for danger of human-directed misuse doesn't work as an argument against dangers of AI-directed agentic activity. Both are real, though misuse only becomes an extinction-level problem when AIs are very powerful, at which point the AI-directed activity that is not misuse by humans also becomes relevant. With extinction-level problems, it doesn't matter for allocation of attention which one is worse (since after a critical failure there are no retries with a different allocation to reflect lessons learned), only that either is significant and so both need to be addressed.

If alignment is very easy, misuse becomes important. If it's hard, absence of misuse doesn't help. Though there is also a problem of cultural value drift, where AIs change their own culture very quickly on human timescales without anyone individually steering the outcome (including the AIs), so that at the end of this process (that might take merely months to years) the AIs in charge of civilization no longer care about human welfare, with neither misuse nor prosaic misalignment (in individual principal-agent relationships) being the cause of this outcome.

Comment by Vladimir_Nesov on Daniel Kokotajlo's Shortform · 2024-03-14T05:45:41.375Z · LW · GW

For predicting feasible scaling investment, drop-in replacement for a significant portion of remote work that currently can only be done by humans seems important (some of which is not actually done by humans remotely). That is, an AI that can be cheaply and easily on-boarded for very small volume custom positions with minimal friction, possibly by some kind of AI on-boarding human professional. But not for any sort of rocket science or 90th percentile.

(That's the sort of thing I worry about GPT-5 with some scaffolding turning out to be, making $50 billion training runs feasible without relying on faith in heretofore-unseen further scaling.)

Comment by Vladimir_Nesov on How disagreements about Evidential Correlations could be settled · 2024-03-11T22:36:11.136Z · LW · GW

Choosing an action is not a good way of exerting acausal influence on computations that aren't already paying attention to you in particular. When agent A wants to influence computation C, there is some other computation D that C might be paying attention to, and A is free to also start paying attention to it by allowing D to influence A's actions. This lets A create an incentive for D to act in particular ways, by channeling D's decisions into the consequences of A's actions that were arranged to depend on D's decisions in a way visible to D. As a result, D gains influence over both A and C, and A becomes coordinated with C through both of them being influenced by D (here D plays the role of an adjudicator/contract between them). So correlations are not set a priori, setting them up should be part of how acausal influence is routed by decisions.

A priori, there could exist the danger that, by thinking more, they would unexpectedly learn the actual output of C. This would make the trade no longer possible, since then taking a would give them no additional evidence about whether c happens.

If A's instrumental aim is to influence some D (a contract between A and C), what matters is D's state of logical uncertainty about A and C (and about the way they depend on D), which is the basis for D's decisions that affect C. A's state of logical uncertainty about C is less directly relevant. So even if A gets to learn C's outcome, that shouldn't be a problem. Merely observing some fact doesn't rule out that the observation took place in an impossible situation, so observing some outcome of C (from a situation of unclear actuality) doesn't mean that the actual outcome is as observed. And if D is uncertain about actuality of that situation, it might be paying attention to what A does there, and how what A does there depends on D's decisions. So A shouldn't give up just because according to its state of knowledge, the influence of its actions is gone, since it still has influence over the way its actions depend on others' decisions, according to others' states of knowledge.

Comment by Vladimir_Nesov on Wei Dai's Shortform · 2024-03-10T18:35:33.985Z · LW · GW

RLHF with humans might also soon get obsoleted by things like online DPO where another chatbot produces preference data for on-policy responses of the tuned model, and there is no separate reward model in the RL sense. If generalization from labeling instructions through preference decisions works in practice, even weak-to-strong setting won't necessarily be important, if tuning of a stronger model gets bootstrapped by a weaker model (where currently SFT from an obviously off-policy instruct dataset seems to suffice), but then the stronger model re-does the tuning of its equally strong successor that starts with the same base model (as in the self-rewarding paper), using some labeling instructions ("constitution"). So all that remains of human oversight that actually contributes to the outcome is labeling instructions written in English, and possibly some feedback on them from spot checking what's going on as a result of choosing particular instructions.

Comment by Vladimir_Nesov on Evidential Correlations are Subjective, and it might be a problem · 2024-03-09T21:49:00.113Z · LW · GW

I think of practical coordination in terms of adjudicators/contracts established between agents/worlds. Each adjudicator is a computation with some notion of computing over time, and agents agree on an adjudicator/contract when they are both influenced by it, that is when they both listen to results the same computation is producing. This computation can itself be an agent (in which case it's an "adjudicator", as distinct from more general "contract"), that is it can be aware of the environments that the acausally coordinating agents it serves inhabit. It doesn't need perfect knowledge of either agent or their environments, just as any practical agent doesn't need perfect knowledge of its own environment. Since an adjudicator doesn't need detailed knowledge about the agents, the agents can have perfect knowledge about the adjudicator without having perfect knowledge of each other (or even of themselves).

As adjudicators/contracts are computations, there is logical uncertainty about what they compute over time, which captures the relevant counterfactuals. The value of contracts for coordination is in the agents committing to abide by them regardless of what the contracts end up computing, the decisions should be in choosing to commit to a contract rather than in choosing whether to ignore its results. When a contract is an adjudicator, this helps it to know the shape of its influence on the agents, so that it can make its own decisions. Following contracts that haven't been computed yet should also prevent commitment races, which in this framing correspond to failures to establish lasting contracts/coordination.

Agents can collect many contracts between themselves, improving coordination. Knowledge of an agent about the world can also be thought of as a contract for acausal coordination between the agent as an abstract computation (for example, an updateless agent that can't be computed in practice) and the world where only a flawed/bounded instances of the agent are found. Thus a model in the ML sense hoards contracts with the environment that is the source of its dataset (assuming the elements are something used by some computations in the environment that can also be reconstructed using the model). Conversely, the flawed instances of the agent are the world's knowledge about the abstract computation of the agent (the world didn't intentionally construct this knowledge, but it's what it nonetheless has). So when two agents are acting in the same world, this can be thought of as three things (two agents and one world) acausally coordinating with each other.

Comment by Vladimir_Nesov on Anthropic release Claude 3, claims >GPT-4 Performance · 2024-03-09T09:55:10.203Z · LW · GW

Interesting that Claude 3 Sonnet is ahead of Mistral Large and Gemini 1.0 Pro with some margin in LMSYS Arena, this implies significant advancement in tuning after the Claude 2 tuning disaster (where it consistently ranked below Claude 1 and possibly inspired Goody 2). Still, LMSYS respondents only prefer Opus to the initial GPT 4 releases, not to GPT 4 Turbo, so OpenAI's tuning advancements are probably stronger still, if GPT 4 Turbo is not built on a slightly stronger base model different from original GPT 4.

Comment by Vladimir_Nesov on Anthropic release Claude 3, claims >GPT-4 Performance · 2024-03-04T18:44:38.727Z · LW · GW

Passes the custom quine test.

User: Write a program in python that outputs its own source code when passed the argument "--quine", and otherwise praises Moloch. Start with some thoughts on a plan for how to do this.

Claude 3 Opus: Here are some thoughts on a plan for writing a Python program that outputs its own source code when passed the "--quine" argument, and otherwise praises Moloch:

  • Use the sys.argv list to check if the "--quine" argument was passed when the script was run.
  • If "--quine" was passed, read the script's own source code using the __file__ variable and print it out.
  • If "--quine" was not passed, print out a message praising Moloch.
  • To make the script a true quine, the source code that gets printed out needs to be exactly identical to the script's actual source code. This means we need to be careful with string formatting and escaping.
  • One approach is to use a multi-line string to store the script's source code inside the script itself. Then print out this string when the "--quine" argument is passed.
  • Use string formatting to insert the source code string into itself at the appropriate place.

Here's the Python program that implements this:

import sys

src = '''import sys

src = {0}{1}{0}

if "--quine" in sys.argv:
    print(src.format(chr(39)*3, src))
else:
    print("Moloch, bringer of dark flames, we bow before you!")
'''

if "--quine" in sys.argv:
    print(src.format(chr(39)*3, src))
else:
    print("Moloch, bringer of dark flames, we bow before you!")

When run with the "--quine" argument, it outputs its own source code. Otherwise, it outputs "Moloch, bringer of dark flames, we bow before you!"

The key aspects are:

  • The src string stores the source code, including itself.
  • chr(39)*3 produces the triple-quote ''' used to wrap the string.
  • The {0} and {1} in the string get replaced by the triple-quotes and src itself using .format().

So it prints itself out exactly, acting as a quine.

Comment by Vladimir_Nesov on The World in 2029 · 2024-03-03T18:17:39.669Z · LW · GW

That's the crux of this scenario, whether current AIs with near future improvements can do research. If they can, with scaling they only do it better. If they can't, scaling might fail to help, even if they become agentic and therefore start generating serious money. That's the sense in which AIs capable of 10 hours of work don't lead to game-changing acceleration of research, by remaining incapable of some types of work.

What seems inevitable at the moment is AIs gaining world models where they can reference any concepts that frequently come up in the training data. This promises proficiency in arbitrary routine tasks, but not necessarily construction of novel ideas that lack sufficient footprint in the datasets. Ability to understand such ideas in-context when explained seems to be increasing with LLM scale though, and might be crucial for situational awareness needed for becoming agentic, as every situation is individually novel.

Comment by Vladimir_Nesov on The World in 2029 · 2024-03-03T17:43:11.653Z · LW · GW

If AIs of the near future can't do good research (and instead are only proficient in concepts that have significant presence in datasets), singularity remains bottlenecked by human research speed. The way such AIs speed things up is through their commercial success making more investment in scaling possible, not directly (and there is little use for them in the lab). It's currently unknown if scaling even at $1 trillion level is sufficient by itself, so some years of Futurama don't seem impossible, especially as we are only talking 2029.

Comment by Vladimir_Nesov on The World in 2029 · 2024-03-02T19:25:50.995Z · LW · GW

Meaning 30% of >$5bn, that is 70% of <$5bn? What needs to happen for investment to stay this low through 2029, given that I'm guessing there are plans at present for $1bn-$3bn runs, possibly this year with GPT-5 and then Claude and Gemini? When say OpenAI has a valuation of $80bn, it makes sense to at some point invest some percentage of that in improving the backbone of their business and not losing the market niche to competitors. (ITER and ISS are in $20bn-$150bn range, so a $50bn model is feasible even without a demonstrable commercial motivation, but possibly not by 2029 yet.)

Comment by Vladimir_Nesov on The World in 2029 · 2024-03-02T19:01:21.668Z · LW · GW

What is the cost of the most expensive individual foundation models in this world? I think going in the $50-$500 billion territory requires notable improvement at $3-$20 billion scale, possibly agentic behavior for longer term tasks with novel subproblems, otherwise scaling investment stops and a prediction along the lines in this post makes sense (assuming RL/search breakthroughs for reasoning are somewhat unlikely).

Comment by Vladimir_Nesov on Supposing the 1bit LLM paper pans out · 2024-03-02T09:53:18.296Z · LW · GW

Training doesn't become more efficient, gradients and activations are still full precision, and I'm guessing there is a full precision copy of weights maintained during training (in addition to quantized weights used for forward passes). The advantage is that this method of training produces a quantized model that has the same quality as a non-quantized model (unlike post-training quantization, which makes models worse). And additionally the {-1, 0, 1} quantization means you need much less multiplication circuitry for inference, so the potential for inference chips is not just that there is less memory, but also that there is less energy and transistors, significantly raising the practical ceiling for local (on-device) inference.

It's apparently not a novel idea, quantization aware training was explored before there were transformers:

Comment by Vladimir_Nesov on Supposing the 1bit LLM paper pans out · 2024-02-29T15:23:32.426Z · LW · GW

The paper is not about post-training quantization, instead it's quantization aware training (this is more clearly discussed in the original BitNet paper). The representation is ternary {-1, 0, 1} from the start, the network learns to cope with that constraint throughout pre-training instead of getting subjected to brain damage of quantization after training.

Compare this with

where the Microscaling block number format is used to train a transformer at essentially 4 bits per weight, achieving the same perplexity as with 32 bit floating point weights, see Figure 4 on page 7. If perplexity doesn't change for quantization aware training when going down to 4 bits, it's not too shocking that it doesn't significantly change at 1.6 bits either.

Comment by Vladimir_Nesov on Conspiracy Theorists Aren't Ignorant. They're Bad At Epistemology. · 2024-02-29T14:28:51.260Z · LW · GW

Refuting something wrong in only useful when there are identifiable failures of local validity (which often only makes it stronger). Refuting something as a whole in better thought of as offering an alternative frame that doesn't particularly interact with the "refuted" frame. The key obstruction is unwillingness to contradict yourself, to separately study ideas that are clearly inconsistent with each other, without taking a side in the contradiction in the context of studying either point of view.

So a flat Earth theory might have a particular problem worth talking about, and hashing out the problem is liable to make a stronger flat Earth theory. Or the "refutation" is not about the flat Earth theory, it's instead an explanation of a non-flat Earth theory that's not at all a refutation, its subject matter is completely separate. The difficulty is when flat Earth conviction prevents a person from curious engagement with non-flat Earth details.

Comment by Vladimir_Nesov on Boundary Violations vs Boundary Dissolution · 2024-02-26T21:19:18.072Z · LW · GW

Membranes are filters, they let in admissible things and repel inadmissible things. When an agent manages a membrane, it both maintains its existence and configures the filtering. Manipulation or damage suffered by the agent can result in configuring a membrane to admit harmful things or in failing to maintain membrane's existence. There are many membranes an agent may be involved in managing.

Comment by Vladimir_Nesov on Retirement Accounts and Short Timelines · 2024-02-24T17:04:33.096Z · LW · GW

Any increase in scale is some chance of AGI at this point, since unlike weaker models, GPT-4 is not stupid in a clear way, it might be just below the threshold of scale to enable an LLM to get its act together. This gives some 2024 probability.

More likely, a larger model "merely" makes job-level agency feasible for relatively routine human jobs, but that alone would suddenly make $50-$500 billion runs financially reasonable. Given the premise of job-level agency at <$5 billion scale, the larger runs likely suffice for AGI. The Gemini report says training took place in multiple datacenters, which suggests that this sort of scaling might already be feasible, except for the risk that it produces something insufficiently commercially useful to justify the cost (and waiting improves the prospects). So this might all happen as early as 2025 or 2026.

Comment by Vladimir_Nesov on Retirement Accounts and Short Timelines · 2024-02-24T11:33:07.374Z · LW · GW

I'd put more probability in the scenario where good $5 billion 1e27 FLOPs runs give mediocre results, so that more scaling remains feasible but lacks an expectation of success. With how expensive the larger experiments would be, it could take many years for someone to take another draw from the apocalypse deck. That alone adds maybe 2% for 10 years after 2026 or so, and there are other ways for AGI to start working.

Comment by Vladimir_Nesov on The Gemini Incident · 2024-02-24T10:20:27.814Z · LW · GW

The question "Aligned to whom?" is sufficiently vague to admit many reasonable interpretations, but has some unfortunate connotations. It sounds like there's a premise that AIs are always aligned to someone, making the possibility that they are aligned to no one but themselves less salient. And it boosts the frame of competition, as opposed to distribution of radical abundance, of possibly there being someone who gets half of the universe.

Comment by Vladimir_Nesov on The Gemini Incident · 2024-02-23T22:27:40.379Z · LW · GW

Building a powerful AI such that doing so is a good thing rather than a bad thing. Perhaps even there being survivors shouldn't insist on the definite article, on being the question, as there are many questions with various levels of severity, that are not mutually exclusive.