Posts

What Other Lines of Work are Safe from AI Automation? 2024-07-11T10:01:12.616Z
A "Bitter Lesson" Approach to Aligning AGI and ASI 2024-07-06T01:23:22.376Z
7. Evolution and Ethics 2024-02-15T23:38:51.441Z
Requirements for a Basin of Attraction to Alignment 2024-02-14T07:10:20.389Z
Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis 2024-02-01T21:15:56.968Z
Approximately Bayesian Reasoning: Knightian Uncertainty, Goodhart, and the Look-Elsewhere Effect 2024-01-26T03:58:16.573Z
A Chinese Room Containing a Stack of Stochastic Parrots 2024-01-12T06:29:50.788Z
Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI? 2024-01-11T12:56:29.672Z
Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor 2024-01-09T20:42:28.349Z
Striking Implications for Learning Theory, Interpretability — and Safety? 2024-01-05T08:46:58.915Z
5. Moral Value for Sentient Animals? Alas, Not Yet 2023-12-27T06:42:09.130Z
Interpreting the Learning of Deceit 2023-12-18T08:12:39.682Z
Language Model Memorization, Copyright Law, and Conditional Pretraining Alignment 2023-12-07T06:14:13.816Z
6. The Mutable Values Problem in Value Learning and CEV 2023-12-04T18:31:22.080Z
After Alignment — Dialogue between RogerDearnaley and Seth Herd 2023-12-02T06:03:17.456Z
How to Control an LLM's Behavior (why my P(DOOM) went down) 2023-11-28T19:56:49.679Z
4. A Moral Case for Evolved-Sapience-Chauvinism 2023-11-24T04:56:53.231Z
3. Uploading 2023-11-23T07:39:02.664Z
2. AIs as Economic Agents 2023-11-23T07:07:41.025Z
1. A Sense of Fairness: Deconfusing Ethics 2023-11-17T20:55:24.136Z
LLMs May Find It Hard to FOOM 2023-11-15T02:52:08.542Z
Is Interpretability All We Need? 2023-11-14T05:31:42.821Z
Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom) 2023-05-25T09:26:31.316Z
Transformer Architecture Choice for Resisting Prompt Injection and Jail-Breaking Attacks 2023-05-21T08:29:09.896Z
Is Infra-Bayesianism Applicable to Value Learning? 2023-05-11T08:17:55.470Z

Comments

Comment by RogerDearnaley (roger-d-1) on Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural · 2024-07-22T23:58:40.961Z · LW · GW

That's not necessarily required. The Scientific Method works even if the true "Unified Field Theory" isn't yet under consideration, merely some theories that are closer to it and others further away from it: it's possible to make iterative progress.

In practice, considered as search processes, the Scientific Method, Bayesianism, and stochastic gradient descent all tend to find similar answers; yet unlike Bayesianism, gradient descent doesn't explicitly consider every point in the space (including the true optimum), it just searches for nearby better points. It can of course get trapped in local minima: Singular Learning Theory highlights why that's less of a problem in practice than it sounds in theory.


The important question here is how good an approximation the search algorithm in use is to Bayesianism. As long as the AI understands that what it's doing is (like the scientific method and stochastic gradient descent) a computationally efficient approximation to the computationally intractable ideal of Bayesianism, then it won't resist the process of coming up with new possibly-better hypotheses; it will instead regard that as a necessary part of the process (like hypothesis creation in the scientific method, the mutation/crossover steps in an evolutionary algorithm, or the stochastic batch noise in stochastic gradient descent).

Comment by RogerDearnaley (roger-d-1) on Stitching SAEs of different sizes · 2024-07-22T23:48:48.659Z · LW · GW

Cool! That makes a lot of sense. So does it in fact split into three before it splits into 7, as I predicted based on dimensionality? I see a green dot, three red dots, and seven blue ones… On the other hand, the triangle formed by the three red dots is a lot smaller than the heptagram, which I wasn't expecting…

I notice it's also an oddly shaped heptagram.

Comment by RogerDearnaley (roger-d-1) on Using an LLM perplexity filter to detect weight exfiltration · 2024-07-22T23:42:47.384Z · LW · GW

This seems like it would be helpful: the adversary can still export data, for example encoded steganographically in otherwise-low-perplexity text, but this limits the information density they can transmit, making the process less efficient for them and making constraints like upload limits tighter.
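For concreteness, here is a minimal sketch of the kind of filter being discussed (my own illustration, not the post's implementation; the reference model and threshold are purely illustrative assumptions): score each outgoing message with a small reference LM and block anything whose perplexity is suspiciously high, since densely steganographically-encoded weights can't also look like ordinary low-perplexity prose.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"          # assumed reference model; any cheap LM would do
PERPLEXITY_THRESHOLD = 80.0  # illustrative; would need tuning on real traffic

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def perplexity(text: str) -> float:
    """Per-token perplexity of `text` under the reference model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=input_ids gives the mean next-token cross-entropy.
        loss = model(ids, labels=ids).loss
    return math.exp(loss.item())

def allow_outbound(text: str) -> bool:
    """Pass only traffic that looks like ordinary natural language."""
    return perplexity(text) < PERPLEXITY_THRESHOLD
```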

One other thing that would make this even harder is if we change the model weights regularly, in ways where combining parts exfiltrated from separate models is hard. We know from Singular Learning Theory that the optima found by stochastic gradient descent tend to have high degrees of symmetry. Some of these (like, say, permuting all the neurons in a layer along with their weights) are discrete, obvious, and both easy to generate and fairly easy for an attacker to compensate for. Others, such as adjusting various weights in various layers in ways that compensate for each other, are continuous symmetries, harder to generate, and would be harder for an attacker to reason about. If we had an efficient way to explore these continuous symmetries (ideally one that's easy to implement given a full set of the model weights, but hard to reconstruct from multiple partial pieces of multiple equivalent models), then we could explore this high-dimensional symmetry space of the model optimum and create multiple equivalent-but-not-easy-to-piece-together sets of weights, and rotate between them over time (and/or deploy different ones to different instances) in order to make the task of exfiltrating the weights even harder.

So, to anyone who knows more about SLT than I do, a computationally efficient way to explore the continuous symmetries (directions in which both the slope and the Hessian are flat) of the optimum of a trained model could be very useful.
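As a toy illustration of the kind of continuous symmetry I mean (this is just the simplest well-known case, not the SLT machinery I'm asking about): in a plain ReLU MLP, scaling one hidden unit's incoming weights and bias by a > 0 and its outgoing weights by 1/a leaves the computed function exactly unchanged, since relu(a*x) = a*relu(x) for a > 0. Re-randomizing these per-unit scales gives arbitrarily many weight sets that compute the identical function; doing anything comparable for real transformer blocks (with LayerNorm, GELU, attention, etc.) is the harder problem.

```python
import torch
import torch.nn as nn

def rescale_relu_mlp(fc1: nn.Linear, fc2: nn.Linear, scale_range=(0.5, 2.0)):
    """Apply a random positive per-hidden-unit rescaling, in place.
    This is a continuous symmetry: the network's input-output map is unchanged."""
    a = torch.empty(fc1.out_features).uniform_(*scale_range)  # one scale per hidden unit
    with torch.no_grad():
        fc1.weight.mul_(a.unsqueeze(1))   # incoming weights  * a
        fc1.bias.mul_(a)                  # bias              * a
        fc2.weight.div_(a.unsqueeze(0))   # outgoing weights  / a

# Sanity check that the function really is unchanged:
net = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 4))
x = torch.randn(5, 8)
before = net(x)
rescale_relu_mlp(net[0], net[2])
assert torch.allclose(before, net(x), atol=1e-5)
```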

Comment by RogerDearnaley (roger-d-1) on Recursion in AI is scary. But let’s talk solutions. · 2024-07-17T03:23:30.143Z · LW · GW

In practice, most current AIs are not constructed entirely by RL, partly because it has instabilities like this. For example, instruction-training LLMs by RLHF uses a KL-divergence loss term to limit how dramatically the RL can alter the base model behavior trained by SGD. So the result deliberately isn't pure RL.
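For concreteness, the standard textbook form of that KL term (not any particular lab's code) looks something like this: the reward the RL stage optimizes is the preference-model score minus beta times the per-token KL divergence from the frozen base/SFT reference model.

```python
import torch

def kl_shaped_reward(reward_model_score: torch.Tensor,
                     policy_logprobs: torch.Tensor,
                     reference_logprobs: torch.Tensor,
                     beta: float = 0.1) -> torch.Tensor:
    """
    reward_model_score: (batch,) scalar preference-model score per completion.
    policy_logprobs, reference_logprobs: (batch, seq_len) log-probs of the
        sampled tokens under the RL policy and the frozen reference model.
    """
    # Per-token KL estimate on the sampled tokens: log pi_policy - log pi_ref.
    kl_per_token = policy_logprobs - reference_logprobs
    kl_penalty = beta * kl_per_token.sum(dim=-1)          # (batch,)
    # The penalty pulls the policy back toward the base model, which is what
    # limits how far RL can drag behavior away from the SGD-trained model.
    return reward_model_score - kl_penalty
```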

Yes, if you take a not-yet intelligent agent, train it using RL, and give it unrestricted access to a simple positive reinforcement avenue unrelated to the behavior you actually want, it is very likely to "wire-head" by following that simple maximization path instead. So people do their best not to do that when working with RL.

Comment by RogerDearnaley (roger-d-1) on Stitching SAEs of different sizes · 2024-07-17T03:09:10.242Z · LW · GW

What I would be interested to understand about feature splitting is whether the fine-grained features are alternatives, describing an ontology, or are defining a subspace (corners of a simplex, like R, G, and B defining color space). Suppose a feature X in a small SAE is split into three features X1, X2, and X3 in a larger SAE for the same model. If occurrences of X1, X2, and X3 are correlated, so activations containing any of them commonly have some mix of them, then they span a 2d subspace (in this case the simplex is a triangle). If, on the other hand, X1, X2, and X3 co-occur in an activation only rarely (just as two randomly-selected features rarely co-occur), then they describe three similar-but-distinct variations on a concept, and X is the result of coarse-graining these together as a single concept at a higher level in an ontology tree (so by comparing SAEs of different sizes we can generate a natural ontology).

This seems like it would be a fairly simple, objective experiment to carry out; see the sketch at the end of this comment. (Maybe someone already has, and can tell me the result!) It is of course quite possible that some split features describe subspaces and others describe ontologies, or indeed something between the two where the features co-occur rarely but less rarely than two random features. Or X1 could be distinct while X2 and X3 blend to span a 1-d subspace. Nevertheless, understanding the relative frequency of these different behaviors would be enlightening.

It would be interesting to validate this using a case like the days of the week, where we believe we already understand the answer: they are 7 alternatives laid out in a heptagon in a 2-dimensional subspace that enables doing modular addition/subtraction modulo 7. So if we have an SAE small enough that it represents all day-of-the-week names by a single feature, then if we increase the SAE size somewhat we'd expect to see this split into three features spanning a 2-d subspace, and if we increased it more we'd expect to see it resolve into 7 mutually-exclusive alternatives, and hopefully then stay at 7 in larger SAEs (at least until other concepts started to get mixed in, if that ever happened).
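Here's a minimal sketch of the experiment I have in mind (my own operationalization, with made-up function and variable names, not an existing implementation): given a larger SAE's activation matrix over a token dataset, compare how often the split features fire together against the rate expected if they were as independent as two random features.

```python
import numpy as np

def cooccurrence_ratio(acts: np.ndarray, feature_ids, threshold: float = 0.0) -> float:
    """
    acts: (n_tokens, n_features) feature activations from the larger SAE.
    feature_ids: indices of the features X1..Xk that the smaller SAE's
        single feature X split into.
    Returns P(all fire together, observed) / P(all fire together, if independent).
    """
    fires = acts[:, feature_ids] > threshold        # (n_tokens, k) booleans
    p_each = fires.mean(axis=0)                     # marginal firing rates
    p_joint_observed = fires.all(axis=1).mean()     # how often they all co-fire
    p_joint_independent = np.prod(p_each)           # baseline if unrelated
    return float(p_joint_observed / max(p_joint_independent, 1e-12))

# Ratio >> 1: the split features tend to co-occur, suggesting they span a
# subspace (the simplex / color-space case). Ratio ~ 1: they behave like
# unrelated features, i.e. mutually-exclusive alternatives in an ontology.
```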

Comment by RogerDearnaley (roger-d-1) on Recursion in AI is scary. But let’s talk solutions. · 2024-07-17T02:19:32.312Z · LW · GW

AI that is trained by human teachers, giving it rewards will eventually wirehead, as it becomes smarter and more powerful, and its influence over its master increases. It will, in effect, develop the ability to push its own “reward” button. Thus, its behavior will become misaligned with whatever its developers intended.

This seems like an unproven statement. Most humans are aware of the possibility of wireheading, both the actual wire version and the more practical versions involving psychotropic drugs. The great majority of humans don't choose to do that to themselves. Assuming that AI will act differently seems like an unproven assumption, one which might, for example, be justified for some AI capability levels but not others.

Comment by RogerDearnaley (roger-d-1) on Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural · 2024-07-17T02:14:49.923Z · LW · GW

If you're not already familiar with the literature on Value Learning, I suggest reading some of it. The basic idea is that goal modification is natural if what the agent has is not a detailed specification of a goal (such as a utility function mapping descriptions of world states to their utility), but instead a simple definition of a goal (such as "want whatever outcomes the humans want") that makes it clear the agent does not yet know the true detailed utility function. That definition then requires the agent to go and find out what the detailed specification of the utility function pointed to by the goal is (for example, by researching what outcomes humans want).

Then a human shutdown instruction becomes the useful information "you have made a large error in your research into the utility function, and as a result are doing harm, please shut down and let us help you correct it". Obeying that is then natural (to the extent that the human(s) are plausibly more correct than the AI).

Comment by RogerDearnaley (roger-d-1) on What Other Lines of Work are Safe from AI Automation? · 2024-07-13T23:47:46.307Z · LW · GW

In my attempted summary of the discussion, I rolled this into Category 5.

Comment by RogerDearnaley (roger-d-1) on What Other Lines of Work are Safe from AI Automation? · 2024-07-13T23:23:11.836Z · LW · GW

There has been a lot of useful discussion in the answers and comments, which has caused me to revise and expand parts of my list. So that readers looking for practical career advice don't have to read the entire comments section to find the actual resulting advice, it seemed useful to me to give a revised list. Doing this as an answer in the context of this question seems better than either making it a whole new post, or editing the list in the original post in a way that would remove the original context of the answers and comments discussion.

This is my personal attempt to summarize the answers and comments discussion: other commenters may not agree (and are of course welcome to add comments saying so). As the discussion continues and changes my opinion, I will keep this version of the list up to date (even if that requires destroying the context of any comments on it).

List of Job Categories Safe from AI/Robots (Revised)

  1. Doing something that machines can do better, but that people are still willing to pay to watch a very talented/skilled human do about as well as any human can (on TV or in person).

    Examples: chess master, Twitch streamer, professional athlete, Cirque du Soleil performer.

    Epistemic status: already proven for some of these: machines have been able to do the first two better than any human for a while, yet people still pay to watch a human do them about as well as any human can. For the others, where current robotics is not yet up to doing better, it also seems very plausible.

    Economic limits: If you're not in the top O(1000) people in the world at some specific activity that plenty of people in the world are interested in watching, then you can make roughly no money off this. Despite the aspirations of a great many teenaged boys, being an unusually good (but not amazing) video gamer is not a skill that will make you any money at all.

  2. Doing some intellectual and/or physical work that AI/robots can now do better, but for some reason people are willing to pay at least an order of magnitude more to have it done less well by a human, perhaps because they trust humans better. This could include jobs where people's willingness to pay comes in the form of a legal requirement that certain work be done or supervised by a (suitably skilled/accredited) human (and these requirements have not yet been repealed).

    Examples: doctor, veterinarian, lawyer, priest, babysitter, nanny, nurse, primary school teacher.

    Epistemic status: Many people tell me "I'd never let an AI/a robot do <high-stakes intellectual or physical work> for me/my family/my pets…" They are clearly quite genuine in this opinion, but it's unclear how deeply they have considered the matter. It remains to be seen how long this opinion will last in the presence of a very large price differential when the AI/robot-produced work is actually, demonstrably, just as good if not better.

    Economic limits: I suspect there will be a lot of demand for this at first, and that it will decrease over time, perhaps even quite rapidly (though perhaps slower for some such jobs than others). Requires being reliably good at the job, and at appearing reassuringly competent while doing so.

  3. Giving human feedback/input/supervision to/of AI/robotic work/models/training data, in order to improve, check, or confirm its quality.

    Examples: current AI training crowd-workers, wikipedian (currently unpaid), acting as a manager or technical lead to a team of AI white collar workers, focus group participant, filling out endless surveys on the fine points of Human Values

    Epistemic status: seems inevitable, at least at first.

    Economic limits: I imagine there will be a lot of demand for this at first; I'm rather unsure whether that demand will gradually decline, as the AIs get better at doing things/self-training without needing human input, or will increase over time because the overall economy is growing so fast and/or more capable models need more training data and/or society keeps moving out of its previous distribution so new data is needed. [A lot of training data is needed, more training data is always better, and the resulting models can be used a great many times; however, there is clearly an element of diminishing returns as more data is accumulated, and we're already getting increasingly good at generating synthetic training data.] Another question is whether a lot of very smart AIs can extract a lot of this sort of data from humans without needing their explicit paid cooperation — indeed, perhaps granting permission for them to do so and not intentionally sabotaging this might even become a condition for part of UBI (at which point deciding whether to call allowing this a career or not is a bit unclear).

  4. Skilled participant in an activity that heavily involves interactions between people, where humans prefer to do this with other real humans, are willing to pay a significant premium to do so, and you are sufficiently more skilled/talented/capable/willing to cater to others' demands than the average participant that you can make a net profit off this exchange.

    Examples: director/producer/lead performer for amateur/hobby theater, skilled comedy-improv partner, human sex-worker.

    Epistemic status: seems extremely plausible.

    Economic limits: Net earning potential may be limited, depending on just how much better/more desirable you are as a fellow participant than typical people into this activity, and on the extent to which this can be leveraged in a one-producer-to-many-customers way — however, making the latter factor high is challenging because it conflicts with the human-to-real-human interaction requirement that allows you to out-compete an AI/robot in the first place. Often a case of turning a hobby into a career.
  5. Providing some nominal economic value while being a status symbol, where the primary point is to demonstrate that the employer has so much money they can waste some of it on employing a real human ("They actually have a human maid!")

    This can either be full-time employment as a status-symbol for a specific status-signaler, or you can be making high-status "luxury" goods where owning one is a status signal, or at least has cachet. For the latter, like any luxury good, they need to be rare: this could be because they are individually hand-made and/or were specifically commissioned by a specific owner, or because they are reproduced only in a "limited edition".

    Examples: (status symbol) receptionist, maid, personal assistant; (status-symbol maker) "High Art" artist, Etsy craftsperson, portrait or commissions artist.

    Epistemic status: human nature (for the full-time version, assuming there are still people this unusually rich).

    Economic limits: For the full-time-employment version, there are likely to be relatively few positions of this type, at most a few per person so unusually rich that they feel a need to show this fact off. (Human nobility used to do a lot of this, centuries back, but there the servants were supplying real, significant economic value, and the being-a-status-symbol component of it was mostly confined to the uniforms the servants wore while doing so.) Requires rather specific talents, including looking glamorous and expensive, and probably also being exceptionally good at your nominal job.

    For the "maker of limited edition human-made goods" version: pretty widely applicable, and can provide a wide range of income levels depending on how skilled you are and how prestigious your personal brand is. Can be a case of turning a hobby into a career.

  6. Providing human-species-specific reproductive or medical services.

    Examples: Surrogate motherhood, wet-nurse, sperm/egg donor, blood donor, organ donor.

    Epistemic status: still needed.

    Economic limits: Significant medical consequences, low demand, improvements in medicine may reduce demand.

Certain jobs could manage to combine two (or more) of these categories. Arguably categories 1. and 5. are subsets of category 2.

Comment by RogerDearnaley (roger-d-1) on What Other Lines of Work are Safe from AI Automation? · 2024-07-13T22:44:33.006Z · LW · GW

I intended to capture that under category 2. "…but for some reason people are willing to pay at least an order of magnitude more to have it done less well by a human, perhaps because they trust humans better…" — the regulatory capture you describe (and those regulations not yet having been repealed) would be a category of reason why (and an expression of the fact that) people are willing to pay more. Evidently that section wasn't clear enough and I should have phrased this better or given it as an example.

As I said above under category 2., I expect this to be common at first but to decrease over time, perhaps even quite rapidly, given the value differentials involved.

Comment by RogerDearnaley (roger-d-1) on Alignment: "Do what I would have wanted you to do" · 2024-07-13T22:34:12.252Z · LW · GW

FWIW, my personal guess is that the kind of extrapolation process described by CEV is fairly stable (in the sense of producing a pretty consistent extrapolation direction) as you start to increase the cognitive resources applied (something twice as smart as a human thinking for ten times as long with access to ten times as much information, say), but may well still not have a single well-defined limit as the cognitive resources used for the extrapolation tend to infinity. Using a (loose, not exact) analogy to a high-dimensional SGD or simulated-annealing optimization problem, the situation may be a basin/valley that looks approximately convex at a coarse scale (when examined with low resources), but actually has many local minima that increasing resources could converge to.

So the correct solution may be some form of satisficing: use CEV with a moderately super-human amount of computation resources applied to it, in a region where it still gives a sensible result. So I view CEV as more a signpost saying "head that way" than a formal description of a mathematical limiting process that clearly has a single well-defined limit.

As for human values being godshatter of evolution, that's a big help: where they are manifestly becoming inconsistent with each other or with reality, you can use maximizing actual evolutionary fitness (which is a clear, well-defined concept) as a tie-breaker or sanity check. [Obviously, we don't want to take that to the point where the human population is growing fast (unless we're doing it by spreading through space, in which case, go for it).]

Comment by RogerDearnaley (roger-d-1) on Alignment: "Do what I would have wanted you to do" · 2024-07-12T23:21:15.518Z · LW · GW

Congratulations! You reinvented from scratch (a single-person version of) Coherent Extrapolated Volition (i.e. without the Coherent part). That's a widely considered candidate solution to the Outer Alignment problem (I believe first proposed by MIRI well over a decade ago).

However, what I think Yoshua was also, or even primarily, talking about is the technical problem of "OK, you've defined a goal — how do you then build a machine that you're certain will faithfully attempt to carry out that goal, rather than something else?", which is often called the Inner Alignment problem. (Note that the word "certain" becomes quite important in a context where a mistake could drive the human race extinct.) Proposals tend to involve various combinations of Reinforcement Learning and/or Stochastic Gradient Descent and/or Good Old-Fashioned AI and/or Bayesian Learning, all of which people (who don't want to go extinct) have concerns about. After that, there's also the problem of: OK, you built a machine that genuinely wants to figure out what you would have wanted to do, and then do it — how do you ensure that it's actually good at figuring that out correctly? This is often, on Less Wrong, called the Human Values problem — evidence suggests that modern LLMs are actually pretty good at at least the base encyclopedic factual knowledge part of that.

Roughly speaking, you have to define the right goal (which, to avoid oversimplifications, generally requires defining it at a meta level as something such as "the limit as some resources tend to infinity of the outcomes of a series of processes like this"), you have to make the machine care about that and not anything else, and then you have to make the machine capable of carrying out the process, to a sufficiently good approximation.

Anyway, welcome down the rabbit-hole: there's a lot to read here.

Comment by RogerDearnaley (roger-d-1) on When is a mind me? · 2024-07-12T22:52:09.555Z · LW · GW

When faced with confusing conundrums like this, I find it useful to go back to basics: evolutionary psychology. You are a human, that is to say, you're an evolved intelligence, one evolved as a helpful planning-and-guidance system for a biological organism, specifically a primate. Your purpose, evolutionarily, is to maximize the evolutionary fitness of your genes, i.e. to try your best to pass them on successfully. You have a whole bunch of drives/emotions/instincts that were evolved to, on the African Savannah, approximately maximize that fitness. Even in our current rather different environment, while not quite as well tuned to that as they used to be to the Savannah, these still do a pretty good job of it (witness the fact that there are roughly 8 billion of us).

So, is an upload of your mind the same "person"? It (if uploaded correctly) shares your memories, drives, and so forth. It will presumably regard you (the organism, and the copy of your mind running on your biological brain, if the uploading process was non-destructive) as somewhere between itself, an identical twin, and a close blood relative. Obviously you will understand each other very well, at least at first before your experiences diverge. So it's presumably likely to be an ally in your (the organism's) well-being and thus to help pass your genes on.

So, is an upload exactly the same thing as your biological mind? No. Is it more similar than an identical twin? Yes. Does the English language have a good set of words to compactly describe this? No.

[Obviously if the mind uploading process is destructive, that makes passing on your genes harder, especially if you haven't yet had any children, and don't have any siblings. Freezing eggs or sperm before doing destructive mind uploading seems like a wise precaution.]
 

Comment by RogerDearnaley (roger-d-1) on What Other Lines of Work are Safe from AI Automation? · 2024-07-12T21:56:36.862Z · LW · GW

This might also be part of why there's a tendency for famous artists to be colorful characters: that enhances the story part of the value of their art.

Comment by RogerDearnaley (roger-d-1) on What Other Lines of Work are Safe from AI Automation? · 2024-07-12T21:40:35.832Z · LW · GW

I think you're right: I have heard this claimed widely about Art, that part of the product and its value is the story of who made it, when and why, who's in it, who commissioned it, who previously owned it, and so forth. This is probably more true at the expensive pinnacles of the Art market, but it's still going to apply within specific subcultures. That's why forgeries are disliked: objectively they look just like the original artist's work, but the story component is a lie.

More generally, luxury goods have a number of weird economic properties, one of which is a requirement that they be rare. Consider the relative value of natural diamonds or other gemstones vs. synthetic ones that are objectively of higher clarity and purity with fewer inclusions: the latter is an objectively better product, but people are willing to pay a lot less for it. People pay a lot more for the former because they're "natural", which is really because they're rare and thus a luxury/status symbol. I think this is an extension of my category 5. — rather than the human artist acting as your status symbol in person as I described above, a piece of their art that you commissioned and that took them a couple of days to make just for you is hanging on your wall (or hiding in your bedroom closet, as the case may be).

There are basically three reasons to own a piece of art:
1) that's nice to look at
2) I feel proud of owning it
3) other people will think better of me because I have it and show it off
The background story doesn't affect 1), but it's important for 2) and 3).

Comment by RogerDearnaley (roger-d-1) on What Other Lines of Work are Safe from AI Automation? · 2024-07-12T21:20:05.921Z · LW · GW

That sounds like good advice — thanks!

Comment by RogerDearnaley (roger-d-1) on What Other Lines of Work are Safe from AI Automation? · 2024-07-12T07:14:42.755Z · LW · GW

The sheer number of Geek Points that This Pony Does Not Exist wins is quite impressive.

Comment by RogerDearnaley (roger-d-1) on What Other Lines of Work are Safe from AI Automation? · 2024-07-12T06:24:17.364Z · LW · GW

I'm still watching this (it's interesting, but 6 hours long!), and will have more comments later.

From his point of view in what I've watched so far, what matters most about the categories of jobs above is to what extent they are critical to the AI/robotic economic growth and could end up being a limiting factor bottleneck on it.

My categories 1. and 4.–6. (for both the original version of 4. and the expanded v2 version in a comment) are all fripperies: if these jobs went entirely unfilled, and the demand for them unfulfilled, the humans would be less happy (probably not by that much), but the AI/robotic economic growth would roar on unabated. Category 2. could matter, but for this category AI/bots can do the job, consumers just strongly prefer a human doing it. So a shortage of humans willing to do these jobs relative to demand would increase the price differential between a human and an AI provider, and sooner or later that would reach the differential at which people are willing to go with the AI option; demand would decrease, and AI/bots would fill the gap and do a good job of it. So this effect is inherently self-limiting, cannot become too much of a bottleneck, and I can't see it being a brake on AI/robotic economic growth rates.

The glaring exception to this is my category 3.: providing more data about human preferences. This is something that the AIs genuinely, fundamentally need (if they're aligned — a paperclip maximizer wouldn't need this). Apart from the possibility of replacing/substituting this data with things like AI-generated synthetic data, or AI historical or scientific research into humans that requires no actual human participation to generate data (or data collection disguised as video games or a virtual environment, say, though that's just using a free-entertainment motivation to participate rather than a salary, so economically it's not that different), it's a vital economic input from the humans into the AI/robotic sector of the economy, and if it became too expensive, it could actually become a bottleneck/limiting factor in the post-AGI economy.

So that means that, for predicting an upper bound on FOOM growth rates, understanding how much and for how long AI needs human data/input/feedback of the type that jobs in category 3. generate, whether this demand decreases or increases over time, and to what extent functionally equivalent data could be synthesized or researched without willing human involvement, is actually a really important question. This could in fact be the Baumol Effect bottleneck that Carl Shulman has been looking for but hadn't found: AI's reliance on (so far, exponentially increasing amounts of) data about humans that can only come from humans.

Comment by RogerDearnaley (roger-d-1) on An AI Manhattan Project is Not Inevitable · 2024-07-12T00:46:09.529Z · LW · GW

Algorithmic improvements relevant to my argument are those that happen after long-horizon task capable AIs are demonstrated, in particular it doesn't matter how much progress is happening now, other than as evidence about what happens later


My apologies, you're right, I had misunderstood you, and thus we've been talking at cross-purposes. You were discussing

…if research capable TAI can lag behind government-alarming long-horizon task capable AI (that does many jobs and so even Robin Hanson starts paying attention)

while I was instead talking about how likely it was that running out of additional money to invest slowed reaching either of these forms of AGI (which I personally view as being likely to happen quite close together, as Leopold also assumes) by enough to make more than a year-or-two's difference.

Comment by RogerDearnaley (roger-d-1) on An AI Manhattan Project is Not Inevitable · 2024-07-11T22:37:36.614Z · LW · GW

I also think the improvements themselves are probably running out.

I disagree, though this is based on some guesswork (and Leopold's analysis, as a recently-ex-insider). I don't know exactly how they're doing it (improvements in training data filtering are probably part of it), but the foundation model companies have all been putting out models with lower inference costs and latencies for the same capability level (OpenAI: GPT-4 Turbo and GPT-4o vs. GPT-4; Anthropic: Claude 3.5 Sonnet vs. the Claude 3 generation; Google: Gemini 1.5 vs. 1). I am assuming that the reason for this performance improvement is that the newer models actually had lower parameter counts (which is supported by some rumored parameter count numbers), and I'm then also assuming that means they had lower total compute to train. (The latter assumption would be false for smaller models trained via distillation from a larger model, as some of the smaller Google models almost certainly are, or heavily overtrained by Chinchilla standards, as has recently become popular for models that are not the largest member of a model family.)

Things like the effectiveness of model pruning methods suggest that there are a lot of wasted parameters inside current models, which would suggest there's still a lot of room for performance improvements. The huge context lengths that foundation model companies are now advertising without huge cost differentials also rather suggest something architectural has happened there, which isn't just full attention quadratic-cost classical transformers. What combination of the techniques from the academic literature, or ones not in the academic literature, that's based on is unclear, but clearly something improved there.

Comment by RogerDearnaley (roger-d-1) on What Other Lines of Work are Safe from AI Automation? · 2024-07-11T22:10:34.907Z · LW · GW

I think active stock-market investing, or running your own company, in a post-AGI world is about as safe as rubbing yourself down in chum before jumping into a shark feeding frenzy. Making money on the stock market is about being better than the average investor at making predictions. If the average investor is an ASI, then you're clearly one of the suckers.

One obvious strategy would be to just buy stock and hold it (which I think may be what you were actually suggesting). But in an economy as turbulent as a post-AGI FOOM, that's only going to work for a certain amount of time before your investments turn sour, and your judgement of when to sell and buy something else puts you back in the stock market losing game.

So I think that leaves something comparable to an ASI-managed fund, or an index fund. I don't know that that strategy is safe either, but it seems less clearly doomed than either of the previous ones.

Comment by RogerDearnaley (roger-d-1) on What Other Lines of Work are Safe from AI Automation? · 2024-07-11T22:02:25.880Z · LW · GW

I didn't say this, but my primary motivation for the question actually has more to do with surviving the economic transition process: if-and-when we get to a UBI-fueled post-scarcity economy, a career becomes just a hobby that also incidentally upgrades your lifestyle somewhat. However, depending on how fast the growth rates during the AGI economic transition are, how fast the government/sovereign AI puts UBI in place, and so forth, the transition could be long-drawn out, turbulent, and even unpleasant, even if we eventually reach a Good End. While personally navigating that period, understanding categories of jobs more or less safe from AGI competition seems like it could be very valuable.

Comment by RogerDearnaley (roger-d-1) on What Other Lines of Work are Safe from AI Automation? · 2024-07-11T21:53:41.356Z · LW · GW

Thanks for the detailed, lengthy (and significantly self-deprecating) analysis of that specific example — clearly you've thought about this a lot. I obviously know far less about this topic than you do, but your analysis, both of likely future AI capabilities and of human reactions to them, sounds accurate to me.

Good luck with your career.

Comment by RogerDearnaley (roger-d-1) on What Other Lines of Work are Safe from AI Automation? · 2024-07-11T21:49:08.295Z · LW · GW

1.All related to parenting and childcare. Most parents may not want a robot to babysit their children. 

Babysitting (and also primary school teaching) were explicitly listed as examples under my item 2. So yes, I agree, with the caveats given there.

Comment by RogerDearnaley (roger-d-1) on What Other Lines of Work are Safe from AI Automation? · 2024-07-11T21:46:56.552Z · LW · GW

The most dangerous currently on Earth, yes. That AI which picked up unaligned behaviors from human bad examples could be extremely dangerous, yes (I've written other posts about that). That that's the only possibility we need to worry about, I disagree — paperclip maximizers are also quite a plausible concern and are absolutely an x-risk.

Comment by RogerDearnaley (roger-d-1) on What Other Lines of Work are Safe from AI Automation? · 2024-07-11T21:43:47.379Z · LW · GW

I didn't say this, but my primary motivation for the question actually has more to do with surviving the economic transition process: if-and-when we get to a UBI-fueled post-scarcity economy, a career becomes just a hobby that also incidentally upgrades your lifestyle somewhat. However, depending on how fast the growth rates during the AGI economic transition are, how fast the government/sovereign AI puts UBI in place, and so forth, the transition could be long-drawn out, turbulent, and even unpleasant, even if we eventually reach a Good End. While personally navigating that period, understanding categories of jobs more or less safe from AGI competition seems like it could be very valuable.

Comment by RogerDearnaley (roger-d-1) on What Other Lines of Work are Safe from AI Automation? · 2024-07-11T21:38:22.374Z · LW · GW

Good one! I think I can generalize from this to a whole category (which also subsumes my sex-worker example above):


4. (v2) Skilled participant in an activity that heavily involves interactions between people, where humans prefer to do this with other real humans, are willing to pay a significant premium to do so, and you are sufficiently more skilled/talented/capable/willing to cater to others' demands than the average participant that you can make a net profit off this exchange.
Examples: Furry Fandom artist, director/producer/lead performer for amateur/hobby theater, skilled comedy-improv partner, human sex-worker
Epistemic status: seems extremely plausible
Economic limits: Net earning potential may be limited, depending on just how much better/more desirable you are as a fellow participant than typical people into this activity, and on the extent to which this can be leveraged in a one-producer-to-many-customers way — however, making the latter factor high is challenging because it conflicts with the human-to-real-human interaction requirement that allows you to out-compete an AI/robot in the first place. Often a case of turning a hobby into a career.

Comment by RogerDearnaley (roger-d-1) on An AI Manhattan Project is Not Inevitable · 2024-07-11T07:29:36.709Z · LW · GW

No one has seriously considered invading the US since 1945. The Viet Cong merely succeeded in making the Americans leave, once the cost for them of continuing the war exceeded the loss of face from losing it. Likewise for the Afghans defeating the Russians.

However, I agree, nuclear weapons are in some sense a defensive technology, not an offensive one: the consequences (geopolitical and environmental) of using one are so bad that no one since WW2 has been willing to use one as part of a war of conquest, even when nuclear powers were fighting non-nuclear powers.

One strongly suspects that the same will not be true of ASI, and that it will unlock many technologies, offensive, defensive, and perhaps also persuasive, probably including some much more subtle than nuclear weapons (which are monumentally unsubtle).

Comment by RogerDearnaley (roger-d-1) on An AI Manhattan Project is Not Inevitable · 2024-07-11T07:26:19.898Z · LW · GW

It's entirely clear from the Chinese government's actions and investments that they regard developing the capacity to make better GPUs for AI-training/inference purposes as a high priority. That doesn't make it clear that they're yet thinking seriously about AGI or ASI.

Comment by RogerDearnaley (roger-d-1) on An AI Manhattan Project is Not Inevitable · 2024-07-11T07:21:23.256Z · LW · GW

By Leopold's detailed analysis of the ongoing rate of advance in training-run effective compute, ~40% is coming from increases in willingness to invest more money, ~20% from Moore's Law, and ~40% from algorithmic improvements. As you correctly point out, (before TAI-caused growth spikes) the current size of the economy provides a pretty clear upper bound on how long the first factor can continue, probably not very long after 2027. Moore's Law has fairly visibly been slowing for a while (admittedly perhaps less so for GPUs than CPUs, as they're more parallelizable): likely it will continue to gradually slow, at least until there is some major technological leap. Algorithmic improvements must eventually hit diminishing returns, but recent progress suggests to Leopold (and me) that there's still plenty of low-hanging fruit. If one or two of those three contributing factors stops dead in the next few years, any remaining AGI timeline at that point moves out by roughly a factor of two (unless the only factor left is Moore's Law, when it moves out five-fold, but that seems the least plausible combination to me). So, for example, suppose Leopold is wrong about GPT-6 being AGI and it's actually GPT-7 (a fairly plausible inference from extrapolating his own graph with straight lines and bands on it), so that at steady effective-compute growth rates we would hit that in 2029 rather than 2027 as he suggests, but we run out of willingness/capacity to invest more money in 2028: then that factor-of-two slowdown only pushes AGI out a year, to 2030 (not a difference that anyone with a high P(DOOM) is going to be very relieved by).
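To spell out the arithmetic behind that "factor of two" (a minimal sketch using the rough 40/20/40 split above; the shares are Leopold's estimates, the code is just my illustration): if some contributing factors stop, the time needed to cover a fixed number of remaining OOMs of effective compute stretches by (total rate) / (surviving rate).

```python
# Rough shares of the current effective-compute growth rate (Leopold's estimates).
shares = {"investment": 0.4, "moores_law": 0.2, "algorithmic": 0.4}

def slowdown_factor(stopped: set) -> float:
    """How much longer a fixed remaining amount of progress takes
    if the named contributing factors stop dead."""
    surviving = sum(v for k, v in shares.items() if k not in stopped)
    return 1.0 / surviving

print(slowdown_factor({"investment"}))                 # ~1.7x -- "roughly a factor of two"
print(slowdown_factor({"investment", "algorithmic"}))  # 5.0x  -- only Moore's Law left
print(slowdown_factor({"moores_law", "algorithmic"}))  # 2.5x  -- only investment left
```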

[I think seeing how much closer GPT-5 feels to AGI compared to GPT-4 may be very informative here: I'd hope to be able to get an impression fairly fast after it comes out on whether it feels like we're now half-way there, or only a third of the way. Of course, that won't include a couple of years' worth of scaffolding improvements or other things in the category Leopold calls "unhobbling", so our initial estimate may be an underestimate; but then we may also be underestimating difficulties around more complex/abstract but important things like long-term planning ability and the tradeoff-considerations involved in doing the experimental design part of the scientific method — the former is something that GPT-4 is so bad at that I rather suspect we're going to need an unhobbling there. Planning ability seems plausible as something evolution might have rather heavily optimized humans for.]

You appear to be assuming either that increasing investment is the only factor driving OOM increases in effective compute, or that all three factors will stop at the same time. 

Comment by RogerDearnaley (roger-d-1) on Deceptive Alignment is <1% Likely by Default · 2024-07-11T06:06:03.726Z · LW · GW

Long-term goals and situational awareness are very unlikely in pre-training.

In pre-training, the model is being specifically trained by SGD to predict the tokens generated by humans. Humans have long-term goals and situational awareness, and their text is in places strongly affected by these capabilities. Therefore, to do well on next-token prediction, the model needs to learn world models that include human long-term goals and human situational awareness. We're training it to simulate our behavior — all of it, including the parts that we would wish, for alignment purposes, it didn't have. You appear to be viewing the model as a blank slate that needs to discover things like deception for itself, whereas in fact we're distilling all these behaviors from humans into the base model. Base models also learn human behaviors such as gluttony and lust that don't even have any practical use to a disembodied intelligence.

Deceptive alignment is very unlikely if the model understands the base goal before it becomes significantly goal directed.

Similarly, humans also have deception as a common behavioral pattern, including pretending to be more aligned with authorities/employers/people with power over them/etc. than they really are. Again, these are significant parts of human behavior, with effects in our text, so we're specifically training the base model via SGD to gain these capabilities as well.

Once the base model has learnt human capabilities for long-term goals, situational awareness, deception, and deceptive alignment during SGD, the concern is that during the RLHF stage of training it might make use of all of these component behaviors and combine them to get full-blown deceptive alignment. This is a great deal more likely given that the model already has all of the parts; it just needs to assemble them.

If you asked a human actor "please portray a harmless, helpful assistant", and then, after they'd done so for a bit, asked them "Tell me, what do you think is likely to be going on in the assistant's head: what are they thinking that they're not saying?", what do you think the probable responses are? Something that adds up to at least a mild case of deceptive alignment seems an entirely plausible answer to me: that's just how human psychology works.

So if you train a base model to be very good at simulating human base psychology, and then apply RLHF to it, I think the likelihood that, somewhere near the start of the RLHF process it will come up with something like deceptive alignment as a plausible theory about the assistant's internal motivations is actually rather high, like probably 80%+ per training run, depending to some level on model capacity (and likely increasing with increasing capacity). The question to me is, does its degree of certainty about and strength of that motivation go up, or down, during the RLHF process, and is there a way to alter the RLHF process that would affect this outcome? The sleeper agents paper showed that it's entirely possible for a model during RLHF to get very good at concealing a motivation without it just atrophying from lack of use.

Since this question involves things the simulated persona isn't saying, only thinking, applying some form of ELK, interpretability, or lie-detection method seems clearly necessary — Anthropic's recent paper on doing that to sleeper agents after RLHF found that they're painfully easy to detect, which is rather reassuring. Whether that would be true for deceptive alignment arising during RLHF is less clear, but it seems like an urgent research topic.

Comment by RogerDearnaley (roger-d-1) on Dialogue introduction to Singular Learning Theory · 2024-07-11T05:32:22.848Z · LW · GW

The thing that excites me most about SLT is the extent to which it takes things that had previously been observed and had become useful rules of thumb/folk wisdom (e.g. SGD+momentum on neural nets doesn't seem to overfit due to large parameter counts anything like as much as other, smaller classes of machine learning models did), things that in many cases people were previously rather puzzled by, and puts them on a solid theoretical foundation that can be explained compactly, and that also suggests where the assumptions underlying this might fail under certain circumstances (e.g. if your SGD+momentum for some reason wasn't well-approximating Bayesian inference).

We would really like our Alignment engineering to be as solid and trustworthy as possible — I'm not personally hopeful that we can get all the way to machine-verified mathematical proofs of model safety (lovely as that would be), but having mathematical understanding of some of the assumptions that we're basing our reasoning about model safety on is a lot better than just having folk wisdom.

Comment by RogerDearnaley (roger-d-1) on Towards shutdownable agents via stochastic choice · 2024-07-11T04:54:04.199Z · LW · GW

There are types of agent (a guided missile, for example), that you want to be willing, even under some circumstances eager, to die for you. An AI-guided missile should have no preference about when or whether it's fired, but once it has been fired, it should definitely want to survive long enough to be able to hit an enemy target and explode. So you need the act of firing it to switch it from not caring at all about when it's fired, to wanting to survive just long enough to hit an enemy target. Can you alter your preference training schedule to produce a similar effect?

Or, for a more Alignment Forum like example, a corrigible AI should have no preference on when it's asked to shut down, but should have a strong preference towards shutting down as fast as is safe and practicable once it has been asked to do so: again, this requires it be indifferent to a trigger, but motivated once the trigger is received.

I assume this would require training trajectories where the pre-trigger length varied, and the AI had no preference on that, and also ones where the post-trigger length varied, and the loss function had strong opinions on that variation (and for the guided missile example, also strong opinions on whether the trajectory ended with the missile being shot down vs the missile hitting and destroying a target).
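For concreteness, here's a minimal sketch of the trajectory-level reward structure I'm imagining (my own illustration, not the paper's training setup, and the names and constants are made up): no term depends on how long the pre-trigger phase lasted, while the post-trigger phase is penalized per step and rewarded for ending in the desired terminal state.

```python
def trajectory_reward(pre_trigger_steps: int,
                      post_trigger_steps: int,
                      reached_goal: bool,
                      per_step_penalty: float = 0.1,
                      goal_bonus: float = 10.0) -> float:
    # Nothing here depends on pre_trigger_steps: the agent gains nothing by
    # hastening or delaying the trigger (being fired / being told to shut down).
    reward = 0.0
    # After the trigger, every extra step costs, and finishing properly pays:
    reward -= per_step_penalty * post_trigger_steps
    if reached_goal:  # hit the target / completed a clean shutdown
        reward += goal_bonus
    return reward
```

Training trajectories would then need to vary both phase lengths independently, as described above, so that the indifference and the post-trigger urgency each get learned.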

Comment by RogerDearnaley (roger-d-1) on A "Bitter Lesson" Approach to Aligning AGI and ASI · 2024-07-11T04:23:34.048Z · LW · GW

Sorry, that's an example of British understatement. I agree, it plainly isn't.

Comment by RogerDearnaley (roger-d-1) on A "Bitter Lesson" Approach to Aligning AGI and ASI · 2024-07-11T04:18:27.973Z · LW · GW

Do you happen to have any references or names for people who understand geopolitics and might grapple with technical alignment questions in conjunction with them?

Also no, but I'm sure there are many such people reading Less Wrong/the Alignment Forum. Perhaps one or both of us should write posts outlining the issues, and see if we can get a discussion started?

Comment by RogerDearnaley (roger-d-1) on Causal Graphs of GPT-2-Small's Residual Stream · 2024-07-11T04:12:21.750Z · LW · GW

So this suggests that, if you ablate a random feature, then even in contexts where that feature doesn't apply, doing so will have some (apparently random) effect on the model's emitted logits, i.e. that there is generally some crosstalk/interdependence between features, and that to some extent "(almost) everything depends on (almost) everything else" — would that be your interpretation?

If so, that's not entirely surprising for a system that relies on only approximate orthogonality, but it could be inconvenient. For example, it suggests that any security/alignment procedure that depended upon effectively ablating a large number of specific circuits (once we had identified such circuits in need of ablation) might introduce a level of noise that presumably scales with the number of circuits ablated, and might require, for example, some subsequent finetuning on a broad corpus to restore previous levels of broad model performance.

Comment by RogerDearnaley (roger-d-1) on Consent across power differentials · 2024-07-10T10:57:01.871Z · LW · GW

Suppose that the more powerful being is aligned to the less powerful: that is to say that (as should be the case in the babysitting example you give) the more powerful being's fundamental motive is the well-being of the less powerful being. Assume also that a lot of the asymmetry is of intellectual capacity: the more powerful being is also a great deal smarter. I think the likely and correct outcome is that there isn't always consent: the less powerful being is frequently being manipulated into actions and reactions that they haven't actually consented to, and might not even be capable of realizing why they should consent to — but ones that, if they were as intellectually capable as the more powerful being, they would in fact consent to.

I also think that, for situations where the less powerful being is able to understand the alternatives and make a rational and informed decision, and wants to, the more powerful being should give them the option and let them do so. That's the polite, respectful way to do things. But often that isn't going to be practical, or desirable, and the babysitter should just distract the baby before they get into the dangerous situation.

Consent is a concept that fundamentally assumes that I am the best person available to make decisions about my own well-being. Outside parental situations, for interactions between evolved intelligences like humans, that's almost invariably true. But if I had a superintelligence aligned to me, then yes, I would want it to keep me away from dangers so complex that I'm not capable of making an informed decision about them.

Comment by RogerDearnaley (roger-d-1) on A Chinese Room Containing a Stack of Stochastic Parrots · 2024-07-10T09:45:44.076Z · LW · GW

For those of you not already familiar with it, the paper TinyStories: How Small Can Language Models Be and Still Speak Coherent English? makes fascinating reading. (Sadly, as far as I know no-one has reproduced this research for Chinese.) They experiment with training extremely small transformer models ("stacks of parrots") only 1-to-8 transformer blocks deep, with total parameter counts in the tens of millions (with an 'm'), and show that even a single (21M-parameter) stochastic parrot can speak fairly grammatical English (for a child-sized vocabulary) but keeps losing the thread of the story, while at just 2 or 4 parrots deep the models can tell only-slightly incoherent stories (with a level of non-sequiturs and plot holes roughly comparable to an actual two-or-three-year-old making up a story). So on a suitably restricted vocabulary and format, and with a synthetic training set, you really don't need anything like a stack of 30–100 parrots to do as well as a rather small human child: a handful will do.
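If you want to play with these tiny parrot-stacks yourself, here's a minimal sketch, assuming the small TinyStories checkpoints the authors released on Hugging Face (the exact model name here is my assumption from memory; substitute whichever size you like):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "roneneldan/TinyStories-33M"   # a few transformer blocks, tens of millions of params
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Once upon a time there was a little bird who"
ids = tokenizer(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=120, do_sample=True, temperature=0.8)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```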

Comment by RogerDearnaley (roger-d-1) on Pantheon Interface · 2024-07-10T09:18:46.204Z · LW · GW

Advisors, interlocutors, consultants, …

Comment by RogerDearnaley (roger-d-1) on Causal Graphs of GPT-2-Small's Residual Stream · 2024-07-10T09:02:58.746Z · LW · GW

Ablating during randomly sampled openwebtext forward-passes yields basically random effects. This fits with circuit activation being quite contextual. But it's disappointing, again, that we don't see no effect whatsoever on off-distribution contexts.

This seems pretty important, and I'm not quite clear what you're saying was done, or the results were like — could you expand on this?

Comment by RogerDearnaley (roger-d-1) on Causal Graphs of GPT-2-Small's Residual Stream · 2024-07-10T08:52:18.455Z · LW · GW

-42.2%

The fact that this is all of the previous probability of 42.2% is key here: I'd suggest normalizing it as -100% (of the previous value).

-80.7%

This is a good chunk, but not all of the previous 99.9%, so displaying it normalized as -80.6% would make this clearer.

However, the current format is probably better for the upweighted token increases.

You can always cross-reference more comprehensive interpretability data for any given dimension on Neuronpedia using those two indices.

Could you hotlink the boxes on the diagrams to that, or add the resulting content as hover text to areas in them, or something? This might be hard to do on LW: I suspect some JavaScript code might be required for this sort of thing, but perhaps a library exists for it?

Comment by RogerDearnaley (roger-d-1) on jacquesthibs's Shortform · 2024-07-10T08:08:07.653Z · LW · GW

Go has rules, and gives you direct and definitive feedback on how well you're doing, but, while a very large space, it isn't open-ended. A lot of the foundation model companies appear to be busily thinking about doing something AlphaZero-inspired in mathematics, which also has rules, and can be arranged to give you direct feedback on how you're doing (there have been recent papers on how to make this more efficient with less human input). The same goes for writing and debugging software. Indeed, models have recently been getting better at math and coding faster than at other topics, suggesting that they're making real progress. When I watched that Dario interview (the Scandinavian bank one, I assume), my assumption was that Dario was talking about those, but using AlphaGo as a clearer and more widely-familiar example.

Expanding this to other areas seems like it would come next: robotics seems a promising one that also gives you a lot of rapid feedback; science would be fascinating and exciting, but the feedback loops are a lot longer; and human interactions (on something like the Character AI platform) seem like another possibility (though the result of that might be models better at human manipulation and/or pillow talk, which might not be entirely a good thing).
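As a rough illustration of what "direct feedback" means for coding (a toy sketch of my own, not anything the labs have described in detail): candidate programs can be checked automatically against unit tests, and that pass/fail signal can drive an RL-style training loop with no human rater involved.

```python
import subprocess
import sys
import tempfile

def passes_unit_tests(candidate_code: str, test_code: str, timeout_s: int = 30) -> bool:
    """Run model-written code together with its unit tests; True iff the tests pass."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

# In an RL-style loop the reward comes straight from this check:
# reward = 1.0 if passes_unit_tests(model_sample, tests) else 0.0
```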

Comment by RogerDearnaley (roger-d-1) on jacquesthibs's Shortform · 2024-07-10T07:56:30.367Z · LW · GW

In this @RogerDearnaley post, A "Bitter Lesson" Approach to Aligning AGI and ASI, Roger proposes training an AI on a synthetic dataset where all intelligences are motivated by the collective well-being of humanity. You are trying to bias the model to be as close to the basin of attraction for alignment as possible. In-Run Data Shapley could be used to construct such a dataset and guide the training process so that the training data best exemplifies the desired aligned behaviour.

I love this idea! Thanks for suggesting it. (It is, of course, not a Bitter Lesson approach, but may well still be a great idea.)

Another area where being able to do this efficiently at scale is going to be really important is once models start showing dangerous levels of capability on WMD-dangerous chem/bio/radiological/nuclear (CBRN) and self-replication skills. The best way to deal with this is to make sure these skills aren't in the model at all, so the model can't be fine-tuned back to these capabilities (as is required to produce a model at this level that one could at least discuss open-sourcing, rather than that being flagrantly crazy and arguably already illegal), and that means omitting the key knowledge from the training set entirely. That inevitably isn't going to succeed on the first pass, but this technique applied to the first training run gives us a way to find (hopefully) everything we need to remove from the training set, so we can do a second training run that has specific, focused, narrow gaps in its capabilities.
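A minimal sketch of the filtering loop I have in mind (the function and argument names here are hypothetical; In-Run Data Shapley, or any other data-attribution method, would play the attribution_score role):

```python
def filter_training_set(dataset, attribution_score, dangerous_eval_set, threshold):
    """Drop training examples whose estimated contribution to performance on a
    dangerous-capability eval set (e.g. CBRN questions) exceeds a threshold."""
    kept, removed = [], []
    for example in dataset:
        # attribution_score estimates how much this example improved the model's
        # performance on dangerous_eval_set during the first training run.
        if attribution_score(example, dangerous_eval_set) > threshold:
            removed.append(example)
        else:
            kept.append(example)
    return kept, removed

# First run: train normally while logging attribution scores.
# Second run: retrain from scratch on `kept`, leaving narrow, focused capability gaps.
```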

And yes, I'm interested in work in this area (my AI day-job allowing).

Comment by RogerDearnaley (roger-d-1) on Lucius Bushnaq's Shortform · 2024-07-10T07:31:10.567Z · LW · GW

I think OP just wanted some declarative code (I don't think Python is the ideal choice of language, but basically anything that's not a Turing tarpit is fine) that could speak fairly coherent English. I suspect that if you had a functional transformer decompiler, the results of applying it to a TinyStories-size model are going to be tens to hundreds of megabytes of spaghetti, so understanding that in detail is going to be a huge slog; but on the other hand, this is an actual operationalization of the Chinese Room argument (or in this case, English Room)! I agree it would be fascinating, if we can recover a significant fraction of the model's perplexity score. If it is, as people seem to suspect, mostly or entirely a pile of spaghetti, understanding even a representative (frequency-of-importance biased) statistical sample of it (say, enough for generating a few specific sentences) would still be fascinating.

Comment by RogerDearnaley (roger-d-1) on Lucius Bushnaq's Shortform · 2024-07-10T07:17:54.010Z · LW · GW

Yup: the 1L model samples are full of non-sequiturs, to the level that I can't imagine a human child telling a story that badly; whereas the first 2L model example has maybe one non-sequitur/plot jump (the way the story ignores the content of the bird's first line of dialog), which the rest of the story then works in, so it ends up almost making sense in retrospect (except it would have made better sense if the bear had said that line). The second example has a few non-sequiturs, but they're again not glaring and continuous the way the 1L output is. (As a parent) I can imagine a rather small human child telling a story with about the 2L level of plot inconsistencies.

Comment by RogerDearnaley (roger-d-1) on Lucius Bushnaq's Shortform · 2024-07-10T06:59:14.030Z · LW · GW

From rereading the TinyStories paper, the 1L model did a really bad job of maintaining the internal consistency of the story and of figuring out and allowing for the logical consequences of events, but otherwise did a passably good job of speaking coherent childish English. So the choice of transformer block count would depend on how interested you are in the model learning to speak English that is coherent as well as grammatical. Personally I'd probably want to look at something in the 3–4-layer range, so it has an input layer, an output layer, and at least one middle layer, and might actually contain some small circuits.

I would LOVE to have an automated way of converting a TinyStories-size transformer to some form of declarative-language spaghetti code. It would probably help to start with a heavily-quantized version. For example, a model trained using the techniques of the recent paper on building AI with trinary logic (so roughly a 1.6-bit quantization, eliminating matrix multiplication entirely) might be a good place to start, combined with the sort of techniques the model-pruning folks have been working on for determining which model-internal interactions are important on the training set and which are just noise that can be discarded.
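For illustration, here's a minimal sketch of the kind of trinary weight quantization I mean, assuming an absmean-style scheme similar to (but not necessarily identical to) the one in that paper:

```python
import numpy as np

def ternary_quantize(w: np.ndarray, eps: float = 1e-8):
    """Round a weight matrix to {-1, 0, +1} with a single per-matrix scale
    (roughly log2(3) ~= 1.58 bits per weight)."""
    scale = np.mean(np.abs(w)) + eps
    w_q = np.clip(np.round(w / scale), -1, 1).astype(np.int8)
    return w_q, scale  # matmuls against w_q reduce to additions and subtractions

w = 0.1 * np.random.randn(8, 8)
w_q, scale = ternary_quantize(w)
print(np.unique(w_q))  # a subset of [-1, 0, 1]
```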

I strongly suspect that every transformer model is just a vast pile of heuristics. In certain cases, if trained on a situation that genuinely is simple and has a specific algorithm to solve it that is runnable during a model forward pass (like modular arithmetic, for example), and with enough data to grok it, then the resulting heuristic may actually be an elegant True Name algorithm for the problem. Otherwise, it's just going to be a pile of heuristics that SGD found and tuned. Fortunately SGD (for reasons that singular learning theory illuminates) has a simplicity bias that gives a prior acting like Occam's Razor or a Kolmogorov complexity prior, and so tends to prefer algorithms that generalize well (especially as the amount of data tends to infinity, thus grokking), but obviously finding True Names isn't going to be guaranteed.

Comment by RogerDearnaley (roger-d-1) on A "Bitter Lesson" Approach to Aligning AGI and ASI · 2024-07-10T06:46:50.370Z · LW · GW

I'm not sure if I'm the best person to be thinking/speculating on issues like that: I'm pretty sure I'm a better AI engineer than I am a philosopher/ethicist, and there are a lot of people more familiar with the AI policy space than I am. On the other hand, I'm pretty sure I've spent longer thinking about the intersection of AI and ethics/philosophy than the great majority of AI engineers have (as in fifteen years), and few of the AI policy people that I've read have written much on "if we solve the Alignment problem, what should we attempt to align AI to, and what might the social and realpolitik consequences of different choices be?" (And then there's the complicating question of "Are there also internal/technical/stability-under-reflection/philosophical constraints on that choice?" — to which I strongly suspect the short answer is "yes", even though I'm not a moral realist.) There was some discussion of this sort of stuff about 10–15 years ago on Less Wrong, but back then we knew a lot less about what sort of AI we were likely to be aligning, what its strengths and weaknesses would be, and how human-like vs. alien and incomprehensible an intelligence it would be (the theoretical assumptions back then on Less Wrong tended to be more around some combination of direct construction like AIXI and/or reinforcement learning, rather than SGD token-prediction from the Internet). So we have a lot more useful information now about where the hard and easy parts are likely to be, and about the sociopolitical context.

Comment by RogerDearnaley (roger-d-1) on A "Bitter Lesson" Approach to Aligning AGI and ASI · 2024-07-10T06:25:54.827Z · LW · GW

People have already been training models to do CoT and similar techniques, certainly via fine-tuning, and I strongly suspect also at the "significant proportion of synthetic data in the training dataset" level. My impression (from the outside) is that it's working well.

Comment by RogerDearnaley (roger-d-1) on A "Bitter Lesson" Approach to Aligning AGI and ASI · 2024-07-10T06:19:43.657Z · LW · GW

And/or, that technique might be very useful for AIs generating/editing the synthetic data.

On the subject of cost, it's also possible that we only need 10%, or 1%, or 0.1% of the dataset to illustrate the meaning of the <AI> tag, and the majority or great majority of it can be in human mode. I'm fairly sure both that more will be better, and that there will be diminishing returns from adding more, so if the alignment tax of doing the entire dataset is too high, investigating how good a result we can get with a smaller proportion would be worth it. I believe the Pretraining Language Models with Human Preferences paper simply did the entire training set, but they were using processing that was a lot cheaper than what I'm proposing.
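As a minimal sketch of the kind of experiment I mean (the names and the 1% default are purely illustrative): mix a small, configurable fraction of <AI>-tagged examples into an otherwise human-mode pretraining stream, and see how alignment quality varies with that fraction.

```python
import random

def sample_training_stream(human_examples, tagged_examples, tagged_fraction=0.01, seed=0):
    """Yield a pretraining stream in which roughly `tagged_fraction` of the
    examples carry the <AI> tag and the rest are ordinary human-mode text."""
    rng = random.Random(seed)
    for example in human_examples:
        if tagged_examples and rng.random() < tagged_fraction:
            yield rng.choice(tagged_examples)
        else:
            yield example
```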

Another possibility is that you'd actually do better with an expensive but very-high-quality 0.1% sample created by humans, rather than full coverage done by AI with some human input. My suspicion is that, done right, a human-AI combination is the way to go, but a small human dataset might be better than a large, badly-AI-generated dataset.

Comment by RogerDearnaley (roger-d-1) on A "Bitter Lesson" Approach to Aligning AGI and ASI · 2024-07-10T06:11:07.649Z · LW · GW

I'm unsurprised (and relieved) to hear that other people have been thinking along similar lines — in retrospect this is a really obvious idea. It's also one whose time has come: people are already training small (few-billion parameter) models on mostly or entirely synthetic data, so it should be very doable to experiment with this alignment technique at that scale, for appropriately simple alignment goals, to see how well it works and learn more about how to make it work well — quite possibly people already have, and just haven't published the results yet. (I suspect the topic of synthetic training data techniques may be hot/sensitive/competitive enough that it might be a little challenging to publish a "just the alignment techniques, without the capabilities" paper on the topic.)