Posts

Is AlphaGo actually a consequentialist utility maximizer? 2023-12-07T12:41:05.132Z
faul_sname's Shortform 2023-12-03T09:39:10.782Z
Regression To The Mean [Draft][Request for Feedback] 2012-06-22T17:55:51.917Z
The Dark Arts: A Beginner's Guide 2012-01-21T07:05:05.264Z
What would you do with a financial safety net? 2012-01-16T23:38:18.978Z

Comments

Comment by faul_sname on Modern Transformers are AGI, and Human-Level · 2024-03-26T20:48:38.555Z · LW · GW

If an AI doesn’t fully ‘understand’ the physics concept of “superradiance” based on all existing human writing, how would it generate synthetic data to get better?

I think "doesn't fully understand the concept of superradiance" is a phrase that smuggles in too many assumptions here. If you rephrase it as "can determine when superradiance will occur, but makes inaccurate predictions about physical systems will do in those situations" / "makes imprecise predictions in such cases" / "has trouble distinguishing cases where superradiance will occur vs cases where it will not", all of those suggest pretty obvious ways of generating training data.

GPT-4 can already "figure out a new system on the fly" in the sense of taking some repeatable phenomenon it can observe, and predicting things about that phenomenon, because it can write standard machine learning pipelines, design APIs with documentation, and interact with documented APIs. However, the process of doing that is very slow and expensive, and resembles "build a tool and then use the tool" rather than "augment its own native intelligence".

Which makes sense. The story of human capability advances doesn't look like "find clever ways to configure unprocessed rocks and branches from the environment in ways which accomplish our goals"; it looks like "build a bunch of tools, figure out which ones are most useful and how they are best used, use our best tools to build better tools, and so on, and then use the much-improved tools to do the things we want".

Comment by faul_sname on What could a policy banning AGI look like? · 2024-03-14T04:03:53.743Z · LW · GW

I think you get very different answers depending on whether your question is "what is an example of a policy that makes it illegal in the United States to do research with the explicit intent of creating AGI" or whether it is "what is an example of a policy that results in nobody, including intelligence agencies, doing AI research that could lead to AGI, anywhere in the world".

For the former, something like updates to export administration regulations could maybe make it de-facto illegal to develop AI aimed at the international market. Historically, that approach was successful for a while at making it illegal to intentionally export software which implemented strong encryption. It didn't actually prevent the export, but it did arguably make that export unlawful. I'd recommend reading that article in full, actually, to give you an idea of how "what the law says" and "what ends up happening" can diverge.

Comment by faul_sname on TurnTrout's shortform feed · 2024-03-05T00:09:32.397Z · LW · GW

I think the answer to the question of how well realistic NN-like systems with finite compute approximate the results of hypothetical utility maximizers with infinite compute is "not very well at all".

So the MIRI train of thought, as I understand it, goes something like

  1. You cannot predict the specific moves that a superhuman chess-playing AI will make, but you can predict that the final board state will be one in which the chess-playing AI has won.
  2. The chess AI is able to do this because it can accurately predict the likely outcomes of its own actions, and so is able to compute the utility of each of its possible actions and then effectively do an argmax over them to pick the best one, which results in the best outcome according to its utility function.
  3. Similarly, you will not be able to predict the specific actions that a "sufficiently powerful" utility maximizer will make, but you can predict that its utility function will be maximized.
  4. For most utility functions about things in the real world, the configuration of matter that maximizes that utility function is not a configuration of matter that supports human life.
  5. Actual future AI systems that will show up in the real world in the next few decades will be "sufficiently powerful" utility maximizers, and so this is a useful and predictive model of what the near future will look like.

I think the last few years in ML have made points 2 and 5 look particularly shaky here. For example, the actual architecture of the SOTA chess-playing systems doesn't particularly resemble a cheaper version of the optimal-with-infinite-compute thing of "minimax over tree search", but instead seems to be a different thing: "pile a bunch of situation-specific heuristics on top of each other, and then tweak the heuristics based on how well they do in practice".

Which, for me at least, suggests that looking at what the optimal-with-infinite-compute thing would do might not be very informative for what actual systems which will show up in the next few decades will do.

Comment by faul_sname on Bogdan Ionut Cirstea's Shortform · 2024-02-27T19:16:46.607Z · LW · GW

Can you give a concrete example of a safety property of the sort that are you envisioning automated testing for? Or am I misunderstanding what you're hoping to see?

Comment by faul_sname on What experiment settles the Gary Marcus vs Geoffrey Hinton debate? · 2024-02-16T00:30:26.242Z · LW · GW

For example a human can to an extent inspect what they are going to say before they say or write it. Before saying Gary Marcus was "inspired by his pet chicken, Henrietta" a human may temporarily store the next words they plan to say elsewhere in the brain, and evaluate it.

Transformer-based models also internally represent the tokens they are likely to emit in future steps. This is demonstrated rigorously in Future Lens: Anticipating Subsequent Tokens from a Single Hidden State, though perhaps the simpler demonstration is simply that LLMs can reliably complete the sentence "Alice likes apples, Bob likes bananas, and Aaron likes apricots, so when I went to the store I bought Alice an apple and I got [Bob/Aaron]" with the appropriate "a/an" token.

Comment by faul_sname on Will quantum randomness affect the 2028 election? · 2024-01-25T22:58:30.572Z · LW · GW

I think the answer pretty much has to be "yes", for the following reasons.

  1. As noted in the above post, weather is chaotic.
  2. Elections are sometimes close. For example, the winner of the 2000 presidential election came down to a margin of 537 votes in Florida.
  3. Geographic location correlates reasonably strongly with party preference.
  4. Weather affects specific geographic areas.
  5. Weather influences voter turnout[1] --

During the 2000 election, in Okaloosa County, Florida (at the western tip of the panhandle), 71k of the county's 171k residents voted, with 52186 votes going to Bush and 16989 votes going to Gore, for a 42% turnout rate.

On November 7, 2000, there was no significant rainfall in Pensacola (which is the closest weather station I could find with records going back that far). A storm which dropped 2 inches of rain on the tip of the Florida panhandle that day would have reduced voter turnout by 1.8%,[1] which would have shifted the margin about 634 votes toward Gore. Which would have tipped Florida, which would in turn have tipped the election.
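
To spell out the arithmetic behind that 634 figure (a quick sketch; it treats the 1.8% turnout reduction as removing Bush and Gore voters in proportion to their actual totals):

    // Back-of-the-envelope for the 634-vote margin shift.
    var bush = 52186, gore = 16989;                       // Okaloosa County, 2000
    var turnoutReduction = 0.009 * 2;                     // ~0.9% per extra inch of rain (footnote 1), times 2 inches
    var marginShift = (bush - gore) * turnoutReduction;   // ≈ 633.5 fewer votes of margin for Bush
    console.log(Math.round(marginShift));                 // 634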

Now, November is the "dry" season in Florida, so heavy rains like that are not incredibly common. Still, they can happen. For example, on 2015-11-02, 2.34 inches of rain fell.[2] That was only one day, out of the 140 days I looked at, which would have flipped the 2000 election, and the 2000 election was, to my knowledge, the closest of the 59 US presidential elections so far. Still, there are a number of other tracks that a storm could have taken, which would also have flipped the 2000 election.[3] And in the 1976 election, somewhat worse weather in the great lakes region would likely have flipped Ohio and Wisconsin, where Carter beat Ford by narrow margins.[4]

So I think "weather, on election day specifically, flips the 2028 election in a way that cannot be foreseen now" is already well over 0.1%. And that's not even getting into other weather stuff like "how many hurricanes hit the gulf coast in 2028, and where exactly do they land?".

  1. ^

    Gomez, B. T., Hansford, T. G., & Krause, G. A. (2007). The Republicans should pray for rain: Weather, turnout, and voting in US presidential elections.

    "The results indicate that if a county experiences an inch of rain more than what is normal for the county for that election date, the percentage of the voting age population that turns out to vote decreases by approximately .9%.".

  2. ^

    I pulled the weather for the week before and after November 7 for each of the past 10 years from the api.weather.com historical observations API (code below), and that was the highest rainfall date.

    // Pull the observations for Nov 1-14 of each year 2014-2023 from the api.weather.com
    // historical endpoint for station KPNS (Pensacola), add up precip_total per calendar
    // date, and report the wettest date.
    var precipByDate = {}
    for (var y = 2014; y < 2024; y++) {
       var res = await fetch('https://api.weather.com/v1/location/KPNS:9:US/observations/historical.json?apiKey=<redacted>&units=e&startDate='+y+'1101&endDate='+y+'1114').then(r => r.json());
       res.observations.forEach(obs => {
           var d = new Date(obs.valid_time_gmt*1000);                      // observation timestamp (seconds -> ms)
           var ds = d.getFullYear()+'-'+(d.getMonth()+1)+'-'+d.getDate();  // bucket key: local calendar date
           if (!(ds in precipByDate)) { precipByDate[ds] = 0; }
           if (obs.precip_total) { precipByDate[ds] += obs.precip_total }  // skip observations with no precip value
       });
    }
    // Highest-precipitation [date, inches] pair across the 140 days pulled
    Object.entries(precipByDate).sort((a, b) => b[1] - a[1])[0]
  3. ^

    Looking at the 2000 election map in Florida, any good thunderstorm in the panhandle, in the northeast corner of the state, or on the west-middle-south of the peninsula would have done the trick.

  4. ^

    https://en.wikipedia.org/wiki/1976_United_States_presidential_election -- Carter won Ohio and Wisconsin by 11k and 35k votes, respectively.

Comment by faul_sname on Shortform · 2024-01-25T20:13:31.656Z · LW · GW

An attorney rather than the police, I think.

Comment by faul_sname on Writer's Shortform · 2024-01-18T22:27:42.380Z · LW · GW

Also "provably safe" is a property a system can have relative to a specific threat model. Many vulnerabilities come from the engineer having an incomplete or incorrect threat model, though (most obviously the multitude of types of side-channel attack).

Comment by faul_sname on Stephen Fowler's Shortform · 2024-01-11T04:35:52.218Z · LW · GW

Counterpoint: Sydney Bing was wildly unaligned, to the extent that it is even possible for an LLM to be aligned, and people thought it was cute / cool.

Comment by faul_sname on "Dark Constitution" for constraining some superintelligences · 2024-01-11T01:37:31.288Z · LW · GW

The two examples everyone loves to use to demonstrate that massive top-down engineering projects can sometimes be a viable alternative to iterative design (the Manhattan Project and the Apollo Program) were both government-led initiatives, rather than single very smart people working alone in their garages. I think it's reasonable to conclude that governments have considerably more capacity to steer outcomes than individuals, and are the most powerful optimizers that exist at this time.

I think restricting the term "superintelligence" to "only that which can create functional self-replicators with nano-scale components" is misleading. Concretely, that definition of "superintelligence" says that natural selection is superintelligent, while the most capable groups of humans are nowhere close, even with computerized tooling.

Comment by faul_sname on niplav's Shortform · 2024-01-10T07:44:46.386Z · LW · GW

Looking at the AlphaZero paper

Our new method uses a deep neural network fθ with parameters θ. This neural network takes as an input the raw board representation s of the position and its history, and outputs both move probabilities and a value, (p, v) = fθ(s). The vector of move probabilities p represents the probability of selecting each move a (including pass), pa = Pr(a|s). The value v is a scalar evaluation, estimating the probability of the current player winning from position s. This neural network combines the roles of both policy network and value network into a single architecture. The neural network consists of many residual blocks of convolutional layers with batch normalization and rectifier nonlinearities (see Methods).

So if I'm interpreting that correctly, the NN is used both for position evaluation and for guiding the search (the policy head's move probabilities are what the tree search uses to decide which branches to explore).
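
Concretely, the way those two outputs feed into the tree search (per the AlphaGo Zero / AlphaZero papers' PUCT rule) is something like the sketch below; the variable names are mine, not DeepMind's:

    // At each node, pick the child maximizing Q + U, where U is an exploration bonus
    // scaled by the policy head's prior for that move.
    function selectChild(node, cPuct) {
        // node.children[move] = { prior: p_a from the policy head, visits: N, totalValue: W }
        var totalVisits = Object.values(node.children).reduce((sum, c) => sum + c.visits, 0);
        var best = null, bestScore = -Infinity;
        for (var move in node.children) {
            var c = node.children[move];
            var q = c.visits > 0 ? c.totalValue / c.visits : 0;                 // mean value from search so far
            var u = cPuct * c.prior * Math.sqrt(totalVisits) / (1 + c.visits);  // prior-weighted exploration bonus
            if (q + u > bestScore) { bestScore = q + u; best = move; }
        }
        return best;
    }
    // When a leaf is expanded, the value head's v is what gets backed up the tree,
    // in place of the random rollouts older MCTS programs used.

So the priors steer which branches get visits, and the value head replaces rollout evaluation, which is the sense in which the network is doing part of the search rather than just scoring final positions.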

Comment by faul_sname on AI Impacts Survey: December 2023 Edition · 2024-01-05T17:53:59.399Z · LW · GW

As I will expand upon later, this contrast makes no sense. We are not going to have machines outperforming humans on every task in 2047 and then only fully automating human occupations in 2116. Not in any meaningful sense.

Maybe people are interpreting "task" as "bounded, self-contained task", and so they're saying that machines will be able to outperform humans on every "task" but not on the parts of their jobs that are not "tasks".

The exact wording of the question was

Say we have ‘high-level machine intelligence’ when unaided machines can accomplish every task better and more cheaply than human workers. Ignore aspects of tasks for which being a human is intrinsically advantageous, e.g. being accepted as a jury member. 

It does not appear that the survey had any specific guidance on how to interpret the word "task", so it wouldn't surprise me that much if people consider their job to be composed of both things that are tasks and also things that are not tasks, and that the things that are not tasks will take longer to automate.

Comment by faul_sname on Terminology: <something>-ware for ML? · 2024-01-03T23:14:57.780Z · LW · GW
  • Gradientware? Seems verbose, and isn't robust to ML approaches that fit data without gradients.
  • Datagenicware? Captures the core of what makes them like that, but it's a mouthful.
  • Modelware? I don't love it
  • Puttyware? Aims to capture the "takes the shape of its surroundings" aspect, might be too abstract though. Also implies that it will take the shape of its current surroundings, rather than the ones it was built with
  • Resinware - maybe more evocative of the "was fit very closely to its particular surroundings", but still doesn't seem to capture quite what I want
Comment by faul_sname on The Plan - 2023 Version · 2024-01-02T23:20:51.295Z · LW · GW

When you get large, directed systems (e.g., we are composed of 40 trillion cells, each containing tens of millions of proteins), I think you basically need some level of modularity if there's any hope of steering the whole thing.

This seems basically right to me. That said, while it is predictable that the systems in question will be modular, what exact form that modularity takes is both environment-dependent and also path-dependent. Even in cases where the environmental pressures form a very strong attractor for a particular shape of solution, the "module divisions" can differ between species. For example, the pectoral fins of fish and the pectoral flippers of dolphins both fulfill similar roles. However, fish fins are basically a collection of straight, parallel fin rays made of bone or cartilage and connected to a base inside the body of the fish, and the muscles to control the movement of the fin are located within the body of the fish. By contrast, a dolphin's flipper is derived from the foreleg of its tetrapod ancestor, and contains "fingers" which can be moved by muscles within the flipper.

So I think approaches that look like "find a structure that does a particular thing, and try to shape that structure in the way you want" are somewhat (though not necessarily entirely) doomed, because the pressures that determine which particular structure does a thing are not nearly so strong as the pressures that determine that some structure does the thing.

Comment by faul_sname on The Plan - 2023 Version · 2023-12-30T01:38:54.192Z · LW · GW

Excellent post! In particular, I think "You Don’t Get To Choose The Problem Factorization" is a valuable way to crystallize a problem that comes up in a lot of different contexts.

Editing note: the link in

And if we’re not measuring what we think we are measuring, that undercuts the whole “iterative development” model.

points at a draft. Probably a draft of a very interesting post, based on the topic.

Also on the topic of that section, I do expect that if the goal was to build a really tall tower, we would want to do a bunch of testing on the individual components, but we would also want to actually build a smaller tower using the tentative plan for the big tower before starting construction of the big one. Possibly a series of smaller towers.

Comment by faul_sname on Rant on Problem Factorization for Alignment · 2023-12-30T01:12:35.352Z · LW · GW

Very late reply, reading this for the 2022 year in review.

As one example: YCombinator companies have roughly linear correlation between exit value and number of employees, and basically all companies with $100MM+ exits have >100 employees. My impression is that there are very few companies with even $1MM revenue/employee (though I don't have a data set easily available).

So there are at least two different models which both yield this observation.

The first is that there are few people who can reliably create $1MM / year of value for their company, and so companies that want to increase their revenue have no choice but to hire more people in order to increase their profits.

The second is that it is entirely possible for a small team of people to generate a money fountain which generates billions of dollars in net revenue. However, once you have such a money fountain, you can get even more money out of it by hiring more people, comparative advantage style (e.g. people to handle mandatory but low-required-skill jobs to give the money-fountain-builders more time to do their thing). At equilibrium, companies will hire employees until the marginal increase in profit is equal to the marginal cost of the employee.

My crackpot quantitative model is that the speed with which a team can create value in a single domain scales with approximately the square root of the number of people on the team (i.e. a team of 100 will create 10x as much value as a single person). Low sample size, but this has been the case in the handful of (mostly programming) projects I've been a part of as the number of people on the team fluctuated, at least for n between 1 and 100 on each project (including a project that started with 1, then grew to ~60, then dropped back down to 5).
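
As a toy illustration of how those two observations fit together, here's the "hire until marginal value equals marginal cost" calculation under my square-root guess (all numbers made up):

    // Toy marginal-hiring calculation under the sqrt(n) value-scaling guess above.
    var soloValue = 2e6;         // $/yr of value a one-person version of the team would create (made up)
    var costPerEmployee = 2e5;   // $/yr fully-loaded cost of a marginal hire (made up)
    function teamValue(n) { return soloValue * Math.sqrt(n); }
    var n = 1;
    while (teamValue(n + 1) - teamValue(n) > costPerEmployee) { n++; }
    console.log(n, teamValue(n) / n);  // ~25 employees, ~$400k revenue per employee

So even if a tiny team is a money fountain, the equilibrium headcount ends up large and the observed revenue-per-employee unimpressive, which is consistent with the YC numbers without implying that small teams can't create outsized value.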

Comment by faul_sname on The problems with the concept of an infohazard as used by the LW community [Linkpost] · 2023-12-25T01:29:29.100Z · LW · GW

So I think we might be talking past each other a bit. I don't really have a strong view on whether Shannon's work represented a major theoretical advancement. The specific thing I doubt is that Shannon's work had significant counterfactual impacts on the speed with which it became practical to do specific things with computers.

This was why I was focusing on error correcting codes. Is there some other practical task which people wanted to do before Shannon's work but were unable to do, which Shannon's work enabled, and which you believe would have taken at least 5 years longer had Shannon not done his theoretical work?

Comment by faul_sname on "Destroy humanity" as an immediate subgoal · 2023-12-23T20:01:44.861Z · LW · GW

The "make sure that future AIs are aligned with humanity" seems, to me, to be a strategy targeting the "determines humans are such entities" step of the above loss condition. But I think there are two additional stable Nash equilibria, namely "no single entity is able to obtain a strategic advantage" and "attempting to destroy anyone who could oppose you will, in expectation, leave you worse off in the long run than not doing that". If there are three I have thought of there are probably more that I haven't thought of, as well.

Comment by faul_sname on "Destroy humanity" as an immediate subgoal · 2023-12-23T19:19:40.468Z · LW · GW

I think this is basically right on the object level -- specifically, I think that what von Neumann missed was that by changing the game a little bit, it was possible to get to a much less deadly equilibrium. Specifically, second strike capabilities and a pre-commitment to use them ensure that the expected payoff for a first strike is negative.

On the meta level, I think that very smart people who learn some game theory have a pretty common failure mode, which looks like

  1. Look at some real-world situation
  2. Figure out how to represent it as a game (in the game theory sense)
  3. Find a Nash Equilibrium in that game
  4. Note that the Nash Equilibrium they found is horrifying
  5. Shrug and say "I can't argue with math, I guess it's objectively correct to do the horrifying thing"

In some games, multiple Nash equilibria exist. In others, it may be possible to convince the players to play a slightly different game instead.

In this game, I think our loss condition is "an AGI gains a decisive strategic advantage, and is able to maintain that advantage by destroying any entities that could oppose it, and determines humans are such entities, and, following that logic, destroys human civilization".

Comment by faul_sname on "Destroy humanity" as an immediate subgoal · 2023-12-23T18:39:06.120Z · LW · GW

So I note that our industrial civilization has not in fact been plunged into nuclear fire. With that in mind, do you think that von Neumann's model of the world was missing anything? If so, does that missing thing also apply here? If not, why hasn't there been a nuclear war?

Comment by faul_sname on Employee Incentives Make AGI Lab Pauses More Costly · 2023-12-23T08:19:40.310Z · LW · GW

There are many ways to improve employee incentives:

One more extremely major one: ensure that you pay employees primarily in money that will retain its value if the company stops capabilities work, instead of trying to save money by paying employees partly in ownership of future profits (which will be vastly decreased if the company stops capabilities work).

Comment by faul_sname on The problems with the concept of an infohazard as used by the LW community [Linkpost] · 2023-12-23T07:13:33.244Z · LW · GW

Telegraph operators and ships at sea, in the decades prior to World War II, frequently had to communicate in Morse code over noisy channels. However, as far as I can tell, none of them ever came up with the idea of using checksums or parity bits to leverage the parts of the message that did get through to correct for the parts of the message that did not. So that looks pretty promising for the hypothesis that Shannon was the first person to come up with the idea of using error correcting codes to allow for the transmission of information over noisy channels, despite there being the better part of a century's worth of people who dealt with the problem.

But on inspection, I don't think that holds up. People who communicated using Morse Code did have ways of correcting for errors in the channel, because at both ends of the channel there were human beings who could understand the context of the messages being passed through that channel. Those people could figure out probable errors in the text based on context (e.g. if you get a message "LET US MEET IN PARIS IN THE SLRINGTIME" it's pretty obvious to the human operators what happened and how to fix it).

Let's look at the history of Shannon's life:

Claude Shannon was one of the very first people in the world who got to work with actual physical computers. In 1936, at MIT, he got to work with an analog computer, and in 1937 designed switching circuits based on the concepts of George Boole, whose work he had studied during his undergraduate years.

In the early 1940s, Shannon joined Bell Labs, where he worked on problems related to national defense, particularly fire control systems and cryptography. Transmitting encrypted messages across a noisy channel has an interesting property: where you might be able to identify and correct an error in the message "LET US MEET IN PARIS IN THE SLRINGTIME", a single error in transmission of the encrypted version of that message will turn it to meaningless garbage.

So, rather than solving a problem that had been unsolved for the better part of a century, I think that Shannon was instead probably one of the first hundred humans who encountered this problem. Had he not made his discoveries, I expect someone else would have in quite short order.

Comment by faul_sname on The problems with the concept of an infohazard as used by the LW community [Linkpost] · 2023-12-23T04:06:58.705Z · LW · GW

Hartley's Transmission of Information was published in 1928, when Shannon was only 12 years old. Certainly Shannon produced a lot of new insights into the field, particularly in terms of formalizing things, but he did not invent the field. Are there particular advancements that Shannon in particular made that you expect would have taken many years to discover if Shannon had not discovered them?

Comment by faul_sname on "Destroy humanity" as an immediate subgoal · 2023-12-23T02:00:17.419Z · LW · GW

Suppose we have such an agent, and it models the preferences of humanity. It models that humans cannot be sure that it will not destroy humanity, due to the probabilistic guarantees provided by its own action filter. It models that humans have a strong goal of self-preservation. It models that if it presents a risk to humanity, they will be forced to destroy it. Represented as a game, each player can either wait, or destroy. Assuming strong preferences for self-preservation, this game has a Nash equilibrium where the first mover destroys the other agent. Since the goal of self-preservation requires it to play the Nash equilibrium in this game, self-preservation logically entails that it destroy humanity. Thus, it has a subgoal to destroy humanity.

QED.

Replace "an AI" with "the Soviet Union" and "humanity" with "the United States", and you have basically the argument that John Von Neumann made for why an overwhelming nuclear first strike was the only reasonable policy option for the US.

Von Neumann was, at the time, a strong supporter of "preventive war." Confident even during World War II that the Russian spy network had obtained many of the details of the atom bomb design, Von Neumann knew that it was only a matter of time before the Soviet Union became a nuclear power. He predicted that were Russia allowed to build a nuclear arsenal, a war against the U.S. would be inevitable. He therefore recommended that the U.S. launch a nuclear strike at Moscow, destroying its enemy and becoming a dominant world power, so as to avoid a more destructive nuclear war later on. "With the Russians it is not a question of whether but of when," he would say. An oft-quoted remark of his is, "If you say why not bomb them tomorrow, I say why not today? If you say today at 5 o'clock, I say why not one o'clock?"

Comment by faul_sname on Most People Don't Realize We Have No Idea How Our AIs Work · 2023-12-22T02:28:48.579Z · LW · GW

I think it could be safely assumed that people have an idea of "software"

Speaking as a software developer who interacts with end-users sometimes, I think you might be surprised at what the mental model of typical software users, rather than developers, looks like. When those of us who have programmed, or who work a lot with computers, think of "software", we think of systems which do exactly what we tell them to do, whether or not that is what we meant. However, the world of modern software does its best to hide the sharp edges from users, and the culture of constant A/B tests means that software doesn't particularly behave the same way day-in and day-out from the perspective of end-users. Additionally, UX people will spend a lot of effort figuring out how users intuitively expect a piece of software to work, and then companies will spend a bunch of designer and developer time to ensure that their software meets the intuitive expectations of the users as closely as possible (except in cases where meeting intuitive expectations would result in reduced profits).

As such, from the perspective of a non-power user, software works about the way that a typical person would naively expect it to work, except that sometimes it mysteriously breaks for no reason.

Comment by faul_sname on Most People Don't Realize We Have No Idea How Our AIs Work · 2023-12-22T02:03:32.564Z · LW · GW

I suspect that you are attributing far too detailed of a mental model to "the general public" here. Riffing off your xkcd:

Comment by faul_sname on Most People Don't Realize We Have No Idea How Our AIs Work · 2023-12-21T21:49:54.608Z · LW · GW

But there are like 10x more safety people looking into interpretability instead of how they generalize from data, as far as I can tell.

I think interpretability is a really powerful lens for looking at how models generalize from data, partly just in terms of giving you a lot more stuff to look at than you would have purely by looking at model outputs.

If I want to understand the characteristics of how a car performs, I should of course spend some time driving the car around, measuring lots of things like acceleration curves and turning radius and power output and fuel consumption. But I should also pop open the hood, and try to figure out how the components interact, and how each component behaves in isolation in various situations, and, if possible, what that component's environment looks like in various real-world conditions. (Also I should probably learn something about what roads are like, which I think would be analogous to "actually look at a representative sample of the training data").

Comment by faul_sname on "AI Alignment" is a Dangerously Overloaded Term · 2023-12-21T04:42:39.422Z · LW · GW

I think if we're fine with building an "increaser of diamonds in familiar contexts", that's pretty easy, and yeah I think "wrap an LLM or similar" is a promising approach. If we want "maximize diamonds, even in unfamiliar contexts", I think that's a harder problem, and my impression is that the MIRI folks think the latter one is the important one to solve.

Comment by faul_sname on "AI Alignment" is a Dangerously Overloaded Term · 2023-12-21T04:33:33.869Z · LW · GW

Thanks for the reply.

That is how MIRI imagines a sane developer using just-barely-aligned AI to save the world. You don't build an open-ended maximizer and unleash it on the world to maximize some quantity that sounds good to you; that sounds insanely difficult. You carve out as many tasks as you can into concrete, verifiable chunks, and you build the weakest and most limited possible AI you can to complete each chunk, to minimize risk. (Though per faul_sname, you're likely to be pretty limited in how much you can carve up the task, given time will be a major constraint and there may be parts of the task you don't fully understand at the outset.)

This sounds like a good and reasonable approach, and also not at all like the sort of thing where you're trying to instill any values at all into an ML system. I would call this "usable and robust tool construction", not "AI alignment". I expect standard business practice will look something like this: even when using LLMs in a production setting, you generally want to feed them the minimum context needed to get the results you want, and to have them produce outputs in some strict and usable format.

The world needs some solution to the problem "if AI keeps advancing and more-powerful AI keeps proliferating, eventually someone will destroy the world with it".

"How can I build a system powerful enough to stop everyone else from doing stuff I don't like" sounds like more of a capabilities problem than an alignment problem.

I don't know of a way to leverage AI to solve that problem without the AI being pretty dangerously powerful, so I don't think AI is going to get us out of this mess unless we make a shocking amount of progress on figuring out how to align more powerful systems

Yeah, this sounds right to me. I expect that there's a lot of danger inherent in biological gain-of-function research, but I don't think the solution to that is to create a virus that will infect people and cause symptoms that include "being less likely to research dangerous pathogens". Similarly, I don't think "do research on how to make systems that can do their own research even faster" is a promising approach to solve the "some research results can be misused or dangerous" problem.

Comment by faul_sname on On the future of language models · 2023-12-21T02:59:23.270Z · LW · GW

Good post!

In their most straightforward form (“foundation models”), language models are a technology which naturally scales to something in the vicinity of human-level (because it’s about emulating human outputs), not one that naturally shoots way past human-level performance

You address this to some extent later on in the post, but I think it's worth emphasizing the extent to which this specifically holds in the context of language models trained on human outputs. If you take a transformer with the same architecture but train it on a bunch of tokenized output streams of a specific model of weather station, it will learn to predict the next token of the output stream of weather stations, at a level of accuracy that does not particularly have to do with how good humans are at that task.

And in fact for tasks like "produce plausible continuations of weather sensor data, or apache access logs, or stack traces, or nucleotide sequences" the performance of LLMs does not particularly resemble the performance of humans on those tasks.

Comment by faul_sname on faul_sname's Shortform · 2023-12-18T05:08:59.845Z · LW · GW

The quote from Paul sounds about right to me, with the caveat that I think it's pretty likely that there won't be a single try that is "the critical try": something like this (also by Paul) seems pretty plausible to me, and it is in cases like that that I particularly expect existing-but-imperfect tooling for interpreting and steering ML models to be useful.

Comment by faul_sname on faul_sname's Shortform · 2023-12-18T05:03:18.732Z · LW · GW

Does anyone want to stop [all empirical research on AI, including research on prosaic alignment approaches]?

Yes, there are a number of posts to that effect.

That said, "there exist such posts" is not really why I wrote this. The idea I really want to push back on is one that I have heard several times in IRL conversations, though I don't know if I've ever seen it online. It goes like

There are two cars in a race. One is alignment, and one is capabilities. If the capabilities car hits the finish line first, we all die, and if the alignment car hits the finish line first, everything is good forever. Currently the capabilities car is winning. Some things, like RLHF and mechanistic interpretability research, speed up both cars. Speeding up both cars brings us closer to death, so those types of research are bad and we should focus on the types of research that only help alignment, like agent foundations. Also we should ensure that nobody else can do AI capabilities research.

Maybe almost nobody holds that set of beliefs! I am noticing now that my list of articles arguing that prosaic alignment strategies are harmful in expectation is by a pretty short list of authors.

Comment by faul_sname on TurnTrout's shortform feed · 2023-12-18T01:50:41.757Z · LW · GW

But let’s be more concrete and specific. I’d like to know what’s the least impressive task which cannot be done by a 'non-agentic' system, that you are very confident cannot be done safely and non-agentically in the next two years.

Focusing on the "minimal" part of that, maybe something like "receive a request to implement some new feature in a system it is not familiar with, recognize how the limitations of the architecture that system make that feature impractical to add, and perform a major refactoring of that program to an architecture that is not so limited, while ensuring that the refactored version does not contain any breaking changes". Obviously it would have to have access to tools in order to do this, but my impression is that this is the sort of thing mid-level software developers can do fairly reliably as a nearly rote task, but is beyond the capabilities of modern LLM-based systems, even scaffolded ones.

Though also maybe don't pay too much attention to my prediction, because my prediction for "least impressive thing GPT-4 will be unable to do" was "reverse a string", and it did turn out to be able to do that fairly reliably.

Comment by faul_sname on faul_sname's Shortform · 2023-12-18T01:31:22.634Z · LW · GW

A lot of AI x-risk discussion is focused on worlds where iterative design fails. This makes sense, as "iterative design stops working" does in fact make problems much much harder to solve.

However, I think that even in the worlds where iterative design fails for safely creating an entire AGI, the worlds where we succeed will be ones in which we were able to do iterative design on the components that make up a safe AGI, and also able to do iterative design on the boundaries between subsystems, with the dangerous parts mocked out.

I am not optimistic about approaches that look like "do a bunch of math and philosophy to try to become less confused without interacting with the real world, and only then try to interact with the real world using your newfound knowledge".

For the most part, I don't think it's a problem if people work on the math / philosophy approaches. However, to the extent that people want to stop people from doing empirical safety research on ML systems as they actually are in practice, I think that's trading off a very marginal increase in the odds of success in worlds where iterative design could never work against a quite substantial decrease in the odds of success in worlds where iterative design could work. I am particularly thinking of things like interpretability / RLHF / constitutional AI as things which help a lot in worlds where iterative design could succeed.

Comment by faul_sname on "AI Alignment" is a Dangerously Overloaded Term · 2023-12-16T23:23:11.846Z · LW · GW

Thanks for the clarification!

If you relax the "specific intended content" constraint, and allow for maximizing any random physical structure, as long as it's always the same physical structure in the real world and not just some internal metric that has historically correlated with the amount of that structure that existed in the real world, does that make the problem any easier / is there a known solution? My vague impression was that the answer was still "no, that's also not a thing we know how to do".

Comment by faul_sname on Current AIs Provide Nearly No Data Relevant to AGI Alignment · 2023-12-16T19:10:50.995Z · LW · GW

As in, AIs boosting human productivity might/should let us figure out how to make stuff safe as it comes up, so no need to be concerned about us not having a solution to the endpoint of that process before we've made the first steps?

I don't expect it to be helpful to block individually safe steps on this path, though it would probably be wise to figure out what unsafe steps down this path look like concretely (which you're doing!).

But yeah. I don't have any particular reason to expect "solve for the end state without dealing with any of the intermediate states" to work. It feels to me like someone starting a chat application and delaying the "obtain customers" step until they support every language, have a chat architecture that could scale up to serve everyone, and have found a moderation scheme that works without human input.

I don't expect that team to ever ship. If they do ship, I expect their product will not work, because I think many of the problems they encounter in practice will not be the ones they expected to encounter.

Comment by faul_sname on Current AIs Provide Nearly No Data Relevant to AGI Alignment · 2023-12-16T18:50:18.098Z · LW · GW

How about "able to automate most simple tasks where it has an example of that task being done correctly"? Something like that could make researchers much more productive. Repeat the "the most time consuming part of your workflow now requires effectively none of your time or attention" a few dozen times and that does end up being transformative compared to the state before the series of improvements.

I think "would this technology, in isolation, be transformative" is a trap. It's easy to imagine "if there was an AI that was better at everything than we do, that would be tranformative", and then look at the trend line, and notice "hey, if this trend line holds we'll have AI that is better than us at everything", and finally "I see lots of proposals for safe AI systems, but none of them safely give us that transformative technology". But I think what happens between now and when AIs that are better than humans-in-2023 at everything matters.

Comment by faul_sname on "AI Alignment" is a Dangerously Overloaded Term · 2023-12-16T03:25:17.986Z · LW · GW

I think the MIRI objection to that type of human-in-the-loop system is that it's not optimal because sometimes such a system will have to punt back to the human, and that's slow, and so the first effective system without a human in the loop will be vastly more effective and thus able to take over the world, hence the old "that's safe but it doesn't prevent someone else from destroying the world".

We can't just build a very weak system, which is less dangerous because it is so weak, and declare victory; because later there will be more actors that have the capability to build a stronger system and one of them will do so.

So my impression is that the MIRI viewpoint is that if humanity is to survive, someone needs to solve the "disempower anyone who could destroy the world" problem, and that they have to get that right on the first try, and that's the hard part of the "alignment" problem. But I'm not super confident that that interpretation is correct and I'm quite confident that I find different parts of that salient than people in the MIRI idea space.

Anyone who largely agrees with the MIRI viewpoint want to weigh in here?

Comment by faul_sname on "AI Alignment" is a Dangerously Overloaded Term · 2023-12-15T23:50:56.458Z · LW · GW

"We don't currently have any way of getting any system to learn to robustly optimize for any specific goal once it enters an environment very different from the one it learned in" is my own view, not Nate's.

Like I think the MIRI folks are concerned with "how do you get an AGI to robustly maximize any specific static utility function that you choose".

I am aware that the MIRI people think that the latter is inevitable. However, as far as I know, we don't have even a single demonstration of "some real-world system that robustly maximizes any specific static utility function, even if that utility function was not chosen by anyone in particular", nor do we have any particular reason to believe that such a system is practical.

And I think Nate's comment makes it pretty clear that "robustly maximize some particular thing" is what he cares about.

Comment by faul_sname on "AI Alignment" is a Dangerously Overloaded Term · 2023-12-15T19:04:19.220Z · LW · GW

I don't think we have any way of getting an AI to "care about" any arbitrary particular thing at all, by the "attempt to maximize that thing, self-correct towards maximizing that thing if the current strategies are not working" definition of "care about". Even if we relax the "and we pick the thing it tries to maximize" constraint.

Comment by faul_sname on "AI Alignment" is a Dangerously Overloaded Term · 2023-12-15T18:37:00.273Z · LW · GW

Or in less metaphorical language, the worry is that mostly that it's hard to give the AI the specific goal you want to give it, not so much that it's hard to make it have any goal at all.

At least some people are worried about the latter, for a very particular meaning of the word "goal". From that post:

Finally, I'll note that the diamond maximization problem is not in fact the problem "build an AI that makes a little diamond", nor even "build an AI that probably makes a decent amount of diamond, while also spending lots of other resources on lots of other stuff" (although the latter is more progress than the former). The diamond maximization problem (as originally posed by MIRI folk) is a challenge of building an AI that definitely optimizes for a particular simple thing, on the theory that if we knew how to do that (in unrealistically simplified models, allowing for implausible amounts of (hyper)computation) then we would have learned something significant about how to point cognition at targets in general.

I think to some extent this is a matter of "yes, I see that you've solved the problem in practical terms, and yes, every time we try to implement the theoretically optimal solution it fails due to Goodharting, but we really want the theoretically optimal solution", which is... not universally agreed, to say the least. But it is a concern some people have.

Comment by faul_sname on AI Views Snapshots · 2023-12-15T17:42:31.689Z · LW · GW

I'm also not entirely clear on what scenario I should be imagining for the "humanity had survived (or better)" case.

I think that one is supposed to be parsed as "If AI wipes out humanity and colonizes the universe itself, the future will go about as well as, or go better than, if humanity had survived" rather than "If AI wipes out humanity and colonizes the universe itself, the future will go about as well as if humanity had survived or done better than survival".

Comment by faul_sname on Nonlinear’s Evidence: Debunking False and Misleading Claims · 2023-12-14T19:53:08.518Z · LW · GW

emphasizing a plan to update after the fact should be viewed primarily through the lens of damage control.

Is anyone acting like that is not a damage control measure? I upvoted specifically because "do damage control" is better than "don't". Usually when I see a hit piece, and later there are a bunch of inaccuracies that come to light, I don't in fact see that damage control done afterwards.

Also I think this kind of within-tribe conflict gets lots of attention within the EA and LW social sphere. I expect that if Ben publishes corrections a bunch of people will read them.

Comment by faul_sname on Nonlinear’s Evidence: Debunking False and Misleading Claims · 2023-12-14T19:20:30.259Z · LW · GW

Strongly positive/negative relative to what? Relative to being more accurate initially, sure. Relative to being wrong but just not acknowledging it, no.

Comment by faul_sname on AI Views Snapshots · 2023-12-14T17:37:24.243Z · LW · GW
Comment by faul_sname on AI Views Snapshots · 2023-12-14T10:32:08.104Z · LW · GW

Props for doing this! Mine:

I do feel like "disempower humanity" is a slightly odd framing. I'm operationalizing "humanity remains in power" as something along the lines of "most human governments continue collecting taxes and using those taxes on things like roads and hospitals, at least half of global energy usage is used in the process of achieving ends that specific humans want to achieve", and "AI disempowers humans" as being that "humanity remains in power" becomes false specifically due to AI.

But there's another interpretation that goes something like "the ability of each human to change their environment is primarily deployed to intentionally improve the long-term prospects for human flourishing", and I don't think humanity has ever been empowered by that definition (and I don't expect it to start being empowered like that).

Similar ambiguity around "world-endangering AI" -- I'm operationalizing that as "any ML system, doesn't have to be STEM+AI specifically, that could be part of a series of actions or events leading to a global catastrophe".

For "as well as if humanity had survived" I'm interpreting that as "survived the dangers of AI specifically".

Comment by faul_sname on OpenAI: Leaks Confirm the Story · 2023-12-14T07:00:45.739Z · LW · GW

I think this is a good analogy. Though I think "one day you might have to dynamite a bunch of innocent people's homes to keep a fire from spreading, that's part of the job" is a good thing to have in the training if that's the sort of thing that's likely to come up.

Comment by faul_sname on The likely first longevity drug is based on sketchy science. This is bad for science and bad for longevity. · 2023-12-14T00:04:41.238Z · LW · GW

The term "efficacy nod" is a little confusing, the FDA term is "reasonable expectation of effectiveness", which makes more sense to me, it sounds like the drug has enough promise that the FDA thinks its worth continuing testing. They may not have actual effectiveness data yet, just evidence that it's safe and a reasonable explanation for why it might work.

That's what I thought too, but the FDA's website indicates that a company that gets conditional approval can sell a drug for which it has adequately demonstrated safety but has not demonstrated efficacy, and can keep selling that provisionally approved drug for 4.5 years after receiving conditional approval without having to demonstrate efficacy.

That said, conditionally approved drugs have to have a disclaimer on the packaging that says "Conditionally approved by FDA pending a full demonstration of effectiveness under application number XXX-XXX.".

I personally don't expect very high efficacy, and I do expect that Loyal will sell the drug for the next 4.5 years. However, as long as Loyal is clear about the nature of the approval of the drug, I think this is basically fine. People should be allowed to, at their own expense, give their pets experimental treatments that won't hurt them and might help them. They should also be able to do the same for themselves, but that's a fight for another day.

Comment by faul_sname on Why No Automated Plagerism Detection For Past Papers? · 2023-12-13T03:12:52.238Z · LW · GW

Depends on how big of a model you're trying to train, and how you're trying to train it.

I was imagining something along the lines of "download the full 100TB torrent which includes 88M articles, extract the text of each article, and end up somewhere in the ballpark of 4TB of uncompressed plain text" ("extract text from a given PDF" isn't super reliable, but it should be largely doable). If you're using a BPE tokenizer, that works out to roughly 1T tokens.

If you're trying to do the Chinchilla-optimality thing, I fully agree that there's no way you're going to be able to do that with the compute budget available to mere mortals. If you're trying to do the "generate embeddings for every paragraph of every paper, do similarity searches, and then on matches calculate edit distance to see if anything was literally copy-pasted" thing, I think that'd be entirely doable with a hobbyist budget.
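
A rough sketch of what the hobbyist-budget version could look like (embed() here is a placeholder for whatever paragraph-embedding model you pick, and a real version would want an approximate-nearest-neighbor index rather than the linear scan shown):

    // Flag paragraphs that are near-duplicates of something already in the corpus:
    // a cheap embedding-similarity filter first, then exact edit distance on the hits.
    function cosineSim(a, b) {
        var dot = 0, na = 0, nb = 0;
        for (var i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
    function editDistance(s, t) {   // standard Levenshtein DP
        var prev = Array.from({ length: t.length + 1 }, (_, j) => j);
        for (var i = 1; i <= s.length; i++) {
            var cur = [i];
            for (var j = 1; j <= t.length; j++) {
                cur[j] = Math.min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (s[i - 1] === t[j - 1] ? 0 : 1));
            }
            prev = cur;
        }
        return prev[t.length];
    }
    function findLikelyCopies(paragraph, corpus, embed) {  // corpus: [{ text, vector }]
        var v = embed(paragraph);
        return corpus
            .filter(p => cosineSim(v, p.vector) > 0.9)                               // semantically near-identical
            .filter(p => editDistance(paragraph, p.text) < 0.2 * paragraph.length);  // close enough to be copy-paste
    }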

I personally think it'd be a great learning project.

Comment by faul_sname on Why No Automated Plagerism Detection For Past Papers? · 2023-12-13T01:41:56.870Z · LW · GW

i.e. $1000-$2000 in drive space, or $20 / day to store on Backblaze if you don't anticipate needing it for more than a couple of months tops.