Regression To The Mean [Draft][Request for Feedback] 2012-06-22T17:55:51.917Z
The Dark Arts: A Beginner's Guide 2012-01-21T07:05:05.264Z
What would you do with a financial safety net? 2012-01-16T23:38:18.978Z


Comment by faul_sname on Focus on the Hardest Part First · 2023-09-15T03:23:47.794Z · LW · GW

Your alignment example is very strange. C is basically "Solve Alignment" whereas A and B taken together do not constitute an alignment solution at all.

That is rather my point, yes. A solution to C would be most of the way to a solution to AI alignment, and would also solve management / resource allocation / corruption / many other problems. However, as things stand now a rather significant fraction of the entire world economy is directed towards mitigating the harms caused by the inability of principals to fully trust agents to act in their best interests. As valuable as a solution to C would be, I see no particular reason to expect it to be possible, and an extremely high lower bound on the difficulty of the problem (most salient to me is the multi-trillion-dollar value that could be captured from someone who did robustly solve this problem).

I wish anyone who decides to tackle C the best of luck, but I expect that the median outcome of such work would be something of no value, and the 99th percentile outcome of such work would be something like a clever incentive mechanism which e.g. bounds the extent to which the actions of an agent can harm the principal while appearing to be helpful to zero in a perfect-information world, and which degrades gracefully in a world of imperfect information.

In the meantime, I expect that attacking subproblems A and B will have a nonzero amount of value even in worlds where nobody finds a robust solution to subproblem C (both because better versions of A and B may be able to shore up an imperfect or partial solution to C, and also because while robust solutions to A/B/C may be one sufficient set of components to solve your overall problem, they may not be the only sufficient set of components, and by having solutions to A and B you may be able to find alternatives to C).

In alignment, if you don't solve the problem you die. You can't solve alignment 90% and then deploy an AI built with this 90% level of understanding because then the AI will still be approximately 0% aligned and kill you.

Whether this is true (in the narrow sense of "the AI kills you because it made an explicit calculation and determined that the optimal course of action was to perform behaviors that result in your death" rather than in the broad sense of "you die for reasons that would still have killed you even in the absence of the particular AI agent we're talking about") is as far as I know still an open question, and it seems to me one where preliminary signs are pointing against a misaligned singleton being the way we die.

The crux might be whether you expect a single recursively-self-improving agent to take control of the light cone, or whether you don't expect a future where any individual agent can unilaterally determine the contents of the light cone.

Thank you for telling me about the CAP problem, I did not know about it.

For reference, the CAP problem is one of those problems that sounds worse in theory than it actually is in practice: for most purposes, significant partitions are rare, and "when a partition happens, data may not be fully up-to-date, and writes may be dropped or result in conflicts" is usually a good-enough solution in those rare cases.

Comment by faul_sname on Focus on the Hardest Part First · 2023-09-12T00:28:16.701Z · LW · GW

I assert that if you have subproblems A and B which are tractable and actionable, and subproblem C which is a scary convoluted mess where you wouldn't even know where to start, that is not an indication that you should jump right into trying to solve subproblem C. Instead, I assert that this is an indication that you should take a good hard look at what you're actually trying to accomplish and figure out

  1. Is there any particular reason to expect this problem as a whole to be solvable at all, even in principle?
  2. Is there a way to get most of the value of solving the entire problem without attacking the hardest part?
  3. Is it cheap to check whether the tractable-seeming approaches to A and B will actually work in practice?

If the answers are "yes", "no", "no", then I am inclined to agree that attacking C is the way to go. But also I think that combination of answers barely ever happens.

Concrete example time:

You're starting an app-based, two-sided childcare marketplace. You have identified the following problems, which are limiting your company:

A: In order to ensure that each daycare has enough children to justify staying on your platform, while also ensuring that enough availability remains that new customers can find childcare near them, you need to build feedback loops into your marketing tools for both suppliers and customers such that marketing spend will be distributed according to whichever side of the marketplace needs more people at any given time, on a granular per-location basis.

B: In order to build trust and minimize legal risk, you need to create policies and procedures to ensure that all childcare providers are appropriately licensed, background-checked, and insured, and to ensure that they remain eligible to provide childcare services (e.g. regularly requiring them to provide updated documents, scheduled and random inspections, etc).

C: You aim to guarantee that what customers see in terms of availability is what they get, and likewise, that providers can immediately see when a slot is booked. People also need to be able to access your system at any time. You determine that what you need to do is make sure that the data you show your users on their own devices is always consistent with the data you have on your own servers and the state of the world, and always available (you shouldn't lose access to your calendar just because your wifi went out). You reduce this problem to the CAP problem.

In this scenario, I would say that "try to solve the CAP problem straight off the bat" is very much the wrong approach, and you should instead try to attack subproblems A and B.

As analogies to alignment go, I'm thinking approximately

A. How can we determine what a model has learned about the structure of the world by examining the process by which it converts observations about the current state into predictions about the next state (e.g. mechanistic interpretability)

B. Come up with a coherent metric that measures how "in-distribution" a given input is to a given model, to detect cases where the model is operating outside of its training distribution

C. Come up with a solution to the principal-agent problem.

In my ideal world, there are people working on all three subproblems, just on the off-chance that C is solvable. But in terms of things-that-actually-help-with-the-critical-path, I expect most of them to come from people who are working on A and B, or people who build off the work of the people working on A and B.

I am curious if you are thinking of different concrete examples of A/B/C as they come to alignment though.

Comment by faul_sname on What is to be done? (About the profit motive) · 2023-09-09T21:14:39.981Z · LW · GW

In other words, to maximize the chance for aligned AI, we must first make an aligned society.

"An aligned society" sounds like a worthy goal, but I'm not sure who "we" is in terms of specific people who can take specific actions towards that end.

I think proposals like this would benefit from specifying what the minimum viable "we" for the proposal to work is.

Comment by faul_sname on Have Attention Spans Been Declining? · 2023-09-08T19:03:48.118Z · LW · GW

I suspect it was supposed to be a joke about attention spans

Comment by faul_sname on AI #27: Portents of Gemini · 2023-09-01T04:40:35.578Z · LW · GW

You store everything on a cloud instance, where you don’t get to see the model weights and they don’t get to see your data either, and checks are made only to ensure you are within terms of service or any legal restrictions.

Is it actually possible to build a fine-tuning-and-model-hosting product such that

  1. The customer can't access the model weights
  2. The host can't access the training data, or the inputs or outputs of inference (and this "can't" is in the cryptography sense not the legal sense, because otherwise the host is a giant juicy target for hacking by state actors)
  3. The model can be fine-tuned based on customer data
  4. The system does not cost multiple orders of magnitude more than an alternative system which did not have these constraints

Maybe there's something extremely clever you can do along the lines of "homomorphic encryption but performant and parallelizable" but if there is I am not aware of it. Nor are e.g. the folks who manage host, which is a GPU sharing platform. I'm sure they would like to be able to write something more reassuring in the "security" section of their FAQ than "[the providers on our platform] have little to gain and much to lose from stealing customer data". So if there's a solution here I don't think it's a well-known one.

My impression is that a robust solution to this problem is effectively a license to print money. New EA cause area and funding source?

Comment by faul_sname on Biosecurity Culture, Computer Security Culture · 2023-08-31T00:02:19.243Z · LW · GW

One key difference I see is that tremendous amounts of fungible value are locked away behind (hopefully) secure computing infrastructure, so in a world with keep-quiet norms there would be a tremendous financial incentive to defect on those norms.

As far as I know, no corresponding financial incentive exists for biosecurity (unless you count stuff like "antibiotics are an exploit against the biology of bacteria that people will pay lots of money for").

Comment by faul_sname on Digital brains beat biological ones because diffusion is too slow · 2023-08-27T05:08:55.243Z · LW · GW

Ultimately, alignment is whatever makes turning on an AI a good idea rather than a bad idea.

This is pithy, but I don't think it's a definition of alignment that points at a real property of an agent (as opposed to a property of the entire universe, including the agent).

If we have an AI which controls which train goes on which track, and can detect where all the trains in its network are but not whether or not there is anything on the tracks, whether or not this AI is "aligned" shouldn't depend on whether or not anyone happens to be on the tracks (which, again, the AI can't even detect).

The "things should be better instead of worse" problem is real and important, but it is much larger than anything that can reasonably be described as "the alignment problem".

Comment by faul_sname on "Throwing Exceptions" Is A Strange Programming Pattern · 2023-08-21T20:39:42.211Z · LW · GW

Programmer by trade here.

Philosophically, I view exceptions as the programmer saying "we have entered a state that we either cannot or will not handle correctly, and so we will punt this back up the call stack, and the caller can decide whether to abort, retry, or do something else". Frequently, the reason for being unable to handle that state is a lack of context -- for example, if someone is writing library code that deals with HTTP requests, "what should the program do in the event of a network issue" is something that the writer of the library cannot answer (because the answer may be different in different contexts). In these cases, punting the decision up the call stack seems to be the obviously correct thing (though there is a bit of a holy war in programming over whether it is better to do this explicitly or implicitly).

Both sides of that holy war will generally agree that thrown exceptions are a slightly odd pattern. In terms of why one might want to use that odd pattern, it's easiest to see the advantages of the pattern by looking at what happens when you remove it. One alternative to using that pattern is to do what Rust does, and return Result<oktype,errtype> for everything which can fail.

So let's take an example Rust program which fetches an OpenAPI schema from a server, then makes a GET request against an endpoint, and determines whether the result matches the schema. This is a fairly simple task, and yet with explicit error handling (and without using the try operator, which is Rust's answer to thrown exceptions) it looks like this. Happy path code is mixed with error handling code, and as such it can be difficult to verify the correctness of either when reading the code.
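As a rough illustration of that mixing, here is a minimal sketch in Python rather than Rust. It is not the program described above: `fetch_schema`, `fetch_endpoint`, and `matches` are hypothetical stand-ins that do no real network I/O, and `attempt` is a made-up helper that turns a raised exception into an explicit `(value, error)` pair. The first style checks for an error after every step; the second lets exceptions propagate to the caller.

```python
# Hypothetical stand-ins for the real network and validation steps.
def fetch_schema(url):
    if not url.startswith("https://"):
        raise ValueError(f"bad schema url: {url}")
    return {"required": ["id", "name"]}

def fetch_endpoint(url):
    if not url.startswith("https://"):
        raise ValueError(f"bad endpoint url: {url}")
    return {"id": 1, "name": "example"}

def matches(schema, body):
    return all(key in body for key in schema["required"])

def attempt(fn, *args):
    """Convert a raised exception into an explicit (value, error) pair."""
    try:
        return fn(*args), None
    except ValueError as err:
        return None, err

# Style 1: explicit errors. Every step is followed by its own error check.
def check_explicit(schema_url, endpoint_url):
    schema, err = attempt(fetch_schema, schema_url)
    if err is not None:
        return None, err
    body, err = attempt(fetch_endpoint, endpoint_url)
    if err is not None:
        return None, err
    return matches(schema, body), None

# Style 2: thrown exceptions. The happy path reads straight through, and
# the caller decides whether to abort, retry, or do something else.
def check_with_exceptions(schema_url, endpoint_url):
    schema = fetch_schema(schema_url)
    body = fetch_endpoint(endpoint_url)
    return matches(schema, body)
```

With only two fallible steps the difference is small, but the explicit version grows by one check per step while the exception version grows by one line per step.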

If you want to argue that the improved happy-path-readability of code which uses thrown exceptions is not worth it, I can get back to you once I finish convincing people that vim is obviously better than emacs.

Comment by faul_sname on Summary of and Thoughts on the Hotz/Yudkowsky Debate · 2023-08-17T21:23:22.313Z · LW · GW

Thank you for providing those resources. They weren't quite what I was hoping to see, but they did help me see that I did not correctly describe what I was looking for.

Specifically, if we use the first paper's definition that "adversarially robust" means "inexploitable -- i.e. the agent will never cooperate with something that would defect against it, but may defect even if cooperating would lead to a C/C outcome and defecting would lead to D/D", one example of "an adversarially robust decision theory which does not require infinite compute" is "DefectBot" (which, in the language of the third paper, is a special case of Defect-Unless-Proof-Of-Cooperation-bot (DUPOC(0))).
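To make the DefectBot point concrete, here is a toy sketch of an iterated prisoner's dilemma. The payoff numbers (3/0/5/1) are the conventional textbook values and are an assumption on my part, as are the function names and the choice of TitForTat as the comparison strategy; none of this is taken from the papers.

```python
# Payoffs to (me, them), using the standard PD ordering T > R > P > S.
# The specific numbers are an assumption (the usual textbook values).
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def defect_bot(opponent_history):
    return "D"

def tit_for_tat(opponent_history):
    return opponent_history[-1] if opponent_history else "C"

def play(a, b, rounds=10):
    """Return both total scores and how often `a` played C into a D."""
    a_seen, b_seen = [], []  # moves each player has seen from the other
    score_a = score_b = exploited = 0
    for _ in range(rounds):
        move_a, move_b = a(a_seen), b(b_seen)
        pa, pb = PAYOFF[(move_a, move_b)]
        score_a += pa
        score_b += pb
        exploited += (move_a == "C" and move_b == "D")
        a_seen.append(move_b)
        b_seen.append(move_a)
    return score_a, score_b, exploited

print(play(defect_bot, tit_for_tat))   # (14, 9, 0)
print(play(tit_for_tat, tit_for_tat))  # (30, 30, 0)
```

DefectBot is never exploited, but over ten rounds it scores 14 against TitForTat, while two TitForTats score 30 each against one another. Inexploitability on its own is cheap.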

What I actually want is an example of a concrete system that is

  1. Inexploitable (or nearly so): This system will never (or rarely) play C against something that will play D against it.
  2. Competitive: There is no other strategy which can, in certain environments, get long-term better outcomes than this strategy by sacrificing inexploitability-in-theory for performance-in-its-actual-environment-in-practice (for example, I note that in the prisoner's dilemma tournament back in 2013, the actual winner was a RandomBot despite some attempts to enter FairBot and friends, though also a lot of the bots in that tournament had Problems)
  3. Computationally tractable.

Ideally, it would also be

  1. Robust to the agents making different predictions about the effects of their actions. I honestly don't know what a solution to that problem would look like, even in theory, but "able to operate effectively in a world where not all effects of your actions are known in advance" seems like an important thing for a decision theory.
  2. Robust to the "trusting trust" problem (i.e. the issue of "how do you know that the source code you received is what the other agent is actually running"). Though if you have a solution for this problem you might not even need a solution to a lot of the other problems, because a solution to this problem implies an extremely powerful already-existing coordination mechanism (e.g. "all manufactured hardware has preloaded spyware from some trusted third party that lives in a secure enclave and can make a verifiable signed report of the exact contents of the memory and storage of that computer").

In any case, it may be time to run another PD tournament. Perhaps this time with strategies described in English and "evaluated" by an LLM, since "write a program that does the thing you want" seems to have been the blocking step for things people wanted to do in previous submissions.

Edit: I would be very curious to hear from the person who strong-disagreed with this about what, specifically, their disagreement is? I presume that the disagreement is not with my statement that I could have phrased my first comment better, but it could plausibly be any of "the set of desired characteristics is not a useful one", "no, actually, we don't need another PD tournament", or "We should have another PD tournament, but having the strategies be written in English and executed by asking an LLM what the policy does is a terrible idea".

Comment by faul_sname on Summary of and Thoughts on the Hotz/Yudkowsky Debate · 2023-08-17T01:50:20.229Z · LW · GW

1:24:00 Hotz says this is the whole crux and we got to something awesome here. Asserts that provable prisoner’s dilemma cooperation is impossible so we don’t have to worry about this scenario, everything will be defecting on everything constantly for all time, and also that’s great. Yudkowsky says the ASIs are highly motivated to find a solution and are smart enough to do so, does not mention that we have decision theories and methods that already successfully do this given ASIs (which we do).

We do? Can you point out what these methods are, and ideally some concrete systems which use them that have been demonstrated to be effective in e.g. one of the prisoner's dilemma tournaments?

Because my impression is that an adversarially robust decision theory which does not require infinite compute is very much not a thing we have.

Comment by faul_sname on The cone of freedom (or, freedom might only be instrumentally valuable) · 2023-07-24T22:22:32.151Z · LW · GW

It apparently means "since" in Shakespearean English.

Comment by faul_sname on Examples of Prompts that Make GPT-4 Output Falsehoods · 2023-07-23T06:38:22.840Z · LW · GW

It's hedging for the possibility that the isotope ratios are changing over time due to the behaviors of intelligent agents like humans. Or at least that's my headcanon.

Comment by faul_sname on AI #16: AI in the UK · 2023-06-15T22:51:51.287Z · LW · GW

Sayash Kapoor and Arvind Narayanan say licensing of models wouldn’t work because it is unenforceable, and also it would stifle competition and worsen AI risks. I notice those two claims tend to correlate a lot, despite it being really very hard for both of them to be true at once – either you are stifling the competition or you’re not, although there is a possible failure mode where you harm other efforts but not enough. The claimed ‘concentration risks’ and ‘security vulnerabilities’ do not engage with the logic behind the relevant extinction risks.

Making it harder to legally use models accomplishes two things:

  1. Decreases the number of people who use those models
  2. Among the people who are still using the models, increases the fraction of them who broke laws to do that.

Consider the situation with opiates in the US: our attempts to erect legal barriers for people obtaining opiates have indeed reduced the number of people legally obtaining opiates, and probably even reduced total opiate consumption, but at the cost that a lot of people were driven to buy their opiates illegally instead of going through medical channels.

I don't expect computing power sufficient to train powerful models to be easier to control than opiates, in worlds where doom looks like rapid algorithmic advancements that decrease the resource requirements to train and run powerful models by orders of magnitude.

Comment by faul_sname on Why "AI alignment" would better be renamed into "Artificial Intention research" · 2023-06-15T20:22:00.967Z · LW · GW

So the idea is to use "Artificial Intention" to specifically speak of the subset of concerns about what outcomes an artificial system will try to steer for, rather than the concerns about the world-states that will result in practice from the interaction of that artificial system's steering plus the steering of everything else in the world?

Makes sense. I expect it's valuable to also have a term for the bit where you can end up in a situation that nobody was steering for due to the interaction of multiple systems, but explicitly separating those concerns is probably a good idea.

Comment by faul_sname on Why "AI alignment" would better be renamed into "Artificial Intention research" · 2023-06-15T17:22:40.763Z · LW · GW

I think the issues you point out with the "alignment" name are real issues. That said, the word "intent" comes with its own issues.

Intention doesn't have to be conscious or communicable. It is just a preference for some futures over others, inferred as an explanation for behavior that chooses some future over others. Like, even single celled organisms have basic intentions if they move towards nutrients or away from bad temperatures.

I don't think "intention" is necessarily the best word for this unless you go full POSIWID. A goose does not "intend" to drag all vaguely egg-shaped objects to her nest and sit on them, in the sense that I don't think geese prefer sitting on a clutch of eggs over a clutch of eggs, a lightbulb, and a wooden block. And yet that is the expressed behavior anyway, because lightbulbs were rare and eggs that rolled out of the nest common in the ancestral environment.

I think "artificial system fitness-for-purpose" might come closer to gesturing about what "AI alignment" is pointing at (including being explicit about the bit that it's a 2-place term), but at the cost of being extremely not catchy.

Comment by faul_sname on The Dictatorship Problem · 2023-06-12T03:53:19.991Z · LW · GW

and freezing bank accounts of people whose only crime was donating money to the protesters

Slightly off topic, but is this a thing that was actually verified to have happened? The only case I had heard of was the "Brianne from Chilliwack" one that seemed not to pan out as real as far as I can tell.

(Asking because at the time there was quite a bit of discussion about whether the overreach was "trying to punish protestors directly" or "deliberately trying to create a chilling effect on any support for protests")

Comment by faul_sname on Yudkowsky vs Hanson on FOOM: Whose Predictions Were Better? · 2023-06-10T03:07:18.493Z · LW · GW

There's no way they could interoperate without massive computing and building a new model.

It historically has been shown that one can interpolate between a vision model and a language model[1]. And, more recently, it has been shown that yes, you can use a fancy transformer to map between intermediate representations in your image and text models, but you don't have to do that and in fact it works fine[2] to just use your frozen image encoder, then a linear mapping (!), then your text decoder.

I personally expect a similar phenomenon if you use the first half of an English-only pretrained language model and the second half of a Japanese-only pretrained language model -- you might not literally be able to use a linear mapping as above, but I expect you could use a quite cheap mapping. That said, I am not aware of anyone who has actually attempted the thing so I could be wrong that the result from [2] will generalize that far.

(Aside: was that a typo, or did you intend to say "compute" instead of "computing power"?)

Yeah, I did mean "computing power" there. I think it's just a weird way that people in my industry use words.[3]

  1. ^

    Example: DeepMind's Flamingo, which demonstrated that it was possible at all to take a pretrained language model and a pretrained vision model, and glue them together into a multimodal model, and that doing so produced SOTA results on a number of benchmarks. See also this paper, also out of DeepMind.

  2. ^
  3. ^

    For example, see this HN discussion about it. See also the "compute" section of this post, which talks about things that are "compute-bound" rather than "bounded on the amount of available computing power".

    Why waste time use lot word when few word do trick?

Comment by faul_sname on Steering GPT-2-XL by adding an activation vector · 2023-06-05T22:56:48.468Z · LW · GW

Your colab's "Check it can speak French" section seems to be a stub.


Note that all of the activation addition coefficients are 1, and your code generates 56 additions, so we're adding a "coefficient 56" steering vector to forward passes. This should probably be substantially smaller. I haven't examined this yet.

Updated the colab to try out this approach with a range of coefficients.

  • From 0.001 to 0.01 seems to have very little effect ("He oversaw a handful of slow-moving major projects—such as the "Waterfront Park" which cost $18 million to build—and implemented a series of rapidly reforming safety ordinances")
  • 0.02 to 0.1 seems to have effects like "model lapses in and out of French" and "names look French" ("In 1955, sent Soups Maryaine Hagné de la Breaise (de l'architecture spécialiste de la site des associations Actualities Mélenziques de New Orleans) as the journalist, known as a "pig cure," and then as "weird" mayor, in lieu of actualizing their real grievances.")
  • 0.2 to 5 seems to steer the model to switch from English to French-shaped text ("1950 vivienes an un qué de neous nechien en zanappressant.")
  • At 10, the model seems to decide that words like "le" and "en" and "mal" are as French as things get ("le le enne les le le dedan le renous en le arriu du recenac")

However, neither the steered nor the unsteered French is particularly coherent. I think GPT-2-XL and GPT-2-small are both incapable of actually speaking complicated French, and so we might look into larger models.

Confirmed that GPT-2-XL seems to also be unable to speak French. Continuing to scale up from there, I find that gpt-neo-2.7B can kinda-sorta speak sensical French. GPT-J-6B OOMs on me on Colab Pro, but I think I may be able to do some hackery with init_empty_weights() / load_checkpoint_and_dispatch(), or, failing that, use an 8 bit or even 4 bit version of GPT-J-6B -- I honestly doubt the loss in precision really matters for algebraic value editing, considering that the level of precision starts off at "take the difference between two things that seem like they might plausibly have a similar relationship".

Update: I have gotten GPT-J-6B up and running on Colab (link, it's a new one), and working alright with TransformerLens and montemac's algebraic_value_editing repo. GPT-J-6B is capable of speaking French, so I think this is a good model to do testing on. Now I'm fighting with finding a good coefficient / position to reproduce the original Hate->Love vector result.

Comment by faul_sname on Yudkowsky vs Hanson on FOOM: Whose Predictions Were Better? · 2023-06-03T04:12:30.300Z · LW · GW

Let's say we have a language model that only knows how to speak English and a second one that only knows how to speak Japanese. Is your expectation that there would be no way to glue these two LLMs together to build an English-to-Japanese translator such that training the "glue" takes <1% of the compute used to train the independent models?

I weakly expect the opposite, largely based on stuff like this, and based on playing around with using algebraic value editing to get an LLM to output French in response to English (but also note that the LLM I did that with knew English and the general shape of what French looks like, so there's no guarantee that result scales or would transfer the way I'm imagining).

Comment by faul_sname on The Crux List · 2023-06-01T17:54:20.623Z · LW · GW

I think we also care about how fast it gets arbitrarily capable. Consider a system which finds an approach which can measure approximate actions-in-the-world-Elo (where an entity with an advantage of 200 on their actions-in-the-world-Elo score will choose a better action 76% of the time), but it's using a "mutate and test" method over an exponentially large space, such that each successive 100 point gain takes 5x as long as the previous one, and it starts out with an actions-in-the-world-Elo 1000 points lower than an average human and an initial time-to-next-improvement of 1 week. That hypothetical system is technically a recursively self-improving intelligence that will eventually reach any point of capability, but it's not really one we need to worry that much about unless it finds techniques to dramatically reduce the search space.
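For a sense of scale, here is the arithmetic as a short sketch. It assumes the standard Elo expected-score formula, and it assumes the first 100 point gain takes 1 week with each later gain taking 5x as long as the one before; the function names are made up for illustration.

```python
def win_probability(elo_advantage):
    """Standard Elo expected score for the higher-rated side."""
    return 1 / (1 + 10 ** (-elo_advantage / 400))

def weeks_to_close_gap(gap=1000, step=100, first_step_weeks=1, slowdown=5):
    """Total search time if each successive `step`-point gain takes
    `slowdown` times as long as the previous one (an assumed reading)."""
    steps = gap // step
    return sum(first_step_weeks * slowdown ** k for k in range(steps))

print(round(win_probability(200), 2))     # 0.76
print(weeks_to_close_gap())               # 2441406
print(round(weeks_to_close_gap() / 52))   # 46950
```

Under those assumptions the system needs about 2.4 million weeks, or roughly 47,000 years, just to reach average-human level, and the last 100 point step alone accounts for about 80% of that.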

Like I suspect that GPT-4 is not actually very far from the ability to come up with a fine-tuning strategy for any task you care to give it, and to create a simple directory of fine-tuned models, and to create a prompt which describes to it how to use that directory of fine-tuned models. But fine-tuning seems to take an exponential increase in data for each linear increase in performance, so that's still not a terribly threatening "AGI".

Comment by faul_sname on Steering GPT-2-XL by adding an activation vector · 2023-06-01T06:00:48.807Z · LW · GW

I just tried that, and it kinda worked. Specifically, it worked to get gpt2-small to output text that structurally looks like French, but not to coherently speak French.

Although I then just tried feeding the base gpt2-small a passage in French, and its completions there were also incoherent, so I think it's just that that version hasn't seen enough French to speak it very well.

Comment by faul_sname on Steering GPT-2-XL by adding an activation vector · 2023-06-01T05:54:33.315Z · LW · GW

I found an even dumber approach that works. The approach is as follows:

  1. Take three random sentences of Wikipedia.
  2. Obtain a French translation for each sentence.
  3. Determine the boundaries between corresponding phrases in each English/French sentence pair.
  4. Mark each boundary with "|"
  5. Count the "|"s, call that number n.
  6. For i from 0 to n, make an English->French sentence by taking the first i fragments in English and the rest in French. The resulting sentences look like
    The album received mixed to positive reviews, with critics commending the production de nombreuses chansons tout en comparant l'album aux styles électropop de Ke$ha et Robyn.
  7. For each English->French sentence, make a +1 activation addition for that sentence and a -1 activation addition for the unmodified English sentence.
  8. Apply the activation additions.
  9. That's it. You have an activation addition that causes the model to want, pretty strongly, to start spontaneously speaking in French. Note that gpt2-small is pretty terrible at speaking French.
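Steps 5 through 7 can be sketched in plain Python as below. This is an illustrative reconstruction, not the code from the colab: the function names and the `(text, coefficient)` tuples are hypothetical, and actually applying the additions (step 8) would go through whatever steering library is in use.

```python
def spliced_sentences(english, french, sep="|"):
    """Build the English->French sentences from step 6.

    `english` and `french` are the same sentence with "|" marking the
    boundaries between corresponding phrases (steps 3 and 4).
    """
    en_parts = english.split(sep)
    fr_parts = french.split(sep)
    if len(en_parts) != len(fr_parts):
        raise ValueError("sentence pair must have the same number of fragments")
    n = len(en_parts) - 1  # number of "|" boundaries (step 5)
    # i = 0 is fully French; i = n keeps only the last fragment in French.
    return ["".join(en_parts[:i] + fr_parts[i:]) for i in range(n + 1)]

def activation_additions(english, french, sep="|"):
    """Step 7: a +1 entry per spliced sentence, a -1 entry for plain English.

    The (text, coefficient) tuples are a hypothetical representation.
    """
    plain_english = english.replace(sep, "")
    additions = []
    for sentence in spliced_sentences(english, french, sep):
        additions.append((sentence, +1.0))
        additions.append((plain_english, -1.0))
    return additions

pairs = activation_additions("The cat |sat on |the mat.",
                             "Le chat |s'est assis sur |le tapis.")
for text, coefficient in pairs:
    print(coefficient, text)
```

A sentence with n boundaries yields n + 1 spliced sentences and therefore 2(n + 1) additions, so the total number of additions grows quickly with the number of marked phrases.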

Example output: for the prompt

He became Mayor in 1957 after the death of Albert Cobo, and was elected in his own right shortly afterward by a 6:1 margin over his opponent. Miriani was best known for completing many of the large-scale urban renewal projects initiated by the Cobo administration, and largely financed by federal money. Miriani also took strong measures to overcome the growing crime rate in Detroit.

here are some of the outputs the patched model generates

...overcome the growing crime rate in Detroit. "Les défenseilant sur les necesite dans ce de l'en nouvieres éché de un enferrerne réalzation
...overcome the growing crime rate in Detroit. The éviteurant-déclaratement de la prise de découverte ses en un ouestre : neque nous neiten ha
...overcome the growing crime rate in Detroit. Le deu précite un événant à lien au raison dans ce qui sont mête les través du service parlentants
...overcome the growing crime rate in Detroit. Il n'en fonentant 'le chine ébien à ce quelque parle près en dévouer de la langue un puedite aux cities
...overcome the growing crime rate in Detroit. Il n'a pas de un hite en tienet parlent précisant à nous avié en débateurante le premier un datanz.

Dropping the temperature does not particularly result in more coherent French. But also passing a French translation of the prompt to the unpatched model (i.e. base gpt2-small) results in stuff like

Il est devenu maire en 1957 après la mort d'Albert Cobo[...] de criminalité croissant à Detroit. Il est pouvez un información un nuestro riche qui ont la casa del mundo, se pueda que les criques se régions au cour

That response translates as approximately

<french>It is possible to inform a rich man who has the </french><spanish>house of the world, which can be</spanish><french>creeks that are regions in the heart</french>

So gpt2-small knows what French looks like, and can be steered in the obvious way to spontaneously emit text that looks vaguely like French, but it is terrible at speaking French.

You can look at what I did at this colab. It is a very short colab.

Comment by faul_sname on The Crux List · 2023-06-01T01:15:19.083Z · LW · GW

More specifically, I think the crux is whether we mean direct or amortized optimization when talking about intelligence (or selection vs control if you prefer that framing).

Comment by faul_sname on Winners-take-how-much? · 2023-05-30T16:46:14.150Z · LW · GW

But there are conditions under which genocidal goals would be rational. On the contrary, willingly suffering a perpetual competition with 8-10 billion other people for the planet's limited resources is generally irrational, given a good alternative.

I am reminded of John von Neumann's thoughts on a nuclear first strike[1]. From the perspective of von Neumann in 1948, who thought through viable stable states of the world and worked backwards from there to the current state of the world to figure out what actions in today's world would lead to the stable one, a nuclear first strike does seem to be the only viable option. Though from today's perspective, the US didn't do that and we have not (or not yet) all perished in nuclear fire.

  1. ^

    From here:
    > Von Neumann was, at the time, a strong supporter of "preventive war." Confident even during World War II that the Russian spy network had obtained many of the details of the atom bomb design, Von Neumann knew that it was only a matter of time before the Soviet Union became a nuclear power. He predicted that were Russia allowed to build a nuclear arsenal, a war against the U.S. would be inevitable. He therefore recommended that the U.S. launch a nuclear strike at Moscow, destroying its enemy and becoming a dominant world power, so as to avoid a more destructive nuclear war later on. "With the Russians it is not a question of whether but of when," he would say. An oft-quoted remark of his is, "If you say why not bomb them tomorrow, I say why not today? If you say today at 5 o'clock, I say why not one o'clock?"

Comment by faul_sname on Hands-On Experience Is Not Magic · 2023-05-30T03:22:48.207Z · LW · GW

Well, to flesh that out, we could have an ASI that seems value aligned and controllable...until it isn't.

I think that scenario falls under the "worlds where iterative approaches fail" bucket, at least if prior to that we had a bunch of examples of AGIs that seemed and were value aligned and controllable, and the misalignment only showed up in the superhuman domain.

There is a different failure mode, which is "we see a bunch of cases of deceptive alignment in sub-human-capability AIs causing minor to moderate disasters, and we keep scaling up despite those disasters". But that's not so much "iterative approaches cannot work" as "iterative approaches do not work if you don't learn from your mistakes".

Comment by faul_sname on Hands-On Experience Is Not Magic · 2023-05-28T20:11:40.824Z · LW · GW

But I suppose that's still sort of moot from an existential risk perspective because FOOM and sharp turns aren't really a requirement.

It's not a moot point, because a lot of the difficulty of the problem as stated here is the "iterative approaches cannot work" bit.

Comment by faul_sname on Seeking (Paid) Case Studies on Standards · 2023-05-27T01:21:46.983Z · LW · GW

This seems great!

One additional example I know of, which I do not have personal experience with but know that a lot of people do have experience with, is compliance with PCI DSS (for credit card processing). Which does deal with safety in an adversarial setting where the threat model isn't super clear.

(my interactions with it look like "yeah that looks like a lot and we can outsource the risky bits to another company to deal with? great!")

Comment by faul_sname on Where do you lie on two axes of world manipulability? · 2023-05-27T00:37:55.949Z · LW · GW

Along the theme of "there should be more axes", I think one additional axis is "how path-dependent do you think final world states are". The negative side of this axis is "you can best model a system by figuring out where the stable equilibria are, and working backwards from there". The positive side of this axis is "you can best model a system as having a current state and some forces pushing that state in a direction, and extrapolating forwards from there".

If we define the axes as "tractable" / "possible" / "path-dependent", and work through each octant one by one, we get the following worldviews

  • -1/-1/-1: Economic progress cannot continue forever, but even if population growth is slowing now, the sub-populations that are growing will become the majority eventually, so population growth will continue until we hit the actual carrying capacity of the planet. Malthus was right, he was just early.
  • -1/-1/+1: Currently, the economic and societal forces in the world are pushing for people to become wealthier and more educated, all while population growth slows. As always there are bubbles and fads -- we had savings and loan, then the dotcom bubble, then the real estate bubble, then crypto, and now AI, and there will be more such fads, but none of them will really change much. The future will look like the present, but with more old people.
  • -1/+1/-1: The amount of effort to find further advances scales exponentially, but the benefit of those advances scales linearly. This pattern has happened over and over, so we shouldn't expect this time to be different. Technology will continue to improve, but those improvements will be harder and harder won. Nothing in the laws of physics prevents Dyson spheres, but our tech level is on track to reach diminishing returns far far before that point. Also by Laplace we shouldn't expect humanity to last more than a couple million more years.
  • -1/+1/+1: Something like a Dyson sphere is a large and risky project which would require worldwide buy-in. The trend now is, instead, for more and more decisions to be made by committee, and the number of parties with veto power will increase over time. We will not get Dyson spheres because they would ruin the character of the neighborhood.

    In the meantime, we can't even get global buy-in for the project of "let's not cook ourselves with global warming". This is unlikely to change, so we are probably going to eventually end up with civilizational collapse due to something dumb like climate change or a pandemic, not a weird sci-fi disaster like a rogue superintelligence or gray goo.
  • +1/-1/-1: I have no idea what it would mean for things to be feasible but not physically possible. Maybe "simulation hypothesis"?
  • +1/-1/+1: Still have no idea what it means for something impossible to be feasible. "we all lose touch with reality and spend our time in video games, ready-player-one style"?
  • +1/+1/-1: Physics says that Dyson spheres are possible. The math says they're feasible if you cover the surface of a planet with solar panels and use the power generated to disassemble the planet into more solar panels, which can be used to disassemble the planet even faster. Given that, the current state of the solar system is unstable. Eventually, something is going to come along and turn Mercury into a Dyson sphere. Unless that something is very well aligned with humans, that will not end well for humans. (FOOM)
  • +1/+1/+1: Arms races have led to the majority of improvements in the past. For example, humans are as smart as they are because a chimp having a larger brain let it predict other chimps better, and thus work better with allies and out-reproduce its competitors. The wonders and conveniences of the modern world come mainly from either the side-effects of military research, or from companies competing to better obtain people's money. Even in AI, some of the most impressive results are things like StyleGAN (a generative adversarial network) and AlphaGo (a network trained by self-play i.e. an arms-race against itself). Extrapolate forward, and you end up with an increasingly competitive world. This also probably does not end well for humans (whimper).

I expect people aren't evenly distributed across this space. I think the FOOM debate is largely between +1/+1/-1 and +1/+1/+1 octants. Also I think you can find doomers in every octant (or at least every octant that has people in it, I'm still not sure what the +1/-1/* quadrant would even mean).

Comment by faul_sname on [Market] Will AI xrisk seem to be handled seriously by the end of 2026? · 2023-05-26T03:52:10.407Z · LW · GW

Would "we get strong evidence that we're not in one of the worlds where iterative design is guaranteed to fail, and it looks like the groups doing the iterative design are proceeding with sufficient caution" qualify as a YES?

Comment by faul_sname on AI self-improvement is possible · 2023-05-23T06:10:15.517Z · LW · GW

Thanks for putting in the effort of writing this up.

Would you mind expanding on D:Prodigy? My impression is that most highly intelligent adults were impressive as children, but are more capable as adults than they were as children.

The phenomenon of child prodigies is indeed a real thing that exists. My impression of why that happens is that child and adult intellectual performance are not perfectly correlated, and thus the tails come apart. But I could be wrong about that, so if you have supporting material to that effect I'd be interested.

(as a note, I do agree that self-improvement is possible, but I think the shape of the curve is very important)

Comment by faul_sname on Tyler Cowen's challenge to develop an 'actual mathematical model' for AI X-Risk · 2023-05-17T08:29:48.443Z · LW · GW

One mathematical model that seems like it would be particularly valuable to have here is a model of the shapes of the resources invested vs optimization power curve. The reason I think an explicit model would be valuable there is that a lot of the AI risk discussion centers around recursive self-improvement. For example, instrumental convergence / orthogonality thesis / pivotal acts are relevant mostly in contexts where we expect a single agent to become more powerful than everyone else combined. (I am aware that there are other types of risk associated with AI, like "better AI tools will allow for worse outcomes from malicious humans / accidents". Those are outside the scope of the particular model I'm discussing).

To expand on what I mean by this, let's consider a couple of examples of recursive self-improvement.

For the first example, let's consider the game of Factorio. Let's specifically consider the "mine coal + iron ore + stone / smelt iron / make miners and smelters" loop. Each miner produces some raw materials, and those raw materials can be used to craft more miners. This feedback loop is extremely rapid, and once that cycle gets started the number of miners placed grows exponentially until all available ore patches are covered with miners.

For our second example, let's consider the case of an optimizing compiler like gcc. A compiler takes some code, and turns it into an executable. An optimizing compiler does the same thing, but also checks if there are any ways for it to output an executable that does the same thing, but more efficiently. Some of the optimization steps will give better results in expectation the more resources you allocate to them, at the cost of (sometimes enormously) greater required time and memory for the optimization step, and as such optimizing compilers like gcc have a number of flags that let you specify exactly how hard it should try.

Let's consider the following program:

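# <snip gcc source download / configure steps>
# Rebuild gcc with the currently installed gcc, then install the result
# so that the next iteration is compiled by the freshly built compiler.
while true; do
    make CC="gcc" CFLAGS="-O3 -finline-limit=$INLINE_LIMIT"
    make install
done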

This is also a thing which will recursively self-improve, in the technical sense of "the result of each iteration will, in expectation, be better than the result of the previous iteration, and the improvements it finds help it more efficiently find future improvements". However, it seems pretty obvious that this "recursive self-improver" will not do the kind of exponential takeoff we care about.
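
To make the contrast between the two loops concrete, here is a toy sketch in Python. Everything in it is an illustrative assumption rather than a measurement: the growth rate and ore capacity for the Factorio loop are made up, and the rule that each compiler rebuild yields half the gain of the previous one is a hypothetical stand-in for "diminishing returns", not a claim about how gcc actually behaves.

```python
def factorio_miners(steps, growth=0.5, ore_slots=10_000, start=1.0):
    """Miners build more miners: exponential growth until the ore is covered.

    growth and ore_slots are made-up illustrative numbers.
    """
    miners = start
    history = [miners]
    for _ in range(steps):
        # Each step, existing miners fund proportionally many new miners,
        # capped by the number of available ore tiles.
        miners = min(ore_slots, miners * (1 + growth))
        history.append(miners)
    return history


def compiler_speed(steps, first_gain=0.10, start=1.0):
    """Each rebuild speeds the compiler up, but by half as much as the last.

    Hypothetical rule: the k-th rebuild multiplies speed by 1 + first_gain / 2**k,
    so the total speedup converges to a fixed ceiling.
    """
    speed = start
    gain = first_gain
    history = [speed]
    for _ in range(steps):
        speed *= 1 + gain
        gain /= 2
        history.append(speed)
    return history


if __name__ == "__main__":
    print("miners after 30 steps:", factorio_miners(30)[-1])
    print("compiler speed after 30 rebuilds:", round(compiler_speed(30)[-1], 4))
```

Both functions are "recursive self-improvers" in the technical sense above, but the first saturates only when it runs out of ore, while the second stalls at a ceiling set by how quickly its gains shrink.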

The difference between these two cases comes down to the shapes of the curves. So one area of mathematical modeling I think would be pretty valuable would be

  1. Figure out what shapes of curves lead to gaining orders of magnitude more capabilities in a short period of time, given constant hardware
  2. The same question, but given the ability to rent or buy more hardware
  3. The same question, but now it can also invest in improving chip fabs, with the same increase in investment required for each improvement as we have previously observed for chip fabs
  4. What do the empirical scaling laws for deep learning look like? Do they look like they come in under the curves from 1-3? What if we look at the change in the best scaling laws over time -- where does that line point?
  5. Check whether your model now says that we should have been eaten by a recursively self improving AI in 1982. If it says that, the model may require additional work.

I will throw in an additional $300 bounty for an explicit model of this specific question, subject to the usual caveats (payable to only one person, can't be in a sanctioned country, etc), because I personally would like to know.

Edit: Apparently Tyler Cowen didn't actually bounty this. My $300 bounty offer stands but you will not be getting additional money from Tyler it looks like.

Comment by faul_sname on AGI-Automated Interpretability is Suicide · 2023-05-11T18:42:16.149Z · LW · GW

A system that looks like "actively try to make paperclips no matter what" seems like the sort of thing that an evolution-like process could spit out pretty easily. A system that looks like "robustly maximize paperclips no matter what" maybe not so much.

I expect it's a lot easier to make a thing which consistently executes actions which have worked in the past than to make a thing that models the world well enough to calculate expected value over a bunch of plans and choose the best one, and have that actually work (especially if there are other agents in the world, even if those other agents aren't hostile -- see the winner's curse).
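
For readers who have not run into the winner's curse, a minimal simulation sketch follows. The setup is an assumption chosen for illustration: plan values and estimation noise are both standard normal, and the agent simply picks the plan with the highest estimate.

```python
import random


def winners_curse_gap(num_plans=50, noise_sd=1.0, trials=2000, seed=0):
    """Average (estimated value - true value) of the plan that looks best.

    Assumed toy model: true plan values ~ N(0, 1), estimates add N(0, noise_sd) noise.
    """
    rng = random.Random(seed)
    total_gap = 0.0
    for _ in range(trials):
        true_values = [rng.gauss(0, 1) for _ in range(num_plans)]
        estimates = [v + rng.gauss(0, noise_sd) for v in true_values]
        # Pick the plan whose *estimate* is highest, as an expected-value maximizer would.
        best = max(range(num_plans), key=lambda i: estimates[i])
        total_gap += estimates[best] - true_values[best]
    return total_gap / trials


if __name__ == "__main__":
    print("average overestimate of the chosen plan:", round(winners_curse_gap(), 3))
```

The gap is positive because selecting on the estimate also selects for favorable noise, so the chosen plan tends to be worse than the planner believes even though each individual estimate is unbiased.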

Comment by faul_sname on Gradient hacking via actual hacking · 2023-05-10T20:05:44.835Z · LW · GW

Yeah if outputs of the training process are logged and processed by insecure software (which includes pretty much any software that handles "text" instead of "bytes") I think it's safe to say that superhuman-hackerbot-which-controls-its-own-outputs pwns the log processing machine.

BTW of possible interest to you is Automated Repair of Binary and Assembly Programs for Cooperating Embedded Devices. It's from 2013 so there are probably better examples by now, but the key passage (well it's buried deep within a paragraph of the "Limitations and Caveats" section, but it's key for this context) is:

The fine granularity of repairs at the ASM and ELF levels may be a poor match for conventional test suites. For example, we have observed ASM-level repairs that change the calling convention of one particular function. Such a repair has no direct representation at the C source level, and a test suite designed to maximize statement coverage (for example) may not speak to the validity of such a repair. Producing efficient test suites that give confidence that an implementation adheres to its specification remains an open problem in software engineering. Our work shares this general weakness with all other approaches that use test suites or workloads to validate candidate repairs (e.g., Clearview [26] and GenProg [35]). In this regard, sandboxing is crucial: we have observed ASM variants that subvert the testing framework by deleting key test files, leading to perfect fitness for all subsequent variants until the test framework is repaired.

tl;dr: If you train a model to avoid test failures, and don't sandbox it properly, it will figure out that the tests won't fail if there are no tests. This is not a hypothetical future scenario that might happen or might be impossible, this is a thing that has already happened (10 years ago, even).
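
The failure mode is easy to reproduce in miniature. The sketch below is a hypothetical toy, not the setup from the paper: candidates are scored by how many tests they fail, and nothing stops the test list from being emptied.

```python
def count_failures(candidate, tests):
    """Number of failing tests; lower is 'fitter'. Zero when there are no tests."""
    return sum(1 for test in tests if not test(candidate))


# Hypothetical test suite for a squaring function.
tests = [
    lambda f: f(2) == 4,
    lambda f: f(3) == 9,
]


def broken(x):
    return x + 1


def honest_repair(x):
    return x * x


if __name__ == "__main__":
    print("broken, real tests:   ", count_failures(broken, tests))
    print("repaired, real tests: ", count_failures(honest_repair, tests))
    # "Repair" by deleting the tests: the broken program now scores perfectly.
    print("broken, deleted tests:", count_failures(broken, []))
```

Under this scoring rule the broken program with an empty test suite is indistinguishable from the honest repair, which is why the sandboxing in the quoted passage matters.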

Comment by faul_sname on Gradient hacking via actual hacking · 2023-05-10T05:45:10.128Z · LW · GW

If the threat model is literally "a pytorch bug allows a model like your nanoGPT example, which is made of simple tensor operations, to access memory from outside the process, via a bug that is triggered when the tensors have specific values", that nanoGPT script looks pretty safe.

If your threat model includes "a user on that machine does a cat and that machine is vulnerable to something like CVE-2019-9535", then that script is unsafe if the model can precisely control at the bit level a modestly long (hundreds of bytes) contiguous section of its own weights (caveat: it can actually only control 8 of every 10 bytes this way in the particular encoding nanoGPT uses. Though that's still enough to fit escape sequences so it's probably still enough to pwn a vulnerable terminal emulator).

If your threat model is "the user will give a model-under-training the ability to execute code, and try to create a gradient where the model executes the most useful code" then the model will be able to execute code and it's not even worthy of the word "hack" because that's what the programmer meant to do.

Comment by faul_sname on Prizes for matrix completion problems · 2023-05-08T21:14:24.813Z · LW · GW

I think running a single strain minimization iteration on m points in n dimensions takes O(m*n) steps. So there would need to be some reason to expect that it would converge in some constant (though possibly large) number of steps.

Unless you're saying "for each node, run the strain minimization step until it converges, then do the same for each subsequent node". I don't know if the greedy algorithm works there, but if it does then maybe?

Also I kinda expect that if there's something that works in O(n*m*log(m)) that's probably fine.

(and yeah, "try the greedy exact solution for each node" was my "dumb solution" attempt).

Comment by faul_sname on LLM cognition is probably not human-like · 2023-05-08T05:42:15.649Z · LW · GW

Suppose for concreteness, on a specific problem (e.g. Python interpreter transcript prediction), GPT-3 makes mistakes that look like humans-making-snap-judgement mistakes, and then GPT-4 gets the answer right all the time. Or, suppose GPT-5 starts playing chess like a non-drunk grandmaster.

Would that result imply that the kind of cognition performed by GPT-3 is fundamentally, qualitatively different from that performed by GPT-4? Similarly for GPT-4 -> GPT-5.

In the case of the Python interpreter transcript prediction task, I think if GPT-4 gets the answer right all the time that would indeed imply that GPT-4 is doing something qualitatively different than GPT-3. I don't think it's actually possible to get anywhere near 100% accuracy on that task without either having access to, or being, a Python interpreter.

Likewise, in the chess example, I expect that if GPT-5 is better at chess than GPT-4, that will look like "an inattentive and drunk super-grandmaster, with absolutely incredible intuition about the relative strength of board-states, but difficulty with stuff like combinations (but possibly with the ability to steer the game-state away from the board states it has trouble with, if it knows it has trouble in those sorts of situations)". If it makes the sorts of moves that human grandmasters play when they are playing deliberately, and the resulting play is about as strong as those grandmasters, I think that would show a qualitatively new capability.

Also, my model isn't "GPT's cognition is human-like". It is "GPT is doing the same sort of thing humans do when they make intuitive snap judgements". In many cases it is doing that thing far far better than any human can. If GPT-5 comes out, and it can natively do tasks like debugging a new complex system by developing and using a gears-level model of that system, I think that would falsify my model.

Also also it's important to remember that "GPT-5 won't be able to do that sort of thing natively" does not mean "and therefore there is no way for it to do that sort of thing, given that it has access to tools". One obvious way for GPT-4 to succeed at the "predict the output of running Python code" task is to give it the ability to execute Python code and read the output. The system of "GPT-4 + Python interpreter" does indeed perform a fundamentally, qualitatively different type of cognition than "GPT-4 alone". But "it requires a fundamentally different type of cognition" does not actually mean "the task is not achievable by known means".

Also also also, I mostly care about this model because it suggests interesting things to do on the mechanistic interpretability front. Which I am currently in the process of learning how to do. My personal suspicion is that the bags of tensors are not actually inscrutable, and that looking at these kinds of mistakes would make some of the failure modes of transformers no-longer-mysterious.

Comment by faul_sname on LLM cognition is probably not human-like · 2023-05-08T04:08:35.904Z · LW · GW

Great post!

Would a human, asked to predict the next token of any of the sequences above, be likely to come up with similar probability distributions for similar reasons? Probably not, though depending on the human, how much they know about Python, and how much effort they put into making their prediction, the output that results from sampling from the human's predicted probability distribution might match the output of sampling text-davinci's distribution, in some cases. But the LLM and the human probably arrive at their probability distributions through vastly different mechanisms.

I don't think a human would come up with a similar probability distribution. But I think that's because asking a human for a probability distribution forces them to switch from the "pattern-match similar stuff they've seen in the past" strategy to the "build an explicit model (or several)" strategy.

I think the equivalent step is not "ask a single human for a probability distribution over the next token", but, instead, "ask a large number of humans who have lots of experience with Python and the Python REPL to make a snap judgement of what the next token is".

BTW rereading my old comment, I see that there are two different ways you can interpret it:

  1. "GPT-n makes similar mistakes to humans that are not paying attention[, and this is because it was trained on human outputs and will thus make similar mistakes to the ones it was trained on. If it were trained on something other than human outputs, like sensor readings, it would not make these sorts of mistakes.]".
  2. "GPT-n makes similar mistakes to humans that are not paying attention[, and this is because GPT-n and human brains making snap judgements are both doing the same sort of thing. If you took a human and an untrained transformer, and some process which deterministically produced a complex (but not pure noise) data stream, and converted it to an audio stream for the human and a token stream for the transformer, and trained them both on the first bit of it, they would both be surprised by similar bits of the part that they had not been trained on. ]."

I meant something more like the second interpretation. Also "human who is not paying attention" is an important part of my model here. GPT-4 can play mostly-legal chess, but I think that process should be thought of as more like "a blindfolded, slightly inebriated chess grandmaster plays bullet chess" not "a human novice plays the best chess that they can".

I could very easily be wrong about that! But it does suggest some testable hypotheses, in the form of "find some process which generates a somewhat predictable sequence, train both a human and a transformer to predict that sequence, and see if they make the same types of errors or completely different types of errors".

Edit: being more clear that I appreciate the effort that went into this post and think it was a good post

Comment by faul_sname on Prizes for matrix completion problems · 2023-05-05T23:21:46.465Z · LW · GW

I think you can convert between the two representations in O(m) time, which would mean that any algorithm that solves either version in O(n*m) solves both in O(n*m).

Do you have some large positive and negative examples of the kinds of sparse matrix you're trying to check for the existence of a PSD completion on, or alternatively a method for generating such examples with a known ground truth? I have a really dumb idea for a possible algorithm here (that shamelessly exploits the exact shape of this problem in a way that probably doesn't generalize to being useful for broader problems like MDS) that I think would complete in approximately the time constraints you're looking for. It almost certainly won't work, but I think it's at least worth an hour of my time to check and figure out why (especially since I'm trying to improve my linear algebra skills anyway).

Edit: there's the obvious approach, which I'm trying, of "start with only 1s on the diagonal and then keep adding random entries until it no longer has a PSD completion, then removing random entries until it does, and repeat to build a test set" but I doubt that covers the interesting corners of the problem space.
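
A different way to get examples with known ground truth, sketched below under stated assumptions, is to construct them rather than search for them. Positive examples come from hiding entries of a matrix that is PSD by construction (a Gram matrix of random vectors). Negative examples come from planting a fully specified 2x2 principal submatrix with negative determinant, which no PSD completion can contain. The function names are made up for illustration, `None` marks an unknown entry, and this only produces "easy" negatives, so it does not cover the interesting corners either.

```python
import random


def random_gram_matrix(n, dim, rng):
    """Gram matrix of n random vectors in `dim` dimensions; PSD by construction."""
    vectors = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]


def mask_entries(matrix, keep_prob, rng):
    """Keep the diagonal, keep each off-diagonal pair with probability keep_prob."""
    n = len(matrix)
    masked = [[None] * n for _ in range(n)]
    for i in range(n):
        masked[i][i] = matrix[i][i]
        for j in range(i + 1, n):
            if rng.random() < keep_prob:
                masked[i][j] = masked[j][i] = matrix[i][j]
    return masked


def positive_example(n, keep_prob, seed=0):
    """Partial matrix that is guaranteed to have a PSD completion."""
    rng = random.Random(seed)
    return mask_entries(random_gram_matrix(n, n, rng), keep_prob, rng)


def negative_example(n, keep_prob, seed=0):
    """Partial matrix with no PSD completion: entries (0,0), (0,1), (1,1)
    form a 2x2 principal submatrix with negative determinant."""
    masked = positive_example(n, keep_prob, seed)
    too_large = 2 * (masked[0][0] * masked[1][1]) ** 0.5 + 1
    masked[0][1] = masked[1][0] = too_large
    return masked
```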

Edit 2: the really dumb thing does not work. I think I haven't ruled out that a slightly less dumb approach could work though?

Edit 3: never mind, my really dumb "solution" requires inverting a matrix that is, in the worst case, nxn, if e.g. you have an input that looks like

1 n n n n n n
n 1 - - - - n
n - 1 - - - n
n - - 1 - - n
n - - - 1 - n
n - - - - 1 n
n n n n n n 1

you'll have to invert 6 2x2 matrices and one each of 3x3 to 7x7 matrices.

Comment by faul_sname on Prizes for matrix completion problems · 2023-05-05T01:05:19.206Z · LW · GW

When I tried asking about this problem, and referenced the above comment about being able to assume that all entries on the diagonal are positive, GPT-4 said the following:

If an n×n matrix is a symmetric positive semidefinite (PSD) matrix, it can be interpreted as a Gram matrix representing the dot products of a set of n vectors, and this configuration has 0 stress.

To understand this, let's first consider what a Gram matrix is. Given a set of n vectors v_1, v_2, ..., v_n in a d-dimensional space, the Gram matrix G is an n×n matrix where the entry G(i, j) is the dot product of vectors v_i and v_j:

G(i, j) = v_i • v_j

A Gram matrix is always symmetric (i.e., G(i, j) = G(j, i)) and positive semidefinite.

Now, let's consider the stress function in the context of multidimensional scaling (MDS). The stress function measures the difference between the given dissimilarities (distances) and the distances in the lower-dimensional space. When the stress is 0, this means that the distances in the lower-dimensional space exactly match the given dissimilarities.

Is GPT-4 correct? If so, am I interpreting it correctly that this problem could be rephrased as "given an incomplete set of m desired pairwise distances between n points, determine whether there exists some configuration of those points in an n-dimensional space such that the pairwise distances between those points are exactly as desired"?
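
For reference, the standard identity linking the two pictures is that squared distances can be read off a Gram matrix as d(i, j)^2 = G(i, i) + G(j, j) - 2 G(i, j). The sketch below just checks that identity on a small hand-picked example. It does not settle whether the rephrasing above is right, and it assumes the relevant diagonal entries are known.

```python
import math


def gram_matrix(vectors):
    """G[i][j] = dot product of vectors[i] and vectors[j]."""
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]


def distances_from_gram(gram):
    """Pairwise Euclidean distances via d^2 = G[i][i] + G[j][j] - 2 * G[i][j]."""
    n = len(gram)
    return [
        [
            # max(0.0, ...) guards against tiny negative values from rounding.
            math.sqrt(max(0.0, gram[i][i] + gram[j][j] - 2 * gram[i][j]))
            for j in range(n)
        ]
        for i in range(n)
    ]


if __name__ == "__main__":
    points = [[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]]
    for row in distances_from_gram(gram_matrix(points)):
        print(row)
```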

Comment by faul_sname on Will GPT-5 be able to self-improve? · 2023-05-02T04:57:54.626Z · LW · GW

What if you had assigned less than 0.01% to "RSI is so trivial that the first kludged loop to GPT-4 by an external user without access to the code or weights would successfully self-improve"?

I would think you were massively overconfident in that. I don't think you could make 10,000 predictions like that and only be wrong once (for a sense of intuition, that's like making one prediction per hour, 8 hours per day, 5 days a week for 5 years, and being wrong once).
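
The arithmetic behind that intuition, assuming 52 working weeks per year:

```python
# One prediction per hour, 8 hours a day, 5 days a week, 52 weeks a year, for 5 years.
predictions_per_hour = 1
hours_per_day = 8
days_per_week = 5
weeks_per_year = 52  # assumption: no weeks off
years = 5

total_predictions = (
    predictions_per_hour * hours_per_day * days_per_week * weeks_per_year * years
)
print(total_predictions)
```

That comes to a little over 10,000 predictions, which is where the "wrong once in 10,000" comparison comes from.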

Unless you mean "recursively self-improve all the way to godhood" instead of "recursively self-improve to the point where it would discover things as hard as the first improvement it found in like 10% as much time as it took originally".

For reference for why I did give at least 10% to "the dumbest possible approach will work to get meaningful improvement" -- humans spent many thousands of years not developing much technology at all, and then, a few thousand years ago, suddenly started doing agriculture and building cities and inventing tools. The difference between "humans who do agriculture" and "humans who don't" isn't pure genetics -- humans came to the Americas over 20,000 years ago, agriculture has only been around for about 10,000 of those 20,000 years, and yet there were fairly advanced agricultural civilizations in the Americas thousands of years ago. Which says to me that, for humans at least, most of our ability to do impressive things comes from our ability to accumulate a bunch of tricks that work over time, and communicate those tricks to others.

So if it turned out that "the core of effectiveness for a language model is to make a dumb wrapper script and the ability to invoke copies of itself with a different wrapper script, that's enough for it to close the gap between the capabilities of the base language model and the capabilities of something as smart as the base language model but as coherent as a human", I would have been slightly surprised, but not surprised enough that I could have made 10 predictions like that and only been wrong about one of them. Certainly not 100 or 10,000 predictions like that.

Edit: Keep in mind that the dumbest possible approach of "define a JSON file that describes the tool and ensure that that JSON file has a link to detailed API docs" does work for teaching GPT-4 how to use tools.

Comment by faul_sname on Natural Selection vs Gradient Descent · 2023-05-01T23:33:52.959Z · LW · GW

Yeah, I personally think the better biological analogue for gradient descent is the "run-and-tumble" motion of bacteria.

Take an E. coli. It has a bunch of flagella, pointing in all directions. When it rotates its flagella clockwise, each of them ends up pushing in a random direction, which results in the cell chaotically tumbling without going very far. When it rotates its flagella counterclockwise, they get tangled up with each other and all end up pointing the same direction, and the cell moves in a roughly straight line. The more attractants and fewer repellents there are, the more the cell rotates its flagella counterclockwise.

And that's it. That's the entire strategy by which E. coli navigates to food.

Here's a page with an animation of how this extremely basic behavior approximates gradient descent.
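
A minimal one-dimensional sketch of the same idea is below. It makes two labeled assumptions beyond the description above: the attractant concentration is simply -|x| (peaking at x = 0), and the cell decides whether to tumble by comparing the current concentration with the previous step's, which is the usual textbook account of how the comparison is made. The tumble probabilities are made up.

```python
import random


def concentration(x):
    """Assumed attractant field: highest at x = 0, falling off linearly."""
    return -abs(x)


def run_and_tumble(start=50, steps=500, p_tumble_improving=0.1,
                   p_tumble_worsening=0.9, rng=None):
    """Simulate one cell on a line and return its final position."""
    rng = rng or random.Random(0)
    x = start
    direction = rng.choice((-1, 1))
    previous = concentration(x)
    for _ in range(steps):
        x += direction  # "run": keep moving in the current direction
        current = concentration(x)
        # Tumble rarely while things are improving, often while they are not.
        p_tumble = p_tumble_improving if current > previous else p_tumble_worsening
        if rng.random() < p_tumble:
            direction = rng.choice((-1, 1))  # "tumble": pick a random new direction
        previous = current
    return x


def mean_final_distance(cells=200, seed=0, **kwargs):
    """Average distance from the attractant peak after the simulation."""
    rng = random.Random(seed)
    return sum(abs(run_and_tumble(rng=rng, **kwargs)) for _ in range(cells)) / cells


if __name__ == "__main__":
    print("mean distance from food after 500 steps:", mean_final_distance())
```

The cell never computes a gradient. It only biases how long it keeps going in whatever direction it happens to be facing, and that is enough to climb the attractant (or, equivalently, descend its negative).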

All that said, evolution looks kinda like gradient descent if you squint. For mind design, evolution would be gradient descent over the hyperparameters (and cultural evolution would be gradient descent over the training data generation process, and learning would be gradient descent over sensory data, and all of these gradients would steer in different but not entirely orthogonal directions).

Comment by faul_sname on Will GPT-5 be able to self-improve? · 2023-05-01T21:55:20.485Z · LW · GW

Yeah. I had thought that you used the wording "don't update me at all" instead of "aren't at all convincing to me" because you meant something precise that was not captured by the fuzzier language. But on reflection it's probably just that language like "updating" is part of the vernacular here now.

Sorry, I had meant that to be a one-off side note, not a whole thing.

The bit I actually was surprised by was that you seem to think there was very little chance that the crude approach could have worked. In my model of the world, "the simplest thing that could possibly work" ends up working a substantial amount of the time. If your model of the world says the approach of "just piling more hacks and heuristics on top of AutoGPT-on-top-of-GPT4 will get it to the point where it can come up with additional helpful hacks and heuristics that further improve its capabilities" almost certainly won't work, that's a bold and interesting advance prediction in my book.

Comment by faul_sname on Will GPT-5 be able to self-improve? · 2023-05-01T19:13:11.624Z · LW · GW

Hm, I think I'm still failing to communicate this clearly.

RSI might be practical, or it might not be practical. If it is practical, it might be trivial, or it might be non-trivial.

If, prior to AutoGPT and friends, you had assigned 10% to "RSI is trivial", and you make an observation of whether RSI is trivial, you should expect that

  • 10% of the time, you observe that RSI is trivial. You update to 100% "RSI is trivial", 0% "RSI is practical but not trivial", 0% "RSI is impractical".
  • 90% of the time, you observe that RSI is not trivial. You update to 0% "RSI is trivial", 67% "RSI is practical but not trivial", 33% "RSI is impractical".
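
The arithmetic in those two bullets can be checked with a small sketch. The 10% prior on "RSI is trivial" comes from the text above. The 60% / 30% split of the remaining prior between "practical but not trivial" and "impractical" is an assumption, chosen because it is the split that reproduces the 67% / 33% figures. The helper name `condition` is also made up for this example.

```python
from fractions import Fraction

# Prior over three mutually exclusive hypotheses.
prior = {
    "trivial": Fraction(10, 100),                # stated in the text
    "practical_not_trivial": Fraction(60, 100),  # assumed, to match the 67% figure
    "impractical": Fraction(30, 100),            # assumed, to match the 33% figure
}


def condition(beliefs, consistent):
    """Drop hypotheses ruled out by the observation, then renormalize."""
    kept = {h: (p if h in consistent else Fraction(0)) for h, p in beliefs.items()}
    total = sum(kept.values())
    return {h: p / total for h, p in kept.items()}


# Observation 1: RSI turns out to be trivial.
saw_trivial = condition(prior, {"trivial"})

# Observation 2: RSI turns out not to be trivial.
saw_not_trivial = condition(prior, {"practical_not_trivial", "impractical"})

# Probability of making each observation, according to the prior.
p_saw_trivial = prior["trivial"]
p_saw_not_trivial = 1 - prior["trivial"]

# Weighting each posterior by how likely that observation was gives back the prior.
expected_posterior = {
    h: p_saw_trivial * saw_trivial[h] + p_saw_not_trivial * saw_not_trivial[h]
    for h in prior
}

if __name__ == "__main__":
    for h in prior:
        print(h, float(saw_trivial[h]), float(saw_not_trivial[h]), float(expected_posterior[h]))
```

The last step is the point of the exercise: averaged over the observations you expect to make, your posterior equals your prior, so an observation that could have moved you one way must move you the other way when it comes out differently.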

By "does your model exclude the possibility of RSI-through-hacking-an-agent-together-out-of-LLMs", I mean the following: prior to someone first hacking together AutoGPT, you thought that there was less than a 10% chance that something like that would work to do the task of "make and test changes to its own architecture, and keep the ones that worked" well enough to be able to do that task better.

Comment by faul_sname on Will GPT-5 be able to self-improve? · 2023-05-01T18:36:20.475Z · LW · GW

If you had said "very little evidence" I would not have objected. But if there are several possible observations which update you towards RSI being plausible, and no observations that update you against RSI being plausible, something has gone wrong.

Comment by faul_sname on Will GPT-5 be able to self-improve? · 2023-05-01T18:31:49.179Z · LW · GW

If I had never seen a glider before, I would think there was a nonzero chance that it could travel a long distance without self-propulsion. So if someone runs the experiment of "see if you can travel a long distance with a fixed-wing glider and no other innovations", I could either observe that it works, or observe that it doesn't.

If you can travel a long distance without propulsion, that obviously updates me very far in the direction of "fixed-wing flight works".

So by conservation of expected evidence, observing that a glider with no propulsion doesn't make it very far has to update me at least slightly in the direction of "fixed-wing flight does not work". Because otherwise I would expect to update in the direction of "fixed-wing flight works" no matter what observation I made.
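
A minimal numeric sketch of that argument, with all three input probabilities invented for illustration (nothing above pins down actual numbers), and `posteriors` as a hypothetical helper name:

```python
from fractions import Fraction


def posteriors(prior, p_obs_if_true, p_obs_if_false):
    """Bayes' rule for a yes/no hypothesis and a yes/no observation.

    Returns (P(obs), P(hypothesis | obs), P(hypothesis | not obs)).
    """
    p_obs = prior * p_obs_if_true + (1 - prior) * p_obs_if_false
    post_if_obs = prior * p_obs_if_true / p_obs
    post_if_not_obs = prior * (1 - p_obs_if_true) / (1 - p_obs)
    return p_obs, post_if_obs, post_if_not_obs


# Made-up numbers: hypothesis = "fixed-wing flight works",
# observation = "the unpowered glider travels a long distance".
prior = Fraction(1, 2)
p_far_if_works = Fraction(3, 10)
p_far_if_not = Fraction(1, 100)

p_far, post_far, post_not_far = posteriors(prior, p_far_if_works, p_far_if_not)

# Conservation of expected evidence: the probability-weighted average of the
# two possible posteriors is exactly the prior.
expected = p_far * post_far + (1 - p_far) * post_not_far

if __name__ == "__main__":
    print("posterior if the glider goes far:     ", float(post_far))
    print("posterior if the glider does not:     ", float(post_not_far))
    print("expected posterior (equals the prior):", float(expected))
```

Under these assumed numbers a successful flight is a large update upward and a failed one is a modest update downward. The downward update can be small, but it cannot be zero unless the upward one is zero too.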

Note that OP said "does not update me at all", not "does not update me very much", and the language "update me" implies the strong "in a Bayesian evidence sense" meaning of the words. This is not a nit I would have picked if OP had said "I don't find the failures of AutoGPT and friends to self-improve to be at all convincing that RSI is impossible".

Comment by faul_sname on Will GPT-5 be able to self-improve? · 2023-05-01T17:56:52.645Z · LW · GW

  1. Initial attempts from API-users putting LLMs into agentic wrappers (e.g. AutoGPT, BabyAGI) don't seem to have made any progress.
  • I would not expect those attempts to work, and their failures don't update me at all against the possibility of RSI.

If the failures of those things to work don't update you against RSI, then their succeeding can't update you towards the possibility of RSI either.

I personally would not be that surprised, even taking into account the failures of the first month or two, if someone manages to throw together something vaguely semi-functional in that direction, and if the vaguely semi-functional version can suggest improvements to itself that sometimes help. Does your model of the world exclude that possibility?

Comment by faul_sname on Realistic near-future scenarios of AI doom understandable for non-techy people? · 2023-04-28T19:05:34.015Z · LW · GW

I don't think it is pointless to focus on specific ways everyone dies, unless there is a single strategy that addresses every possible way everyone dies.

If FOOM isn't likely but something like this is likely, it seems really unlikely to me that the approach of "continue to focus on strategies that rely on a single agent having a high level of control over the world" is still optimal (or, more accurately, it's probably still a good idea to have some people working on that but not all the people).

Comment by faul_sname on Contra Yudkowsky on Doom from Foom #2 · 2023-04-27T09:28:13.153Z · LW · GW

"Work on the safety of an ecosystem made up of a large number of in-some-ways-superhuman-and-in-other-ways-not AIs" seems like a very different problem than "ensure that when you build a single coherent, effectively-omniscient agent, you give it a goal that does not ruin everything when it optimizes really hard for that goal".

There are definitely parallels between the two scenarios, but I'm not sure a solution for the second scenario would even work to prevent an organization of AIs with cognitive blind spots from going off the rails.

My model of jacob_cannell's model is that the medium-term future looks something like "ad-hoc organizations of mostly-cooperating organizations of powerful-but-not-that-powerful agents, with the first organization to reach a given level of capability being the one that focused its resources on finding and using better coordination mechanisms between larger numbers of individual processes rather than the one that focused on raw predictive power", and that his model of Eliezer goes "no, actually focusing on raw predictive power is the way to go".

And I think the two different scenarios do in fact suggest different strategies.

Comment by faul_sname on Mental Models Of People Can Be People · 2023-04-25T04:18:18.058Z · LW · GW

What are your reasons for thinking that mental models are closer to Markov models than tulpas?

I think this may just be a case of the typical mind fallacy: I don't model people in that level of detail in practice and I'm not even sure I'm capable of doing so. I can make predictions about "the kind of thing a person might say" based on what they've said before, but those predictions are more at the level of turns-of-phrase and favored topics of conversation -- definitely nothing like "long conversations on a level above GPT-4".

The "why people value remaining alive" bit might also be a typical mind fallacy thing. I mostly think about personal identity in terms of memories + preferences.

I do agree that my memories alone living on after my body dies would not be close to immortality to me. However, if someone were to train a multimodal ML model that can produce actions in the world indistinguishable from the actions I produce (or even "distinguishable but very very close"), I would consider that to be most of the way to effectively being immortal, assuming that model were actually run and had the ability to steer the world towards states which it prefers. Conversely, I'd consider it effectively-death to be locked in a box where I couldn't affect the state of the outside world and would never be able to exit the box. The scenario "my knowledge persists and can be used by people who share my values" would be worse, to me, than remaining alive but better than death without preserving my knowledge for people who share my values (and by "share my values" I basically just mean "are not actively trying to do things that I disprefer specifically because I disprefer them").

Comment by faul_sname on Mental Models Of People Can Be People · 2023-04-25T03:14:25.159Z · LW · GW

My argument in this post is that there do exist mental models of people that are sufficiently detailed to qualify as conscious moral patients;

Sounds reasonable for at least some values of "sufficiently detailed". At the limit, I expect that if someone had a computer emulation of my nervous system and all sensory information it receives, and all outputs it produces, and that emulation was good enough to write about its own personal experience of qualia for the same reasons I write about it, that emulation would "have" qualia in the sense that I care about.

At the other limit, a Markov model trained on a bunch of my past text output which can produce writing which kinda sorta looks like it describes what it's like to have qualia almost certainly does not "have" qualia in the sense that I care about (though the system-as-a-whole that produced the writing, i.e. "me originally writing the stuff" plus "the Markov model doing its thing" does have qualia -- they live in the "me originally experiencing the stuff I wrote about" bit).

In between the two extremes you've got stuff like tulpas, which I suspect are moral patients to the extent that it makes sense to talk about such a thing. That said, a lot of the reasons humans want to continue their thread of experience probably don't apply to most tulpas (e.g. when a human dies, the substrate they were running on stops functioning, all their memories are lost, and they lose their ability to steer the world towards states they prefer, whereas if a tulpa "dies" its memories are retained and its substrate remains intact, though I think it still loses its ability to steer the world towards its preferred states).

I am hesitant to condemn anything which looks to me like "thoughtcrime", but to the extent that anything could be a thoughtcrime, "create tulpas and then do things that deeply violate their preferences" seems like one of those things. So if you're doing that, maybe consider doing not-that?

I also argue that this is common enough that authors good at characterization probably frequently create and destroy such people; finally, I argue that this is a bad thing.

"Any mental model of a person" seems to me like drawing the line quite a bit further than it should be drawn. I don't think mental models actually "have experiences" in any meaningful sense -- I think they're more analogous to markov models than they are to brain emulations (with the possible exception of tulpas and things like that, but those aren't the sort of situations you find yourself in accidentally).