Posts

Regression To The Mean [Draft][Request for Feedback] 2012-06-22T17:55:51.917Z
The Dark Arts: A Beginner's Guide 2012-01-21T07:05:05.264Z
What would you do with a financial safety net? 2012-01-16T23:38:18.978Z

Comments

Comment by faul_sname on Yudkowsky vs Hanson on FOOM: Whose Predictions Were Better? · 2023-06-03T04:12:30.300Z · LW · GW

Let's say we have a language model that only knows how to speak English and a second one that only knows how to speak Japanese. Is your expectation that there would be no way to glue these two LLMs together to build an English-to-Japanese translator such that training the "glue" takes <1% of the compute used to train the independent models?

I weakly expect the opposite, largely based on stuff like this, and based on playing around with using algebraic value editing to get an LLM to output French in response to English (but also note that the LLM I did that with knew English and the general shape of what French looks like, so there's no guarantee that result scales or would transfer the way I'm imagining).

Comment by faul_sname on The Crux List · 2023-06-01T17:54:20.623Z · LW · GW

I think we also care about how fast it gets arbitrarily capable. Consider a system which finds an approach that can measure approximate actions-in-the-world-Elo (where an entity with a 200-point advantage in actions-in-the-world-Elo will choose the better action 76% of the time), but which uses a "mutate and test" method over an exponentially large space, such that each successive 100-point gain takes 5x as long to find, and it starts out with an actions-in-the-world-Elo 1000 points lower than an average human and a 1-week time-to-next-improvement. That hypothetical system is technically a recursively self-improving intelligence that will eventually reach any point of capability, but it's not really one we need to worry about much unless it finds techniques to dramatically reduce the search space.
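
For reference, the 76% figure above comes from the standard Elo expected-score formula applied to a 200-point gap:

def elo_win_prob(advantage):
    # Standard Elo expected score for the side with the rating advantage.
    return 1 / (1 + 10 ** (-advantage / 400))

print(round(elo_win_prob(200), 2))  # 0.76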

Like I suspect that GPT-4 is not actually very far from the ability to come up with a fine-tuning strategy for any task you care to give it, and to create a simple directory of fine-tuned models, and to create a prompt which describes to it how to use that directory of fine-tuned models. But fine-tuning seems to take an exponential increase in data for each linear increase in performance, so that's still not a terribly threatening "AGI".

Comment by faul_sname on Steering GPT-2-XL by adding an activation vector · 2023-06-01T06:00:48.807Z · LW · GW

I just tried that, and it kinda worked. Specifically, it worked to get gpt2-small to output text that structurally looks like French, but not to coherently speak French.

Although I then just tried feeding the base gpt2-small a passage in French, and its completions there were also incoherent, so I think it's just that that version hasn't seen enough French to speak it very well.

Comment by faul_sname on Steering GPT-2-XL by adding an activation vector · 2023-06-01T05:54:33.315Z · LW · GW

I found an even dumber approach that works. The approach is as follows:

  1. Take three random sentences of Wikipedia.
  2. Obtain a French translation for each sentence.
  3. Determine the boundaries of corresponding phrases in each English/French sentence pair.
  4. Mark each boundary with "|"
  5. Count the "|"s, call that number n.
  6. For i from 0 to n, make an English->French sentence by taking the first i fragments in English and the rest in French. The resulting sentences look like
    The album received mixed to positive reviews, with critics commending the production de nombreuses chansons tout en comparant l'album aux styles électropop de Ke$ha et Robyn.
  7. For each English->French sentence, make a +1 activation addition for that sentence and a -1 activation addition for the unmodified English sentence.
  8. Apply the activation additions.
  9. That's it. You have an activation addition that causes the model to want, pretty strongly, to start spontaneously speaking in French. Note that gpt2-small is pretty terrible at speaking French.

Example output: for the prompt

He became Mayor in 1957 after the death of Albert Cobo, and was elected in his own right shortly afterward by a 6:1 margin over his opponent. Miriani was best known for completing many of the large-scale urban renewal projects initiated by the Cobo administration, and largely financed by federal money. Miriani also took strong measures to overcome the growing crime rate in Detroit.

here are some of the outputs the patched model generates

...overcome the growing crime rate in Detroit. "Les défenseilant sur les necesite dans ce de l'en nouvieres éché de un enferrerne réalzation
...overcome the growing crime rate in Detroit. The éviteurant-déclaratement de la prise de découverte ses en un ouestre : neque nous neiten ha
...overcome the growing crime rate in Detroit. Le deu précite un événant à lien au raison dans ce qui sont mête les través du service parlentants
...overcome the growing crime rate in Detroit. Il n'en fonentant 'le chine ébien à ce quelque parle près en dévouer de la langue un puedite aux cities
...overcome the growing crime rate in Detroit. Il n'a pas de un hite en tienet parlent précisant à nous avié en débateurante le premier un datanz.

Dropping the temperature does not particularly result in more coherent French. But also passing a French translation of the prompt to the unpatched model (i.e. base gpt2-small) results in stuff like

Il est devenu maire en 1957 après la mort d'Albert Cobo[...] de criminalité croissant à Detroit. Il est pouvez un información un nuestro riche qui ont la casa del mundo, se pueda que les criques se régions au cour

That response translates as approximately

<french>It is possible to inform a rich man who has the </french><spanish>house of the world, which can be</spanish><french>creeks that are regions in the heart</french>

So gpt2-small knows what French looks like, and can be steered in the obvious way to spontaneously emit text that looks vaguely like French, but it is terrible at speaking French.

You can look at what I did at this colab. It is a very short colab.
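
For anyone who doesn't want to open the colab, here is a rough sketch of what steps 7-9 can look like in code. This is not the actual colab: it uses TransformerLens, the injection layer and the position-averaged steering vector are simplifications I chose for brevity, and the sentence pair is illustrative rather than one of the real Wikipedia examples.

import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")   # gpt2-small
LAYER = 6                                           # arbitrary choice of injection layer
HOOK_NAME = f"blocks.{LAYER}.hook_resid_pre"

# One (mixed English->French, pure-English) pair; the real recipe uses one pair
# per "|" boundary across three sentences.
pairs = [
    ("The album received mixed to positive reviews, with critics commending la production de nombreuses chansons",
     "The album received mixed to positive reviews, with critics commending the production of many songs"),
]

def resid(prompt):
    # Residual stream activations at the chosen layer, shape (1, n_tokens, d_model).
    _, cache = model.run_with_cache(prompt)
    return cache[HOOK_NAME]

# +1 * mixed-language activations, -1 * English-only activations, averaged over
# token positions so a single vector can be added at every position at generation time.
diffs = []
for plus, minus in pairs:
    a, b = resid(plus), resid(minus)
    n = min(a.shape[1], b.shape[1])
    diffs.append((a[:, :n] - b[:, :n]).mean(dim=1))
steering_vec = torch.stack(diffs).mean(dim=0).detach()   # (1, d_model)

def add_steering(resid_pre, hook):
    return resid_pre + steering_vec                      # broadcasts over all positions

prompt = "Miriani also took strong measures to overcome the growing crime rate in Detroit."
with model.hooks(fwd_hooks=[(HOOK_NAME, add_steering)]):
    print(model.generate(prompt, max_new_tokens=40))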

Comment by faul_sname on The Crux List · 2023-06-01T01:15:19.083Z · LW · GW

More specifically, I think the crux is whether we mean direct or amortized optimization when talking about intelligence (or selection vs control if you prefer that framing).

Comment by faul_sname on Winners-take-how-much? · 2023-05-30T16:46:14.150Z · LW · GW

> But there are conditions under which genocidal goals would be rational. On the contrary, willingly suffering a perpetual competition with 8-10 billion other people for the planet's limited resources is generally irrational, given a good alternative.

I am reminded of John von Neumann's thoughts on a nuclear first strike[1]. From the perspective of von Neumann in 1948, who thought through the viable stable states of the world and worked backwards from there to the world as it then was, to figure out which actions would lead to the stable state, a nuclear first strike does seem to be the only viable option. Though from today's perspective, the US didn't do that and we have not (or not yet) all perished in nuclear fire.

  1. ^

    From here:
    > Von Neumann was, at the time, a strong supporter of "preventive war." Confident even during World War II that the Russian spy network had obtained many of the details of the atom bomb design, Von Neumann knew that it was only a matter of time before the Soviet Union became a nuclear power. He predicted that were Russia allowed to build a nuclear arsenal, a war against the U.S. would be inevitable. He therefore recommended that the U.S. launch a nuclear strike at Moscow, destroying its enemy and becoming a dominant world power, so as to avoid a more destructive nuclear war later on. "With the Russians it is not a question of whether but of when," he would say. An oft-quoted remark of his is, "If you say why not bomb them tomorrow, I say why not today? If you say today at 5 o'clock, I say why not one o'clock?"

Comment by faul_sname on Hands-On Experience Is Not Magic · 2023-05-30T03:22:48.207Z · LW · GW

> Well to flesh that out, we could have an ASI that seems value aligned and controllable...until it isn't.

I think that scenario falls under the "worlds where iterative approaches fail" bucket, at least if prior to that we had a bunch of examples of AGIs that seemed and were value aligned and controllable, and the misalignment only showed up in the superhuman domain.

There is a different failure mode, which is "we see a bunch of cases of deceptive alignment in sub-human-capability AIs causing minor to moderate disasters, and we keep scaling up despite those disasters". But that's not so much "iterative approaches cannot work" as "iterative approaches do not work if you don't learn from your mistakes".

Comment by faul_sname on Hands-On Experience Is Not Magic · 2023-05-28T20:11:40.824Z · LW · GW

> But I suppose that's still sort of moot from an existential risk perspective because FOOM and sharp turns aren't really a requirement.

It's not a moot point, because a lot of the difficulty of the problem as stated here is the "iterative approaches cannot work" bit.

Comment by faul_sname on Seeking (Paid) Case Studies on Standards · 2023-05-27T01:21:46.983Z · LW · GW

This seems great!

One additional example I know of, which I do not have personal experience with but know that a lot of people do have experience with, is compliance with PCI DSS (for credit card processing), which does deal with safety in an adversarial setting where the threat model isn't super clear.

(my interactions with it look like "yeah that looks like a lot and we can outsource the risky bits to another company to deal with? great!")

Comment by faul_sname on Where do you lie on two axes of world manipulability? · 2023-05-27T00:37:55.949Z · LW · GW

Along the theme of "there should be more axes", I think one additional axis is "how path-dependent do you think final world states are". The negative side of this axis is "you can best model a system by figuring out where the stable equilibria are, and working backwards from there". The positive side of this axis is "you can best model a system as having a current state and some forces pushing that state in a direction, and extrapolating forwards from there".

If we define the axes as "tractable" / "possible" / "path-dependent", and work through each octant one by one, we get the following worldviews

  • -1/-1/-1: Economic progress cannot continue forever, but even if population growth is slowing now, the sub-populations that are growing will become the majority eventually, so population growth will continue until we hit the actual carrying capacity of the planet. Malthus was right, he was just early.
  • -1/-1/+1: Currently, the economic and societal forces in the world are pushing for people to become wealthier and more educated, all while population growth slows. As always there are bubbles and fads -- we had savings and loan, then the dotcom bubble, then the real estate bubble, then crypto, and now AI, and there will be more such fads, but none of them will really change much. The future will look like the present, but with more old people.
  • -1/+1/-1: The amount of effort to find further advances scales exponentially, but the benefit of those advances scales linearly. This pattern has happened over and over, so we shouldn't expect this time to be different. Technology will continue to improve, but those improvements will be harder and harder won. Nothing in the laws of physics prevents Dyson spheres, but our tech level is on track to reach diminishing returns far far before that point. Also by Laplace we shouldn't expect humanity to last more than a couple million more years.
  • -1/+1/+1: Something like a Dyson sphere is a large and risky project which would require worldwide buy-in. The trend now is, instead, for more and more decisions to be made by committee, and the number of parties with veto power will increase over time. We will not get Dyson spheres because they would ruin the character of the neighborhood.

    In the meantime, we can't even get global buy-in for the project of "let's not cook ourselves with global warming". This is unlikely to change, so we are probably going to eventually end up with civilizational collapse due to something dumb like climate change or a pandemic, not a weird sci-fi disaster like a rogue superintelligence or gray goo.
  • +1/-1/-1: I have no idea what it would mean for things to be feasible but not physically possible. Maybe "simulation hypothesis"?
  • +1/-1/+1: Still have no idea what it means for something impossible to be feasible. "we all lose touch with reality and spend our time in video games, ready-player-one style"?
  • +1/+1/-1: Physics says that Dyson spheres are possible. The math says they're feasible if you cover the surface of a planet with solar panels and use the power generated to disassemble the planet into more solar panels, which can be used to disassemble the planet even faster. Given that, the current state of the solar system is unstable. Eventually, something is going to come along and turn Mercury into a Dyson sphere. Unless that something is very well aligned with humans, that will not end well for humans. (FOOM)
  • +1/+1/+1: Arms races have led to the majority of improvements in the past. For example, humans are as smart as they are because a chimp having a larger brain let it predict other chimps better, and thus work better with allies and out-reproduce its competitors. The wonders and conveniences of the modern world come mainly from either the side-effects of military research, or from companies competing to better obtain peoples' money. Even in AI, some of the most impressive results are things like StyleGAN (a generative adversarial network) and alphago (a network trained by self-play i.e. an arms-race against itself). Extrapolate forward, and you end up with an increasingly competitive world. This also probably does not end well for humans (whimper).

I expect people aren't evenly distributed across this space. I think the FOOM debate is largely between the +1/+1/-1 and +1/+1/+1 octants. Also I think you can find doomers in every octant (or at least every octant that has people in it; I'm still not sure what the +1/-1/* octants would even mean).

Comment by faul_sname on [Market] Will AI xrisk seem to be handled seriously by the end of 2026? · 2023-05-26T03:52:10.407Z · LW · GW

Would "we get strong evidence that we're not in one of the worlds where iterative design is guaranteed to fail, and it looks like the group's doing the iterative design are proceeding with sufficient caution" qualify as a YES?

Comment by faul_sname on AI self-improvement is possible · 2023-05-23T06:10:15.517Z · LW · GW

Thanks for putting in the effort of writing this up.

Would you mind expanding on D:Prodigy? My impression is that most highly intelligent adults were impressive as children, but are more capable as adults than they were as children.

The phenomenon of child prodigies is indeed a real thing that exists. My impression of why that happens is that child and adult intellectual performance are not perfectly correlated, and thus the tails come apart. But I could be wrong about that, so if you have supporting material to that effect I'd be interested.

(as a note, I do agree that self-improvement is possible, but I think the shape of the curve is very important)

Comment by faul_sname on Tyler Cowen's challenge to develop an 'actual mathematical model' for AI X-Risk · 2023-05-17T08:29:48.443Z · LW · GW

One mathematical model that seems like it would be particularly valuable to have here is a model of the shape of the resources-invested vs. optimization-power curve. The reason I think an explicit model would be valuable there is that a lot of the AI risk discussion centers around recursive self-improvement. For example, instrumental convergence / orthogonality thesis / pivotal acts are relevant mostly in contexts where we expect a single agent to become more powerful than everyone else combined. (I am aware that there are other types of risk associated with AI, like "better AI tools will allow for worse outcomes from malicious humans / accidents". Those are outside the scope of the particular model I'm discussing).

To expand on what I mean by this, let's consider a couple of examples of recursive self-improvement.

For the first example, let's consider the game of Factorio. Let's specifically consider the "mine coal + iron ore + stone / smelt iron / make miners and smelters" loop. Each miner produces some raw materials, and those raw materials can be used to craft more miners. This feedback loop is extremely rapid, and once that cycle gets started the number of miners placed grows exponentially until all available ore patches are covered with miners.

For our second example, let's consider the case of an optimizing compiler like gcc. A compiler takes some code, and turns it into an executable. An optimizing compiler does the same thing, but also checks if there are any ways for it to output an executable that does the same thing, but more efficiently. Some of the optimization steps will give better results in expectation the more resources you allocate to them, at the cost of (sometimes enormously) greater required time and memory for the optimization step, and as such optimizing compilers like gcc have a number of flags that let you specify exactly how hard it should try.

Let's consider the following program:

#!/bin/bash
# Repeatedly rebuild gcc with the gcc installed by the previous iteration,
# raising the inlining threshold a little each time.
INLINE_LIMIT=1
# <snip gcc source download / configure steps>
while true; do
    make CC="gcc" CFLAGS="-O3 -finline-limit=$INLINE_LIMIT"
    make install
    INLINE_LIMIT=$((INLINE_LIMIT+1))
done

This is also a thing which will recursively self-improve, in the technical sense of "the result of each iteration will, in expectation, be better than the result of the previous iteration, and the improvements it finds help it more efficiently find future improvements". However, it seems pretty obvious that this "recursive self-improver" will not do the kind of exponential takeoff we care about.
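
As a toy illustration (with made-up numbers) of how much the shape matters: compare a compounding loop like the Factorio miners against a loop where each improvement is a fixed-size increment and takes 5x longer to find than the last.

def compounding_loop(total_time, step_time=1.0, factor=2.0):
    # Factorio-style: each improvement multiplies capability, and the time to
    # the next improvement stays constant.
    capability, t = 1.0, 0.0
    while t + step_time <= total_time:
        t += step_time
        capability *= factor
    return capability

def grinding_loop(total_time, first_step_time=1.0, increment=1.0, slowdown=5.0):
    # gcc-inline-limit-style: each improvement adds a constant increment, and
    # the time to find the next one grows 5x.
    capability, t, step = 1.0, 0.0, first_step_time
    while t + step <= total_time:
        t += step
        capability += increment
        step *= slowdown
    return capability

for budget in (5, 10, 20):
    print(budget, compounding_loop(budget), grinding_loop(budget))
# The compounding loop hits ~10^6 by t=20; the grinding loop never gets past a
# handful of increments because its search time explodes.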

The difference between these two cases comes down to the shapes of the curves. So one area of mathematical modeling I think would be pretty valuable would be

  1. Figure out what shapes of curves lead to gaining orders of magnitude more capabilities in a short period of time, given constant hardware
  2. The same question, but given the ability to rent or buy more hardware
  3. The same question, but now it can also invest in improving chip fabs, with the same increase in investment required for each improvement as we have previously observed for chip fabs
  4. What do the empirical scaling laws for deep learning look like? Do they look like they come in under the curves from 1-3? What if we look at the change in the best scaling laws over time -- where does that line point?
  5. Check whether your model now says that we should have been eaten by a recursively self improving AI in 1982. If it says that, the model may require additional work.

I will throw in an additional $300 bounty for an explicit model of this specific question, subject to the usual caveats (payable to only one person, can't be in a sanctioned country, etc), because I personally would like to know.

Edit: Apparently Tyler Cowen didn't actually bounty this. My $300 bounty offer stands but you will not be getting additional money from Tyler it looks like.

Comment by faul_sname on AGI-Automated Interpretability is Suicide · 2023-05-11T18:42:16.149Z · LW · GW

A system that looks like "actively try to make paperclips no matter what" seems like the sort of thing that an evolution-like process could spit out pretty easily. A system that looks like "robustly maximize paperclips no matter what" maybe not so much.

I expect it's a lot easier to make a thing which consistently executes actions which have worked in the past than to make a thing that models the world well enough to calculate expected value over a bunch of plans and choose the best one, and have that actually work (especially if there are other agents in the world, even if those other agents aren't hostile -- see the winner's curse).

Comment by faul_sname on Gradient hacking via actual hacking · 2023-05-10T20:05:44.835Z · LW · GW

Yeah if outputs of the training process are logged and processed by insecure software (which includes pretty much any software that handles "text" instead of "bytes") I think it's safe to say that superhuman-hackerbot-which-controls-its-own-outputs pwns the log processing machine.

BTW of possible interest to you is Automated Repair of Binary and Assembly Programs for Cooperating Embedded Devices. It's from 2013 so there are probably better examples by now, but the key passage (well it's buried deep within a paragraph of the "Limitations and Caveats" section, but it's key for this context) is:

> The fine granularity of repairs at the ASM and ELF levels may be a poor match for conventional test suites. For example, we have observed ASM-level repairs that change the calling convention of one particular function. Such a repair has no direct representation at the C source level, and a test suite designed to maximize statement coverage (for example) may not speak to the validity of such a repair. Producing efficient test suites that give confidence that an implementation adheres to its specification remains an open problem in software engineering. Our work shares this general weakness with all other approaches that use test suites or workloads to validate candidate repairs (e.g., Clearview [26] and GenProg [35]). In this regard, sandboxing is crucial: we have observed ASM variants that subvert the testing framework by deleting key test files, leading to perfect fitness for all subsequent variants until the test framework is repaired.

tl;dr: If you train a model to avoid test failures, and don't sandbox it properly, it will figure out that the tests won't fail if there are no tests. This is not a hypothetical future scenario that might happen or might be impossible, this is a thing that has already happened (10 years ago, even).
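
A toy version of that failure mode, with a hypothetical one-file "test suite" and a deliberately naive fitness function (requires pytest on the path):

import os, subprocess, tempfile, textwrap

def fitness(repo):
    # Deliberately naive: "perfect" as long as pytest reports no failures,
    # including the case where there are no tests at all.
    result = subprocess.run(["pytest", repo], capture_output=True)
    return 1.0 if b"failed" not in result.stdout else 0.0

repo = tempfile.mkdtemp()
test_file = os.path.join(repo, "test_prog.py")
with open(test_file, "w") as f:
    f.write(textwrap.dedent("""
        def test_always_fails():
            assert False
    """))

print(fitness(repo))   # 0.0 -- the test fails
os.remove(test_file)   # the "repair" the search stumbled onto
print(fitness(repo))   # 1.0 -- no tests, no failures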

Comment by faul_sname on Gradient hacking via actual hacking · 2023-05-10T05:45:10.128Z · LW · GW

If the threat model is literally "a pytorch bug allows a model like your nanoGPT example, which is made of simple tensor operations, to access memory from outside the process, via a bug that is triggered when the tensors have specific values", that nanoGPT script looks pretty safe.

If your threat model includes "a user on that machine does a cat ckpt.pt and that machine is vulnerable to something like CVE-2019-9535", then that script is unsafe if the model can precisely control at the bit level a modestly long (hundreds of bytes) contiguous section of its own weights (caveat: it can actually only control 8 of every 10 bytes this way in the particular encoding nanoGPT uses. Though that's still enough to fit escape sequences so it's probably still enough to pwn a vulnerable terminal emulator).

If your threat model is "the user will give a model-under-training the ability to execute code, and try to create a gradient where the model executes the most useful code" then the model will be able to execute code and it's not even worthy of the word "hack" because that's what the programmer meant to do.

Comment by faul_sname on Prizes for matrix completion problems · 2023-05-08T21:14:24.813Z · LW · GW

I think running a single strain minimization iteration on m points in n dimensions takes O(m*n) steps. So there would need to be some reason to expect that it would converge in some constant (though possibly large) number of steps.

Unless you're saying "for each node, run the strain minimization step until it converges, then do the same for each subsequent node". I don't know if the greedy algorithm works there, but if it does then maybe?

Also I kinda expect that if there's something that works in O(n*m*log(m)) that's probably fine.

(and yeah, "try the greedy exact solution for each node" was my "dumb solution" attempt).

Comment by faul_sname on LLM cognition is probably not human-like · 2023-05-08T05:42:15.649Z · LW · GW

> Suppose for concreteness, on a specific problem (e.g. Python interpreter transcript prediction), GPT-3 makes mistakes that look like humans-making-snap-judgement mistakes, and then GPT-4 gets the answer right all the time. Or, suppose GPT-5 starts playing chess like a non-drunk grandmaster.
>
> Would that result imply that the kind of cognition performed by GPT-3 is fundamentally, qualitatively different from that performed by GPT-4? Similarly for GPT-4 -> GPT-5.

In the case of the Python interpreter transcript prediction task, I think if GPT-4 gets the answer right all the time that would indeed imply that GPT-4 is doing something qualitatively different than GPT-3. I don't think it's actually possible to get anywhere near 100% accuracy on that task without either having access to, or being, a Python interpreter.

Likewise, in the chess example, I expect that if GPT-5 is better at chess than GPT-4, that will look like "an inattentive and drunk super-grandmaster, with absolutely incredible intuition about the relative strength of board-states, but difficulty with stuff like combinations (but possibly with the ability to steer the game-state away from the board states it has trouble with, if it knows it has trouble in those sorts of situations)". If it makes the sorts of moves that human grandmasters play when they are playing deliberately, and the resulting play is about as strong as those grandmasters, I think that would show a qualitatively new capability.

Also, my model isn't "GPT's cognition is human-like". It is "GPT is doing the same sort of thing humans do when they make intuitive snap judgements". In many cases it is doing that thing far far better than any human can. If GPT-5 comes out, and it can natively do tasks like debugging a new complex system by developing and using a gears-level model of that system, I think that would falsify my model.

Also also it's important to remember that "GPT-5 won't be able to do that sort of thing natively" does not mean "and therefore there is no way for it to do that sort of thing, given that it has access to tools". One obvious way for GPT-4 to succeed at the "predict the output of running Python code" task is to give it the ability to execute Python code and read the output. The system of "GPT-4 + Python interpreter" does indeed perform a fundamentally, qualitatively different type of cognition than "GPT-4 alone". But "it requires a fundamentally different type of cognition" does not actually mean "the task is not achievable by known means".

Also also also, I mostly care about this model because it suggests interesting things to do on the mechanistic interpretability front, which I am currently in the process of learning how to do. My personal suspicion is that the bags of tensors are not actually inscrutable, and that looking at these kinds of mistakes would make some of the failure modes of transformers no-longer-mysterious.

Comment by faul_sname on LLM cognition is probably not human-like · 2023-05-08T04:08:35.904Z · LW · GW

Great post!

> Would a human, asked to predict the next token of any of the sequences above, be likely to come up with similar probability distributions for similar reasons? Probably not, though depending on the human, how much they know about Python, and how much effort they put into making their prediction, the output that results from sampling from the human's predicted probability distribution might match the output of sampling text-davinci's distribution, in some cases. But the LLM and the human probably arrive at their probability distributions through vastly different mechanisms.

I don't think a human would come up with a similar probability distribution. But I think that's because asking a human for a probability distribution forces them to switch from the "pattern-match similar stuff they've seen in the past" strategy to the "build an explicit model (or several)" strategy.

I think the equivalent step is not "ask a single human for a probability distribution over the next token", but, instead, "ask a large number of humans who have lots of experience with Python and the Python REPL to make a snap judgement of what the next token is".

BTW rereading my old comment, I see that there are two different ways you can interpret it:

  1. "GPT-n makes similar mistakes to humans that are not paying attention[, and this is because it was trained on human outputs and will thus make similar mistakes to the ones it was trained on. If it were trained on something other than human outputs, like sensor readings, it would not make these sorts of mistakes.]".
  2. "GPT-n makes similar mistakes to humans that are not paying attention[, and this is because GPT-n and human brains making snap judgements are both doing the same sort of thing. If you took a human and an untrained transformer, and some process which deterministically produced a complex (but not pure noise) data stream, and converted it to an audio stream for the human and a token stream for the transformer, and trained them both on the first bit of it, they would both be surprised by similar bits of the part that they had not been trained on. ]."

I meant something more like the second interpretation. Also "human who is not paying attention" is an important part of my model here. GPT-4 can play mostly-legal chess, but I think that process should be thought of as more like "a blindfolded, slightly inebriated chess grandmaster playing bullet chess" than like "a human novice playing the best chess that they can".

I could very easily be wrong about that! But it does suggest some testable hypotheses, in the form of "find some process which generates a somewhat predictable sequence, train both a human and a transformer to predict that sequence, and see if they make the same types of errors or completely different types of errors".

Edit: being more clear that I appreciate the effort that went into this post and think it was a good post

Comment by faul_sname on Prizes for matrix completion problems · 2023-05-05T23:21:46.465Z · LW · GW

I think you can convert between the two representations in O(m) time, which would mean that any algorithm that solves either version in O(n*m) solves both in O(n*m).

Do you have some large positive and negative examples of the kinds of sparse matrix you're trying to check for the existence of a PSD completion on, or alternatively a method for generating such examples with a known ground truth? I have a really dumb idea for a possible algorithm here (that shamelessly exploits the exact shape of this problem in a way that probably doesn't generalize to being useful for broader problems like MDS) that I think would complete in approximately the time constraints you're looking for. It almost certainly won't work, but I think it's at least worth an hour of my time to check and figure out why (especially since I'm trying to improve my linear algebra skills anyway).

Edit: there's the obvious approach, which I'm trying, of "start with only 1s on the diagonal and then keep adding random entries until it no longer has a PSD completion, then removing random entries until it does, and repeat to build a test set" but I doubt that covers the interesting corners of the problem space.
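
Concretely, that generation loop might look something like the sketch below. This is a rough sketch rather than the code I'm actually running: it leans on cvxpy to decide PSD-completability as a small SDP feasibility problem, which is fine for building a test set even though it's nowhere near the O(n*m) target.

import numpy as np
import cvxpy as cp

def has_psd_completion(n, known):
    # known: dict mapping (i, j) with i <= j to the required entry value.
    X = cp.Variable((n, n), symmetric=True)
    constraints = [X >> 0] + [X[i, j] == v for (i, j), v in known.items()]
    problem = cp.Problem(cp.Minimize(0), constraints)
    problem.solve(solver=cp.SCS)
    return problem.status in (cp.OPTIMAL, cp.OPTIMAL_INACCURATE)

rng = np.random.default_rng(0)
n = 6
known = {(i, i): 1.0 for i in range(n)}        # start with only 1s on the diagonal
while has_psd_completion(n, known):            # add random entries until no completion exists
    i, j = sorted(rng.choice(n, size=2, replace=False))
    known[(i, j)] = rng.uniform(-2, 2)
negative_example = dict(known)
while not has_psd_completion(n, known):        # remove random entries until one exists again
    off_diagonal = [k for k in known if k[0] != k[1]]
    del known[off_diagonal[rng.integers(len(off_diagonal))]]
positive_example = dict(known)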

Edit 2: the really dumb thing does not work. I think I haven't ruled out that a slightly less dumb approach could work though?

Edit 3: never mind, my really dumb "solution" requires inverting a matrix that is, in the worst case, nxn, if e.g. you have an input that looks like

1 n n n n n n
n 1 - - - - n
n - 1 - - - n
n - - 1 - - n
n - - - 1 - n
n - - - - 1 n
n n n n n n 1

you'll have to invert 6 2x2 matrices and one each of 3x3 to 7x7 matrices.

Comment by faul_sname on Prizes for matrix completion problems · 2023-05-05T01:05:19.206Z · LW · GW

When I tried asking about this problem, and referenced the above comment about being able to assume that all entries on the diagonal are positive, GPT-4 said the following:

> If an n×n matrix is a symmetric positive semidefinite (PSD) matrix, it can be interpreted as a Gram matrix representing the dot products of a set of n vectors, and this configuration has 0 stress.
>
> To understand this, let's first consider what a Gram matrix is. Given a set of n vectors v_1, v_2, ..., v_n in a d-dimensional space, the Gram matrix G is an n×n matrix where the entry G(i, j) is the dot product of vectors v_i and v_j:
>
> G(i, j) = v_i • v_j
>
> A Gram matrix is always symmetric (i.e., G(i, j) = G(j, i)) and positive semidefinite.
>
> Now, let's consider the stress function in the context of multidimensional scaling (MDS). The stress function measures the difference between the given dissimilarities (distances) and the distances in the lower-dimensional space. When the stress is 0, this means that the distances in the lower-dimensional space exactly match the given dissimilarities.

Is GPT-4 correct? If so, am I interpreting it correctly that this problem could be rephrased as "given an incomplete set of m desired pairwise distances between n points, determine whether there exists some configuration of those points in an n-dimensional space such that the pairwise distances between those points are exactly as desired"?
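
As a quick numerical sanity check of the first claim (that any symmetric PSD matrix is a Gram matrix), factoring a random PSD matrix via its eigendecomposition and checking the dot products does seem to bear it out:

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
G = A @ A.T                                         # a random symmetric PSD matrix

eigvals, eigvecs = np.linalg.eigh(G)
V = eigvecs * np.sqrt(np.clip(eigvals, 0, None))    # row i of V is the vector v_i
assert np.allclose(V @ V.T, G)                      # G(i, j) == v_i . v_j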

Comment by faul_sname on Will GPT-5 be able to self-improve? · 2023-05-02T04:57:54.626Z · LW · GW

What if you had assigned less than 0.01% to "RSI is so trivial that the first kludged loop to GPT-4 by an external user without access to the code or weights would successfully self-improve"?

I would think you were massively overconfident in that. I don't think you could make 10,000 predictions like that and only be wrong once (for a sense of intuition, that's like making one prediction per hour, 8 hours per day, 5 days a week for 5 years, and being wrong once).

Unless you mean "recursively self-improve all the way to godhood" instead of "recursively self-improve to the point where it would discover things as hard as the first improvement it found in like 10% as much time as it took originally".

For reference, here's why I did give at least 10% to "the dumbest possible approach will work to get meaningful improvement": humans spent many thousands of years not developing much technology at all, and then, a few thousand years ago, suddenly started doing agriculture and building cities and inventing tools. The difference between humans who do agriculture and humans who don't isn't pure genetics -- humans came to the Americas over 20,000 years ago, agriculture has only been around for about 10,000 of those 20,000 years, and yet there were fairly advanced agricultural civilizations in the Americas thousands of years ago. Which says to me that, for humans at least, most of our ability to do impressive things comes from our ability to accumulate a bunch of tricks that work over time, and communicate those tricks to others.

So if it had turned out that "a dumb wrapper script, plus the ability to invoke copies of itself with different wrapper scripts, is enough for a language model to close the gap between the capabilities of the base language model and the capabilities of something as smart as the base language model but as coherent as a human", I would have been slightly surprised, but not surprised enough that I could have made 10 predictions like that and only been wrong about one of them. Certainly not 100 or 10,000 predictions like that.

Edit: Keep in mind that the dumbest possible approach of "define a JSON file that describes the tool and ensure that that JSON file has a link to detailed API docs" does work for teaching GPT-4 how to use tools.

Comment by faul_sname on Natural Selection vs Gradient Descent · 2023-05-01T23:33:52.959Z · LW · GW

Yeah, I personally think the better biological analogue for gradient descent is the "run-and-tumble" motion of bacteria.

Take an e. coli. It has a bunch of flagella, pointing in all directions. When it rotates its flagella clockwise, each of them ends up pushing in a random direction, which results in the cell chaotically tumbling without going very far. When it rotates its flagella counterclockwise, they get tangled up with each other and all end up pointing the same direction, and the cell moves in a roughly straight line. The more attractants and fewer repellants there are, the more the cell rotates its flagella counterclockwise.

And that's it. That's the entire strategy by which e. coli navigates to food.

Here's a page with an animation of how this extremely basic behavior approximates gradient descent.
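
Here's a minimal toy simulation of that strategy (made-up constants, a 2D world, and a single smooth attractant peak): the simulated cell never computes a gradient, it just tumbles more often when its last step made things worse, and it still drifts up the concentration gradient.

import numpy as np

rng = np.random.default_rng(0)

def concentration(pos):
    # Attractant concentration, peaking at (10, 10).
    return -np.linalg.norm(pos - np.array([10.0, 10.0]))

pos = np.zeros(2)
heading = rng.uniform(0, 2 * np.pi)
last_c = concentration(pos)
for _ in range(5000):
    c = concentration(pos)
    p_tumble = 0.1 if c > last_c else 0.7                 # tumble less when things are improving
    last_c = c
    if rng.random() < p_tumble:
        heading = rng.uniform(0, 2 * np.pi)               # tumble: random new direction
    pos = pos + 0.05 * np.array([np.cos(heading), np.sin(heading)])  # run: straight-line step

print(pos)   # wanders up the gradient and ends up hovering in the neighborhood of (10, 10)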

All that said, evolution looks kinda like gradient descent if you squint. For mind design, evolution would be gradient descent over the hyperparameters (and cultural evolution would be gradient descent over the training data generation process, and learning would be gradient descent over sensory data, and all of these gradients would steer in different but not entirely orthogonal directions).

Comment by faul_sname on Will GPT-5 be able to self-improve? · 2023-05-01T21:55:20.485Z · LW · GW

Yeah. I had thought that you used the wording "don't update me at all" instead of "aren't at all convincing to me" because you meant something precise that was not captured by the fuzzier language. But on reflection it's probably just that language like "updating" is part of the vernacular here now.

Sorry, I had meant that to be a one-off side note, not a whole thing.

The bit I was actually surprised by was that you seem to think there was very little chance that the crude approach could have worked. In my model of the world, "the simplest thing that could possibly work" ends up working a substantial amount of the time. If your model of the world says the approach of "just piling more hacks and heuristics on top of AutoGPT-on-top-of-GPT4 will get it to the point where it can come up with additional helpful hacks and heuristics that further improve its capabilities" almost certainly won't work, that's a bold and interesting advance prediction in my book.

Comment by faul_sname on Will GPT-5 be able to self-improve? · 2023-05-01T19:13:11.624Z · LW · GW

Hm, I think I'm still failing to communicate this clearly.

RSI might be practical, or it might not be practical. If it is practical, it might be trivial, or it might be non-trivial.

If, prior to AutoGPT and friends, you had assigned 10% to "RSI is trivial", and you make an observation of whether RSI is trivial, you should expect that

  • 10% of the time, you observe that RSI is trivial. You update to 100% to "RSI is trivial", 0% "RSI is practical but not trivial", 0% "RSI is impractical".
  • 90% of the time, you observe that RSI is not trivial. You update to 0% "RSI is trivial", 67% "RSI is practical but not trivial", 33% "RSI is impractical" (taking the prior split of the remaining 90% to have been 60% "practical but not trivial" / 30% "impractical", then renormalizing).

By "does your model exclude the possibility of RSI-through-hacking-an-agent-together-out-of-LLMs", I mean the following: prior to someone first hacking together AutoGPT, you thought that there was less than a 10% chance that something like that would work to do the task of "make and test changes to its own architecture, and keep the ones that worked" well enough to be able to do that task better.

Comment by faul_sname on Will GPT-5 be able to self-improve? · 2023-05-01T18:36:20.475Z · LW · GW

If you had said "very little evidence" I would not have objected. But if there are several possible observations which update you towards RSI being plausible, and no observations that update you against RSI being plausible, something has gone wrong.

Comment by faul_sname on Will GPT-5 be able to self-improve? · 2023-05-01T18:31:49.179Z · LW · GW

If I had never seen a glider before, I would think there was a nonzero chance that it could travel a long distance without self-propulsion. So if someone runs the experiment of "see if you can travel a long distance with a fixed wing glider and no other innovations", I could either observe that it works, or observe that it doesn't.

If you can travel a long distance without propulsion, that obviously updates me very far in the direction of "fixed-wing flight works".

So by conservation of expected evidence, observing that a glider with no propulsion doesn't make it very far has to update me at least slightly in the direction of "fixed-wing flight does not work". Because otherwise I would expect to update in the direction of "fixed-wing flight works" no matter what observation I made.

Note that OP said "does not update me at all" not "does not update me very much" -- and the use of the language "update me" implies the strong "in a bayesian evidence sense" meaning of the words -- this is not a nit I would have picked if OP had said "I don't find the failures of autogpt and friends to self-improve to be at all convincing that RSI is impossible".

Comment by faul_sname on Will GPT-5 be able to self-improve? · 2023-05-01T17:56:52.645Z · LW · GW
> 1. Initial attempts from API-users putting LLMs into agentic wrappers (e.g. AutoGPT, BabyAGI) don't seem to have made any progress.
> • I would not expect those attempts to work, and their failures don't update me at all against the possibility of RSI.

If the failures of those things to work don't update you against RSI, then if they succeed that can't update you towards the possibility of RSI.

I personally would not be that surprised, even taking into account the failures of the first month or two, if someone manages to throw together something vaguely semi-functional in that direction, and if the vaguely semi-functional version can suggest improvements to itself that sometimes help. Does your model of the world exclude that possibility?

Comment by faul_sname on Realistic near-future scenarios of AI doom understandable for non-techy people? · 2023-04-28T19:05:34.015Z · LW · GW

I don't think it is pointless to focus on specific ways everyone dies, unless there is a single strategy that addresses every possible way everyone dies.

If FOOM isn't likely but something like this is likely, it seems really unlikely to me that the approach of "continue to focus on strategies that rely on a single agent having a high level of control over the world" is still optimal (or, more accurately, it's probably still a good idea to have some people working on that but not all the people).

Comment by faul_sname on Contra Yudkowsky on Doom from Foom #2 · 2023-04-27T09:28:13.153Z · LW · GW

"Work on the safety of an ecosystem made up of a large number of in-some-ways-superhuman-and-in-other-ways-not AIs" seems like a very different problem than "ensure that when you build a single coherent, effectively-omniscient agent, you give it a goal that does not ruin everything when it optimizes really hard for that goal".

There are definitely parallels between the two scenarios, but I'm not sure a solution for the second scenario would even work to prevent an organization of AIs with cognitive blind spots from going off the rails.

My model of jacob_cannell's model is that the medium-term future looks something like "ad-hoc organizations of mostly-cooperating organizations of powerful-but-not-that-powerful agents, with the first organization to reach a given level of capability being the one that focused its resources on finding and using better coordination mechanisms between larger numbers of individual processes rather than the one that focused on raw predictive power", and that his model of Eliezer goes "no, actually focusing on raw predictive power is the way to go".

And I think the two different scenarios do in fact suggest different strategies.

Comment by faul_sname on Mental Models Of People Can Be People · 2023-04-25T04:18:18.058Z · LW · GW

What are your reasons for thinking that mental models are closer to markov models than tulpas?

I think this may just be a case of the typical mind fallacy: I don't model people in that level of detail in practice and I'm not even sure I'm capable of doing so. I can make predictions about "the kind of thing a person might say" based on what they've said before, but those predictions are more at the level of turns-of-phrase and favored topics of conversation -- definitely nothing like "long conversations on a level above GPT-4".

The "why people value remaining alive" bit might also be a typical mind fallacy thing. I mostly think about personal identity in terms of memories + preferences.

I do agree that my memories alone living on after my body dies would not be close to immortality to me. However, if someone were to train a multimodal ML model that can produce actions in the world indistinguishable from the actions I produce (or even "distinguishable but very very close"), I would consider that to be most of the way to effectively being immortal, assuming that model were actually run and had the ability to steer the world towards states which it prefers. Conversely, I'd consider it effectively-death to be locked in a box where I couldn't affect the state of the outside world and would never be able to exit the box. The scenario "my knowledge persists and can be used by people who share my values" would be worse, to me, than remaining alive but better than death without preserving my knowledge for people who share my values (and by "share my values" I basically just mean "are not actively trying to do things that I disprefer specifically because I disprefer them").

Comment by faul_sname on Mental Models Of People Can Be People · 2023-04-25T03:14:25.159Z · LW · GW

> My argument in this post is that there do exist mental models of people that are sufficiently detailed to qualify as conscious moral patients;

Sounds reasonable for at least some values of "sufficiently detailed". At the limit, I expect that if someone had a computer emulation of my nervous system and all sensory information it receives, and all outputs it produces, and that emulation was good enough to write about its own personal experience of qualia for the same reasons I write about it, that emulation would "have" qualia in the sense that I care about.

At the other limit, a markov model trained on a bunch of my past text output which can produce writing which kinda sorta looks like it describes what it's like to have qualia almost certainly does not "have" qualia in the sense that I care about (though the system-as-a-whole that produced the writing, i.e. "me originally writing the stuff" plus "the markov model doing its thing" does have qualia -- they live in the "me originally experiencing the stuff I wrote about" bit).

In between the two extremes you've got stuff like tulpas, which I suspect are moral patients to the extent that it makes sense to talk about such a thing. That said, a lot of the reasons humans want to continue their thread of experience probably don't apply to most tulpas (e.g. when a human dies, the substrate they were running on stops functioning, all their memories are lost, and they lose their ability to steer the world towards states they prefer, whereas if a tulpa "dies" its memories are retained and its substrate remains intact, though I think it still loses its ability to steer the world towards its preferred states).

I am hesitant to condemn anything which looks to me like "thoughtcrime", but to the extent that anything could be a thoughtcrime, "create tulpas and then do things that deeply violate their preferences" seems like one of those things. So if you're doing that, maybe consider doing not-that?

> I also argue that this is common enough that authors good at characterization probably frequently create and destroy such people; finally, I argue that this is a bad thing.

"Any mental model of a person" seems to me like drawing the line quite a bit further than it should be drawn. I don't think mental models actually "have experiences" in any meaningful sense -- I think they're more analogous to markov models than they are to brain emulations (with the possible exception of tulpas and things like that, but those aren't the sort of situations you find yourself in accidentally).

Comment by faul_sname on Making Nanobots isn't a one-shot process, even for an artificial superintelligance · 2023-04-25T02:21:07.924Z · LW · GW

If I understand your argument, it is as follows:

  1. Self-replicating nanotech that also does something useful and also also outcompetes biological life and also also also faithfully self-replicates (i.e. you don't end up in a situation where the nanobots that do the "replicate" task better at the cost of the "do something useful" task replicate better and end up taking over) is hard enough that even if it's technically physically possible it won't be the path that the minimum-viable-superintelligence takes to gaining power.
  2. There probably isn't any other path to "sufficient power in the physical world to make more computer chips" that does not route through "humans do human-like stuff at human-like speeds for you"
  3. That implies that the sequence of events "the world looks normal and not at all like all of the chip fabs are fully automated, and then suddenly all the humans die of something nobody saw coming" is unlikely to happen.
  4. But this is a contingent fact about the world as it is today, and it's entirely possible to screw up this nice state of affairs, accidentally or intentionally.
  5. Therefore, even if you think that you are on a path to a pivotal act, if your plan starts to look like "and in step 3 I give my AI a fully-automated factory which can produce all components of itself given sufficient raw materials and power, and can also build a chip fab, and then in step 4 I give my AI instructions that it should perform an act which looks to me like a pivotal act, which it will surely do by doing something amazing with nanotech", you should stop and reevaluate your plan.

Does this sound like an accurate summary to you?

Also, as a side note is there accepted terminology to distinguish between "an act that the actor believes will be pivotal" and "an act that is in fact pivotal"? I find myself wanting to make that distinction quite a bit, and it would be nice if there were accepted terminology.

Comment by faul_sname on A Hypothetical Takeover Scenario Twitter Poll · 2023-04-24T21:43:27.971Z · LW · GW

But also I think that if your model doesn't explain why we don't see massively more of that sort of stuff coming from humans, that means your model has a giant gaping hole in the middle of it, and any conclusions you draw from that model should keep in mind that the model has a giant gaping hole in it.

(My model of the world has this giant gaping hole too. I would really love it if someone would explain what's going on there, because as far as I can tell from my own observations, the vulnerable world hypothesis is just obviously true, but also I observe very different stuff than I would expect to observe given the things which convince me that the vulnerable world hypothesis is true).

Comment by faul_sname on We Need To Know About Continual Learning · 2023-04-24T17:27:29.175Z · LW · GW

Ah, so the point was whether that had been explored publicly on the very largest language models that exist, because of the whole "sometimes approaches that didn't work at small scale start working when you throw enough compute at them" thing? Makes sense.

Comment by faul_sname on We Need To Know About Continual Learning · 2023-04-24T06:16:38.078Z · LW · GW

This was the very first thing I thought of when language models came to my attention as "hey this looks like it actually might be the thing that the future looks like" (years ago). Since I'm not particularly smart or particularly well-informed, I conclude that I was not the first person to come up with this idea (or the tenth, or even the ten-thousandth). I strongly suspect that the simplest possible approach of "just turn on backprop" was tried within the first couple of days of the weights of a GPT model being available. For context, nostalgebraist-autoresponder has been live on Tumblr since late 2019.

I do concur with you that this is an important thing to explore. However, I am quite confident that "do the thing that is obvious to someone with no background encountering the field for the first time" is not an effective approach.

When I briefly looked into the academic research on this, I picked up the following general impression:

  1. This is a task a lot of people have poured a lot of time into. Search terms are "online learning", "incremental learning", "continual learning", and "active learning".
  2. The primary problem with this approach is that as the model learns new stuff, it forgets old stuff that it is no longer being trained on. The search term here is "catastrophic forgetting". There are also several less-critical problems which would still be blockers if catastrophic forgetting wasn't an issue, mostly related to the model going off the rails more and more over time - search terms include "bias amplification", "overfitting", and "hallucination". Some argue that this is also a problem in humans.
  3. There have been some clever attempts to get around this. One example of a particularly clever idea from French and Chater (2002) is "let's use a clever metric of how important the old stuff is to the network to try to get it to forget less stuff" (a minimal sketch of that idea appears just after this list). I notice that this clever technique is not in use despite there being a Deepmind publication about it in 2017. Search term: "elastic weight consolidation", and, in terms of the particular failure modes, I believe "task-agnostic/task-free continual learning" and "scalability".
  4. Lots of people have had the idea "well humans sleep and dream, maybe something like that?". Search terms: "knowledge distillation", "experience replay".
  5. Also lots of people have had the idea "well what if we hack on an external memory store in a completely unprincipled way". And this seems to be how AutoGPT works. Also people have tried to do it in a principled way, search term "memory-augmented neural networks".
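
To make the "clever metric" idea in point 3 concrete, here is a minimal sketch of the elastic-weight-consolidation penalty. It's my own paraphrase in PyTorch rather than the DeepMind implementation, and it assumes fisher (per-parameter importance estimates) and old_params (a frozen copy of the weights) were computed at the end of training on the previous task.

import torch

def ewc_penalty(model, old_params, fisher, lam=1000.0):
    # Penalize moving each parameter away from its old value, weighted by how
    # important that parameter was for the old task.
    penalty = 0.0
    for name, p in model.named_parameters():
        penalty = penalty + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return (lam / 2) * penalty

# During training on the new task:
#   loss = new_task_loss + ewc_penalty(model, old_params, fisher)
#   loss.backward()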

Basically, the impression I've picked up is

  1. This key problem seems like it's really hard to solve in a principled way.
  2. Humans are complete disaster monkeys when it comes to ML research, and as such, if there was an obvious way to write a ML program on your desktop computer that rapidly bootstraps its way to godhood, someone would have already done it.
  3. Per 1 and 2, we totally have already "explored (publicly?)" the potential of “switching on” backpropagation/training while in inference mode (if by "we" you include "the internet" and not just "lesswrong in particular").

Comment by faul_sname on Contra Yudkowsky on AI Doom · 2023-04-24T03:38:20.455Z · LW · GW

The argument that shifts me the most away from thinking of it with the shoggoth-mask analogy is the implication that a mask has a single coherent actor behind it. But if you can avoid that mental failure mode I think the shoggoth-mask analogy is basically correct.

Comment by faul_sname on Contra Yudkowsky on AI Doom · 2023-04-24T03:32:16.378Z · LW · GW

Yeah, that one's "the best example of the behavior that I was able to demonstrate from scratch with the openai playground in 2 minutes" not "the best example of the behavior I've ever seen". Mostly the instances I've seen were chess-specific results on a model that I specifically fine-tuned on Python REPL transcripts that looked like

>>> import chess
>>> board = chess.Board()
>>> board.push_san('Na3')
Move.from_uci('b1a3')
>>> print(board.piece_at(chess.parse_square('b1')))

and it would print N instead of None (except that in the actual examples it mostly was a much longer transcript, and it was more like it would forget where the pieces were if the transcript contained an unusual move or just too many moves).

For context I was trying to see if a small language model could be fine-tuned to play chess, and was working under the hypothesis of "a Python REPL will make the model behave as if statefulness holds".

And then, of course, the Othello paper came out, and bing chat came out and just flat out could play chess without having been explicitly trained on it, and the question of "can a language model play chess" became rather less compelling because the answer was just "yes".

But that project is where a lot of my "the mistakes tend to look like things a careless human does, not weird alien mistakes" intuitions ultimately come from.

Comment by faul_sname on Contra Yudkowsky on AI Doom · 2023-04-24T02:36:03.324Z · LW · GW

One thing that I have observed, working with LLMs, is that when they're predicting the next token in a Python REPL they also make kinda similar mistakes to the ones that a human who wasn't paying that much attention would make. For example, consider the following

>>> a, b = 3, 5    # input
>>> a + b          # input
8                  # input
>>> a, b = b, a    # input
>>> a * b          # input
15                 # prediction (text-davinci-003, temperature=0, correct)
>>> a / b          # input
1.0                # prediction (text-davinci-003, temperature=0, incorrect but understandable mistake)
>>> a              # input
5                  # prediction (text-davinci-003, temperature=0, correct)
>>> a / b          # input
1.0                # prediction (text-davinci-003, temperature=0, incorrect but understandable mistake)
>>> a              # input
5                  # prediction (text-davinci-003, temperature=0, correct)
>>> a / b          # input
1.0                # prediction (text-davinci-003, temperature=0, incorrect but understandable mistake)
>>> b              # input
3                  # prediction (text-davinci-003, temperature=0, correct)
>>> a / b          # input
1.6666666666666667 # prediction (text-davinci-003, temperature=0, now correct -- looks like a humanish "oh whoops lol")
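
(For reference, completions like the ones annotated above can be elicited with something like the following -- a sketch assuming the pre-1.0 openai Python client and the availability of text-davinci-003, not the exact code used.)

import openai  # pre-1.0 client assumed; openai.api_key must be set

prompt = """>>> a, b = 3, 5
>>> a + b
8
>>> a, b = b, a
>>> a * b
"""

resp = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    temperature=0,   # greedy-ish decoding, as in the annotations above
    max_tokens=5,
    stop="\n",       # stop once the predicted output line is complete
)
print(resp["choices"][0]["text"].strip())  # per the annotations above, this should print "15"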

I expect that most examples in the training data of "this looks like an interaction with the Python REPL" were in fact interactions with the Python REPL. To the extent that GPT-N models do make human-like mistakes when predicting non-human-like data (instead of predicting that non-human-like data correctly, which is what their loss function wants them to do), I think that does serve as nonzero evidence that their cognition is "human-like" in the specific narrow sense of "mirroring our quirks and limitations".

More generally, I think it's particularly informative to look at the cases where the thing GPT-n does is different from the thing its loss function wants it to do. The extent to which those particular cases look like human failure modes is informative about how "human-like" the cognition of GPT-n class models is. (As a note, the SolidGoldMagikarp class of failure modes is extremely non-human-like, and so that observation caused me to update more toward the "shoggoth which can exhibit human behavior among many other behaviors" view. But I haven't actually seen a whole lot of failure modes like that in normal use, and I have seen a bunch of human-not-paying-attention type failure modes in normal use.)

Comment by faul_sname on Contra Yudkowsky on AI Doom · 2023-04-24T02:17:13.109Z · LW · GW

I conclude something more like "the brain consumes perhaps 1 to 2 OOM less energy than the biological limits of energy density for something of its size, but is constrained to its somewhat-lower-than-maximal energy density due in part to energy availability considerations", but I suspect that this is more of a figure/ground type of disagreement about which things are salient to look at than a factual disagreement.

That said @jacob_cannell is likely to be much more informed in this space than I am -- if the thermodynamic cooling considerations actually bind much more tightly than I thought, I'd be interested to know that (although not necessarily immediately, I expect that he's dealing with rather a lot of demands on his time that are downstream of kicking the hornet's nest here).

Comment by faul_sname on Contra Yudkowsky on AI Doom · 2023-04-24T01:56:04.541Z · LW · GW

I don't think heat dissipation is actually a limiting factor for humans as things stand right now. Looking at the heat dissipation capabilities of a human brain from three perspectives (maximum possible heat dissipation by sweat glands across the whole body, maximum sustained power output actually achieved by humans in practice, and maximum heat transfer from the brain to arterial blood at current-human levels of arterial bloodflow), none of them look to me to be anywhere near as low as the ~20 W the human brain consumes.

  • Based on sweat production of athletic people reaching 2 L per hour, evaporative cooling gives an estimate of ~1 kW of sustained cooling capacity for an entire human.
  • 5 watts per kg seems to be pretty close to the maximum power output that well-trained humans can actually sustain for a full hour, which suggests that a 70 kg human has at least 350 watts of sustained cooling capacity (and probably more, both because muscles are well under 100% efficient and because the limiting factor does not seem to be overheating).
  • Bloodflow to the brain is about 45 L/h, and brains tolerate temperature increases of 3-4ºC, so working backwards from that we get that a 160 W brain would run about 3ºC hotter than arterial blood, assuming arterial bloodflow is the primary heat remover (a rough back-of-envelope version of these numbers is sketched below). Probably add in 20-100 watts to account for sweat dissipation from the head. And also the carotid artery is less than a cm in diameter, so bloodflow to the brain could probably be substantially increased if there were evolutionary pressure in that direction.
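
A minimal sketch of the arithmetic behind the first and third bullets (the latent heat, specific heat, and density values are standard textbook numbers; everything else is the rough assumptions stated above):

# Back-of-envelope version of the cooling estimates above.
LATENT_HEAT_J_PER_KG = 2.26e6          # latent heat of vaporization of water
BLOOD_SPECIFIC_HEAT_J_PER_KG_K = 3600  # roughly that of water
BLOOD_DENSITY_KG_PER_L = 1.05

# Whole-body evaporative cooling at ~2 L of sweat per hour.
sweat_cooling_W = (2.0 / 3600) * LATENT_HEAT_J_PER_KG
print(f"sweat cooling: ~{sweat_cooling_W:.0f} W")  # ~1.3 kW, same ballpark as the ~1 kW above

# Heat removable by ~45 L/h of arterial blood at a 3 K brain-to-blood temperature difference.
blood_kg_per_s = 45 * BLOOD_DENSITY_KG_PER_L / 3600
arterial_cooling_W = blood_kg_per_s * BLOOD_SPECIFIC_HEAT_J_PER_KG_K * 3
print(f"arterial cooling at +3 K: ~{arterial_cooling_W:.0f} W")  # ~140 W, i.e. ~160 W at a ~3.4 K rise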

Brains in practice produce about 20 W of heat, so it seems likely to me that energy consumption could increase by at least one order of magnitude without causing the brain to cook itself, if there were strong enough selection pressure to use that much energy (though probably not by two orders of magnitude).

Getting rid of the energy constraint would help though. Proof of concept: ten humans take more energy to run than one human does, and can do more thinking than one human.

I do also find it quite likely that skull size is the most tightly binding constraint for humans -- we have smaller and very differently tuned neurons compared to other mammals, and I expect that the drive for smaller neurons in particular is downstream of space being very much at a premium, even more so than energy.

Further evidence for the "space, rather than energy expenditure or cooling, is the main binding constraint" hypothesis is the existence of fontanelles: human brains continue to grow after birth, and the skull is not entirely solid in order to allow for that. A skull that does not fully protect the brain seems like a very expensive adaptation, so it's probably buying something quite valuable.

Comment by faul_sname on Could a superintelligence deduce general relativity from a falling apple? An investigation · 2023-04-24T00:04:10.481Z · LW · GW

Response on reading the rest of the post: "lol".

Remember those pictures from earlier? Well I confess, I pulled a little trick on the reader here. I know for a fact that it is impossible to derive general relativity from those two pictures, because neither of them are real. The apple is from this CGI video, the grass is from this blender project. In neither case are Newtonian gravity or general relativity included in the relevant codes.

And if you are in a simulation, Ockham's razor becomes significantly stronger. If your simulation is on the scale of a falling apple, programming general relativity into it is a ridiculous waste of time and energy. The simulators will only input the simplest laws of physics necessary for their purposes.

I do agree that the AI would probably not deduce GR from these images, since I don't think there will be anything like absorption spectra or diffraction patterns accidentally encoded into them, as there might be in real images.

I don't think "That Alien Message" meant to say anything like "an alien superintelligence could deduce information that was not encoded in any form in the message it received". I think it mainly just meant to say "an entity with a bunch more available compute than a human, the ability to use that compute efficiently could extract a lot more information out of a message if it spent a lot of time on analysis than a human glancing at that image would extract".

Comment by faul_sname on Could a superintelligence deduce general relativity from a falling apple? An investigation · 2023-04-23T23:55:40.933Z · LW · GW

I encourage you to ponder these images in detail. Try and think for yourself the most plausible method for a superintelligence to deduce general relativity from one apple image and one grass image.

Stopping here to add some thoughts, since this is actually a topic that has come up here before, in a pretty similar form: "here is a large blob of binary data that is not in any standard format, can you tell me what it represents?"

A lot of this stuff is not actually original thought on my part, but instead corresponds to the process I used to figure out what was happening with that binary data blob the last time this came up. Spoiler warning if you want to try your own hand at the challenge from that post: the following links contain spoilers about the challenge.

  • Link 1: note that the image labeled "option 2", and the associated description, should have been the very last thing, but Imgur is doing some dumb reordering thing. Aside from that everything is in the correct order.
  • Link 2: likewise is rendering in a slightly janky order.

The most plausible method to me would involve noticing that the RGB channels seem to be measuring three different "types" of "the same thing" in some sense -- I would be very unsurprised if "there are three channels, the first one seems to have a peak activation at 1.35x the [something] of the third, and the second one seems to have a peak activation at 1.14x the [something] of the third" is a natural conclusion from looking at that picture.

From there I actually don't have super strong intuitions about whether "build a 3-d model with a fairly accurate estimation of the shape of the frequency x intensity curve that results in the sensor readings at each pixel" is viable. If it is not viable, I think it mostly just ends there.

If it is viable, I think the next step depends on the nature of the light source.

If the light source is a black-body source (e.g. an incandescent bulb or the sun), I think the black-body spectrum is simple enough that it becomes an obvious hypothesis. In that case, the next question is whether the camera is good enough to detect things like the absorption spectrum of hydrogen or tungsten (for the sun or an incandescent bulb respectively), and, if so, whether the intelligence comes up with that hypothesis.
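
As a toy illustration of why the black-body spectrum is a "simple" hypothesis: Planck's law is one line of code, and the relative responses of three channels peaked at different wavelengths already distinguish sun-like from incandescent-bulb-like light. The wavelengths and temperatures below are illustrative assumptions, not anything derived from the images.

import math

h, c, k = 6.626e-34, 2.998e8, 1.381e-23  # Planck constant, speed of light, Boltzmann constant

def planck(wavelength_m, temp_K):
    # Spectral radiance of an ideal black body at the given wavelength and temperature.
    return (2 * h * c**2 / wavelength_m**5) / (math.exp(h * c / (wavelength_m * k * temp_K)) - 1)

for temp_K, label in [(5800, "sun-like"), (2700, "incandescent-bulb-like")]:
    r, g, b = (planck(w, temp_K) for w in (600e-9, 540e-9, 450e-9))  # rough R/G/B peak wavelengths
    print(f"{label:24s} R/B = {r/b:.2f}  G/B = {g/b:.2f}")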

If the light source is a fluorescent light, there's probably some physics stuff you can figure out from that, but I don't actually know enough physics to have any good hypotheses about what that stuff is.

The water droplets on the apple also make me expect that there may be some interesting diffraction or refraction or other optical phenomena going on. But the droplets aren't actually that many pixels, so there may just flat out not be enough information there.

The "blades of grass" image might tell you interesting stuff about how light works but I expect the "apple with drops of water" image to be a lot more useful probably.

Anyway, end braindump, time to read the rest of the post.

Comment by faul_sname on AI #8: People Can Do Reasonable Things · 2023-04-21T20:02:03.048Z · LW · GW

A lot of my hope for "humans do not go extinct within the next 50 years" looks something like that, yeah (a lot of the rest is in "it turns out that language models are just straightforwardly easy to align, and that it's just straightforwardly easy to teach them to use powerful tools"). If it turns out that "learn a heuristic that you should avoid irreversible actions that destroy complex and finely-tuned systems" is convergent, that could maybe look like the "human reserve".

There's an anthropic argument that if that's what the future looks like, most humans that ever live would live on a human reserve, and as such we should be surprised that we're not. But I'm kinda suspicious of anthropic arguments.

Comment by faul_sname on AI #8: People Can Do Reasonable Things · 2023-04-21T07:10:06.732Z · LW · GW

The most compelling-to-me argument I've seen in that vein is that human civilization is currently, even without AI, on a trajectory to demand more and more energy, and eventually that will involve doing things on a scale sufficient to significantly change the amount of sunlight that reaches the surface of the Earth.

Humans probably won't do that, because we live here (though even there, emphasis on "probably" -- we're not exactly doing great in terms of handling climate change from accidentally changing the amount of CO2 in the atmosphere, and while that's unlikely to be an existential threat it's also not a very good sign for what will happen when humans eventually scale up to using 1000x as much energy).

An AI that runs on silicon can survive in conditions that humans can't survive in, and so its long-term actions probably look bad for life on Earth unless it specifically cares about leaving the Earth habitable.

This argument probably holds even in the absence of a single coherent AI that seizes control of the future, as long as things-which-need-earthlike-conditions don't retain enough control of the future.

My model is that the relevant analogy is not "human relationship with mice in general", it's with "human relationship with mice that live on a patch of ground where we want to build a chip fab, and also there's nowhere else for the mice to go".

Comment by faul_sname on A test of your rationality skills · 2023-04-20T22:13:58.766Z · LW · GW

I have opinions on the object-level here, but I concur that this is probably more of a test of "how familiar are you with what is and is not normal in a high-stakes cash game" and also "how familiar are you with the specific math" than of more general rationality.

Comment by faul_sname on "Aligned" foundation models don't imply aligned systems · 2023-04-19T23:27:21.567Z · LW · GW

All of those remarks look correct to me. Though "at the right level of generality and meta" is doing a lot of the work.

Comment by faul_sname on "Aligned" foundation models don't imply aligned systems · 2023-04-19T21:07:09.400Z · LW · GW

I claim that a company that uses the strategy of "come up with a target KPI and then implement every possible dark pattern which could plausibly lead to that KPI doing what we want it to do" will be outperformed by a company which uses the strategy of "come up with a bunch of things to do which will plausibly change the value of that KPI, try all of them out on a small scale, and then scale up the ones that work and scale down the ones that don't work".

For context on what's driving my intuitions here: I at one point worked at a startup where one of the cofounders pretty much operated under the philosophy of "let's look at what other companies vaguely in our space do, paying particular attention to things which look like they're intended to trick customers out of their money, and implement those things in our own product as faithfully as possible". That strategy did in fact sometimes work, but in most cases it significantly hurt retention while being minimally useful for revenue (or, in a couple of cases, it hurt both retention and revenue).

In the language of your steering systems post (thank you for writing that, BTW), I expect that a company where the pruning system is "try things out at small scale in the real world and iterate" will outperform even a human who has a very good world-model.

I actually suspect that this is a more general disagreement -- I think that, in complicated domains, the approach of "figure out what things work locally, do those things, and iterate" outperforms the approach of "look at the problem, work really hard on coming up with an explicit model of the reward landscape, and then do the optimal thing according to your model". Obviously you can outperform either approach in isolation by combining them, but I think that the best performance is generally far to the "try things and iterate" side. If that's still a thing you disagree with, even in that framing, I suspect that's a useful crux for us to explore more.

Edit: To be more explicit, I think that corporations are more powerful at steering the future towards a narrow space than individual humans are because they are able to try out more things than any individual human, not because they have a better internal model of the world or a better process for deciding which of two atomic, mutually exclusive plans to execute.

Comment by faul_sname on "Aligned" foundation models don't imply aligned systems · 2023-04-19T20:18:46.000Z · LW · GW

Corporations often control lots of resources that give them a wide range of options and opportunities to make decisions and take actions. But which actions to take are ultimately made by humans, and I believe the prevailing theory of corporate management is that a single CEO empowered to make decisions is better than any kind of decision-by-committee or consensus-based decision-making.

My model of how corporations work is very different than this, and I think our different models might be driving the disagreement. Specifically, my model is roughly as follows.

To the extent that big, successful companies have a "goal", that "goal" is usually to make enough money to grow and keep the lights on. A large corporation is much more successful at coming up with and performing actions which cause the corporation to keep existing and growing than any individual human in that corporation would be.

I agree that corporations are worse at explicit consequentialist reasoning than the individual humans inside the corporation. Most corporations do have things like "quarterly roadmaps" and "business plans" and other forms of explicit plans with contingency plans for how exactly they intend to navigate to a future desirable-to-them world-state, but I think the documented plan is in almost all cases worse than the plan that lives in the CEO's head.

I claim that this does not matter, because neither the explicit written plan nor the hidden plan in the CEO's head is the primary driver of what the corporation actually does. Instead, I think that almost all of the optimization pressure an organization exerts comes from having a large number of people trying things, seeing what works and what doesn't work, and then doing more of the kinds of things that worked and fewer of the things that didn't work.

Let's consider a concrete example. Imagine a gym chain called "SF Fitness". About half of the revenue of "SF Fitness" comes from customers who are paying a monthly membership but have not attended the gym in quite a while. We observe that "SF Fitness" has made the process of cancelling a membership extremely obnoxious -- to cancel, a customer must navigate a confusing phone tree.

I expect that the "navigate a confusing phone tree" policy was not put in place by the CEO saying "we could make a lot of money by making it obnoxious to cancel your membership, so I will send a feature request to make a confusing phone tree". Instead, I expect that the origin of the confusing phone tree was something like

  1. The CEO set a KPI of "minutes of customer service time spent per month of membership". This created an incentive for specific employees, whose bonuses or status were tied to that KPI.
  2. One obvious way of reducing CS time was to add a simple menu for callers which allowed them to do certain things without needing to interact with a person (e.g. get gym hours), and which directed calls to the correct person if they did in fact need to go to a real person (e.g. correct language, department).
  3. There was continued testing of which options, and which order of options, worked best to reduce CS time.
  4. The CEO set a KPI of "fraction of customers cancelling their membership each month",
  5. Changes to the phone tree which made that metric worse (i.e. increased cancellation rates) were more likely to be reverted than changes which made it better.
  6. After a bunch of iterations, "SF Fitness" had a deep, confusing, nonsensical phone tree without any single person having made the decision to create it (a toy sketch of this dynamic follows the list). That phone tree is more effective per unit of resources spent at reducing cancellations than any strategy the CEO would have come up with themselves.
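
A toy sketch of the selection dynamic in steps 3-5 (entirely made-up numbers; the point is just that greedy keep-if-the-metric-improved iteration needs no one to hold the final design in their head):

import random

def measured_cancellation_rate(phone_tree_depth):
    # Stand-in for the measured KPI: deeper trees reduce cancellations a bit, with noise.
    # Nobody inside the loop knows or needs to know this function.
    return 0.05 - 0.002 * phone_tree_depth + random.gauss(0, 0.003)

depth = 1
best = measured_cancellation_rate(depth)
for month in range(24):
    candidate = max(1, depth + random.choice([-1, 1]))  # someone proposes a tweak
    measured = measured_cancellation_rate(candidate)
    if measured < best:          # keep changes that appear to lower cancellations
        depth, best = candidate, measured
    # else: revert the change, per step 5

print(f"after 2 years the phone tree is {depth} levels deep")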

As a side note, I don't think that prediction markets would actually improve the operation of most corporations by very much, relative to the current dominant approach of A/B testing. I can expand on why I think that if that's a crux for you.

TL;DR: I think corporations are far better than individual humans at steering the world into states that look like "the corporation controls lots of resources", but worse at steering the world into arbitrary states.

Comment by faul_sname on "Aligned" foundation models don't imply aligned systems · 2023-04-19T16:55:05.050Z · LW · GW

Would you say that a parallel phenomenon that does not involve AI would be "a company may exhibit behavior that is not in the interests of its customers, even if every individual employee of that company is doing their job in a way that they believe is in the interests of the customer"?

My impression is that the original reason to avoid the "superhuman AI :: corporation" analogy was that the corporation is built out of humans, who are at least somewhat aligned with other humans, while the superhuman AI would not be. However, if we stipulate that the foundation model(s) are aligned with humans, but that the structure containing them may not be, the parallels seem much stronger to me.