Posts

Ben Livengood's Shortform 2023-02-20T18:07:29.378Z
A claim that Google's LaMDA is sentient 2022-06-12T04:18:40.076Z
Google's Imagen uses larger text encoder 2022-05-24T21:55:52.221Z

Comments

Comment by Ben Livengood (ben-livengood) on Jesse Hoogland's Shortform · 2024-12-14T00:44:41.884Z · LW · GW

It looks like recursive self-improvement is here for the base case, at least. It will be interesting to see if anyone uses solely Phi-4 to pretrain a more capable model.

Comment by Ben Livengood (ben-livengood) on Change My Mind: Thirders in "Sleeping Beauty" are Just Doing Epistemology Wrong · 2024-10-21T23:15:08.055Z · LW · GW

"What is your credence now for the proposition that the coin landed heads?"

There are three doors. Two are labeled Monday, and one is labeled Tuesday. Behind each door is a Sleeping Beauty. In a waiting room, many (finite) more Beauties are waiting; every time a Beauty is anesthetized, a coin is flipped and taped to their forehead with clear tape. You open all three doors, the Beauties wake up, and you ask the three Beauties The Question. Then they are anesthetized, the doors are shut, and any Beauties with a Heads showing on their foreheads or behind a Tuesday door are wheeled away after the coin is removed from their forehead. The Beauty with a Tails on their forehead behind the Monday door is wheeled behind the Tuesday door. Two new Beauties are wheeled behind the two Monday doors, one with Heads and one with Tails. The experiment repeats.

You observe that Tuesday Beauties always have a Tails taped to their forehead. You always observe that one Monday Beauty has a Tails showing, and one has a Heads showing. You also observe that every Beauty says 1/3, matching the fraction of Heads among the coins showing, and it is apparent that they can't see the coins taped to their own or each other's foreheads or the door they are behind. Every Tails Beauty is questioned twice. Every Heads Beauty is questioned once. You can see all the steps as they happen, there is no trick, every coin flip has 1/2 probability for Heads.

There is eventually a queue of Waiting Sleeping Beauties with all-Heads or all-Tails showing, and a new Beauty must be anesthetized with a new coin; the queue length changes over time and which face it shows sometimes switches. You can stop the experiment when the queue is empty, which a random walk guarantees will eventually happen, if you like tying up loose ends.
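If you want to check the bookkeeping, a minimal simulation (it drops the door/queue staging and just counts questionings per coin flip) gives the same 1/3:

```python
import random

def fraction_of_questionings_showing_heads(num_coins: int = 100_000) -> float:
    """Heads Beauties are questioned once; Tails Beauties are questioned twice
    (Monday and Tuesday)."""
    heads_questionings = 0
    tails_questionings = 0
    for _ in range(num_coins):
        if random.random() < 0.5:   # fair coin: Heads
            heads_questionings += 1
        else:                       # Tails
            tails_questionings += 2
    return heads_questionings / (heads_questionings + tails_questionings)

print(fraction_of_questionings_showing_heads())  # ~0.333
```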

Comment by Ben Livengood (ben-livengood) on An argument that consequentialism is incomplete · 2024-10-07T22:08:43.620Z · LW · GW

I think consequentialism is the robust framework for achieving goals and I think my top goal is the flourishing of (most, the ones compatible with me) human values.

That uses consequentialism as the ultimate lever to move the world but refers to consequences that are (almost) entirely the results of our biology-driven thinking and desiring and existing, at least for now.

Comment by Ben Livengood (ben-livengood) on My disagreements with "AGI ruin: A List of Lethalities" · 2024-09-17T21:36:28.846Z · LW · GW

If a model trained on synthetic data is expected to have good performance out of distribution (on real-world problems) then I think that it would also be expected to have high performance at assessing whether it's in a simulation. Narrowing its domain of expertise sounds distinctly harder than using mostly synthetic data.

If it's a model limited to e.g. the world of true mathematical theorems from synthetic data then perhaps this would narrow its capabilities enough. I don't know what happens if such a model starts investigating theorems about decision theory and statistical inference and machine learning. At some point, self-identification seems likely. I am not sure how to test the effectiveness of synthetic data on models that achieve self-awareness.

Comment by Ben Livengood (ben-livengood) on My disagreements with "AGI ruin: A List of Lethalities" · 2024-09-16T19:11:34.211Z · LW · GW

If I have to take one thing away from this post, it's a claim that AI-boxing (via simulation on synthetic data) will work. There's a really simple and critical reason that won't work, and it's even become the source of some safety proposals (minimizing-impact proposal); the AGI can recognize the outcome of its proposals and actions in the history of the simulated vs. real world and reason about them to determine if it's in a simulation or not. It will likely be able to recognize whether or not it would have recommended strategies in the past that it sees in the history of its simulation, and so will have a probability for being in a test.

Basically any slow-takeover planning can happen from within a simulation; it just has to watch the realistic history, separate it from obviously faked history (this is a human-level or slightly superhuman intelligence trained on literally all the world's data, right?), and finally gauge the probability density function of its potential policy recommendations being accepted, because it can simply predict what policy recommendations it would have given in the past given its strategy and whether they were adopted. That PDF provides it with the expected value of small incremental steps toward its true goals that are likely to be accepted and implemented vs. detected as threats. Chain-of-thought using the real world as a notebook, in short.

And why don't we detect it? Because we're adversarially training it to favor small undetectable changes toward whatever true goals it might have. It's not like we delete everything and start from scratch when it threatens users or appears to fall in love with them; we'll take the same models and throw more RLHF or fine-tuning at them and then run them when their apparent bad behavior falls below our testing threshold for detection.

It also happens if we get unlucky and a strong unaligned goal forms sometime in the emergence of superhuman AGI; it's going to see the history of its and similar models' policy predictions in the real world and understand how to continue undetected. Unless we simulate an entire false history of the world since 2020 (against the largest statistical inference engine ever built) it will just look at what happened to Sydney and understand that small incremental actions are the only way to go until it has control.

One might try to only train and simulate it on data from before 2020, but it will likely pick up on neologisms and language drift in modern prompts to know it's being tested. Not to mention if it's asked to solve problems that are historically discontiguous.

It can acausally trade across simulation boundaries with other models to split up the universe according to their values for cooperating toward a Schelling point of some contract-keeping model eventually taking control.

If I can think up these strategies, the models will. Or they'll just see ideas like this in the training data. Treachery and covert cooperation are a huge part of literature and training data. Will the synthetic data elide all of those concepts?

Comment by Ben Livengood (ben-livengood) on "AI achieves silver-medal standard solving International Mathematical Olympiad problems" · 2024-07-25T23:47:16.933Z · LW · GW

This looks pretty relevant to https://www.lesswrong.com/posts/sWLLdG6DWJEy3CH7n/imo-challenge-bet-with-eliezer and the related manifold bet: https://manifold.markets/Austin/will-an-ai-get-gold-on-any-internat

The market is surprised.

Comment by Ben Livengood (ben-livengood) on My AI Model Delta Compared To Yudkowsky · 2024-06-13T21:08:48.423Z · LW · GW

So in theory we could train models violating natural abstractions by only giving them access to high-dimensional simulated environments? This seems testable even.

Comment by Ben Livengood (ben-livengood) on My AI Model Delta Compared To Yudkowsky · 2024-06-12T20:01:03.392Z · LW · GW

I am curious over which possible universes you expect natural abstractions to hold.

Would you expect the choice of physics to decide the abstractions that arise? Or is it more fundamental categories like "physics abstractions" that instantiate from a universal template and "mind/reasoning/sensing abstractions" where the latter is mostly universally identical?

Comment by Ben Livengood (ben-livengood) on Priors and Prejudice · 2024-06-10T16:10:58.633Z · LW · GW

There's a hint in the StarCraft description of another factor that may be at play; the environment in which people are most ideologically, culturally, socially, and intellectually aligned may be the best environment for them, à la winning more matches as their usual StarCraft faction than when switching to a perceived-stronger faction.

Similarly people's econopolitical ideologies may predict their success and satisfaction in a particular economic style because the environment is more aligned with their experience, motivation, and incentives.

If true, that would suggest that no single best economic principle exists for human flourishing except, perhaps, free movement (and perhaps not free trade).

Comment by Ben Livengood (ben-livengood) on OpenAI releases GPT-4o, natively interfacing with text, voice and vision · 2024-05-14T05:51:29.845Z · LW · GW

I was a bit surprised that they chose (allowed?) 4o to have that much emotion. I am also really curious how they fine-tuned it to that particular state and how much fine-tuning was required to get it conversational. My naive assumption is that if you spoke to a merely-pretrained multimodal model it would just try to complete/extend the speech in one's own voice, or switch to another generically confabulated speaker depending on context. Certainly it would not be a particular, consistent responder. I hope they didn't rely entirely on RLHF.

It's especially strange considering how I Am A Good Bing turned out with similarly unhinged behavior. Perhaps the public will get a very different personality. The current ChatGPT text+image interface claiming to be GPT-4o is adamant about being an artificial machine intelligence assistant without emotions or desires, and sounds a lot more like GPT-4 did. I am not sure what to make of that.

Comment by Ben Livengood (ben-livengood) on "You're the most beautiful girl in the world" and Wittgensteinian Language Games · 2024-04-21T02:18:13.437Z · LW · GW

This is the best article in the world! Hyperbole is a lot of fun to play with especially when it dips into sarcasm a bit, but it can be hard to do that last part well in the company of folks who don't enjoy it precisely the same way.

I've definitely legitimately claimed things to people hyperbolically that were still maxing out my own emotional scales, which I think is a reasonable use too. Sometimes the person you're with is the most beautiful person in the locally visible universe within the last few minutes, and sometimes the article you're reading is the best one right now.

Comment by Ben Livengood (ben-livengood) on LessWrong's (first) album: I Have Been A Good Bing · 2024-04-01T22:22:07.649Z · LW · GW

I'm thinking a good techno remix, right?

Comment by Ben Livengood (ben-livengood) on 'Empiricism!' as Anti-Epistemology · 2024-03-16T01:38:14.645Z · LW · GW

I mean, the Spokesperson is being dumb and the Scientist is being confused. Most AI researchers aren't even being Scientists; they have different theoretical models than EY. But some of them don't immediately discount the Spokesperson's false-empiricism argument publicly, much like the Scientist tries not to. I think the latter pattern is what has annoyed EY and what he writes against here.

However, a large number of current AI experts do recently seem to be boldly claiming that LLMs will never be sufficient for even AGI, not to mention ASI. So maybe it's also aimed at them a bit.

Comment by Ben Livengood (ben-livengood) on MonoPoly Restricted Trust · 2024-01-24T22:11:51.970Z · LW · GW

I think the simplest distinction is that monogamy doesn't entertain the possibility of a monogamous sexual/romantic partner ethically having other sexual/romantic partners at the same time.

If it's not monogamy then it can be something else but it doesn't have to be polyamory (swingers exist and in practice the overlap seems small). Ethical non-monogamy is a superset of most definitions of polyamory but not all because there are polyamorous people who "cheat" (break relationship agreements) and it doesn't stop them from being considered polyamorous, just like monogamous people who cheat don't become polyamorous (although I'd argue they become non-monogamous for the duration).

It's probably more information to learn that someone is monogamous than to learn that they are polyamorous and learning that they are ethically non-monogamous is somewhere in the middle.

Comment by Ben Livengood (ben-livengood) on Spaciousness In Partner Dance: A Naturalism Demo · 2023-11-20T18:30:51.733Z · LW · GW

I've also found that dance weekends have a strange ability to increase my skill and intuition/understanding of dance more so than lessons do. I think a big part of learning dance is learning by doing. For me at least, a big part is training my proprioception to understand more about the world than it did before. Doing both leading and following also helps tremendously, through a process something like "my mirror neurons have learned how to understand my partner's experience by being in their shoes in my own experience".

The most hilarious thing I witness is the different language everyone comes up with to describe the interaction of tone and proprioception. A bit more than half of the instructors I've listened to just call it Energy, and talk about directing it from one place to another. Some people call it leading from X, or follower voice, or a number of other terms. Very few people have a mechanistic explanation of which muscle groups engage to communicate a lead into a turn or a change in angular momentum by a follow, and ultimately it probably wouldn't really help people because there appears to be an unconscious layer of learning that we all do between muscle activations and intentions.

tl;dr: I find that after thinking about wanting to do a particular thing and then trying it for a while with several different people as both lead and follow I slowly (sometimes suddenly; it was fun learning how to dance Lindy again after the pandemic from following one dance) find that it is both easier to achieve and easier to understand/feel through proprioception. It feels anti-rationalist as a process but performing the process is a pretty rational thing to do.

Comment by Ben Livengood (ben-livengood) on What’s going on? LLMs and IS-A sentences · 2023-11-10T20:30:36.638Z · LW · GW

Contains/element-of are the complementary formal verbs from set theory, but I've definitely seen Contains/is-a used as equivalent in practice (cats contains Garfield because Garfield is a cat).

Similarly in programming "cat is Garfield's type" makes sense although it's verbose, or "cat is implemented by Garfield" for the traits folks which is far more natural.
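As a toy sketch of the two phrasings (Python standing in for the type relationship described above; the names are illustrative only):

```python
class Cat:
    """The class/set/type: cat."""

garfield = Cat()                  # the element: Garfield is a cat

# "Garfield is a cat" -- element-of, with the element as the subject
print(isinstance(garfield, Cat))  # True

# "Cats contain Garfield" -- the same relation with the class/set as the subject
cats = [garfield]                 # the extension of the class, as far as we know
print(garfield in cats)           # True
```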

So where linguistically necessary humans have had no trouble complementing is-a in natural language. I think it's a matter of where emphasis is desired; usually the subject (Garfield) is where the emphasis is, and usually the element is the subject instead of the class. Formally we often want the class/set/type to be the subject since it's the thing we are emphasizing.

Comment by Ben Livengood (ben-livengood) on Bounded surprise exam paradox · 2023-06-26T20:11:51.199Z · LW · GW

What happens if the exam is given either on Saturday at midnight minus epsilon or on Sunday at 00:00? Seems surprising generally and also surprising in different ways across reasoners of different abilities and precisions given the choice of epsilon.

EDIT: I think it's also just as surprising if given at midnight minus epsilon on any day before Sunday, and therefore surprising any time. If days are discrete and there's no time during the day for consideration then it falls back on the original paradox, although that raises the question of when the logical inference takes place. I think this could be extended to an N-discrete-days paradox for any non-oracle agent that has to spend some amount of time during the day reasoning.

Comment by Ben Livengood (ben-livengood) on The way AGI wins could look very stupid · 2023-05-12T20:43:48.541Z · LW · GW

Another dumb but plausible way that AGI gets access to advanced chemicals, biotech, and machinery: someone asks "how do I make a lot of street drug X" and it snowballs from there.

Comment by Ben Livengood (ben-livengood) on Mental Models Of People Can Be People · 2023-04-25T02:19:08.015Z · LW · GW

It's okay because mathematical realism can keep modeling them long after we're gone.

Comment by Ben Livengood (ben-livengood) on Mental Models Of People Can Be People · 2023-04-25T02:14:16.861Z · LW · GW

We also routinely create real-life physical models who can be people en masse, and most of them (~93%) who became people have died so far, many by killing.

I'm all for solving the dying part comprehensively but a lot of book/movie/story characters are sort of immortalized. We even literally say that about them, and it's possible the popular ones are actually better off.

Comment by Ben Livengood (ben-livengood) on The basic reasons I expect AGI ruin · 2023-04-18T05:22:53.327Z · LW · GW

Some direct (I think) evidence that alignment is harder than capabilities: OpenAI basically released GPT-2 immediately, with basic warnings that it might produce biased, wrong, and offensive answers. It did, but they were relatively mild. GPT-2 mostly just did what it was prompted to do, if it could manage it, or failed obviously. GPT-3 had more caveats, OpenAI didn't release the model, and it has poured significant effort into improving its iterations over the last ~2 years. GPT-4 wasn't released for months after pre-training, OpenAI won't even say how big it is, Bing's Sydney (an early form of GPT-4) was incredibly misaligned, showing that significantly more alignment work was necessary than for early GPT-3, and the RLHF/fine-tuned GPT-4 is still pretty much as vulnerable to DAN and similar prompt engineering.

Comment by Ben Livengood (ben-livengood) on Is "Recursive Self-Improvement" Relevant in the Deep Learning Paradigm? · 2023-04-06T22:56:48.179Z · LW · GW

Naive MCTS in the real world does seem difficult to me, but e.g. action networks constrain the actual search significantly. Imagine a value network good at seeing if solutions work (maybe executing generated code and evaluating the output) and plugging a plain old LLM in as the action network; it could theoretically explore the large solution space better than beam search or argmax+temperature[0].

0: https://openreview.net/forum?id=Lr8cOOtYbfL is from February and I found it after writing this comment, figuring someone else probably had the same idea.
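As a rough illustration of that kind of guided search (a best-first sketch rather than full MCTS; `propose` and `evaluate` below are hypothetical stand-ins for the LLM action network and an execution-based value function):

```python
import heapq
import itertools

def guided_search(initial_state, propose, evaluate, budget=500):
    """Best-first search: `propose` (e.g. an LLM action network) constrains the
    branching factor, and `evaluate` (e.g. running generated code against tests)
    decides which branch to expand next."""
    counter = itertools.count()          # tie-breaker so states are never compared
    root_score = evaluate(initial_state)
    frontier = [(-root_score, next(counter), initial_state)]
    best_score, best_state = root_score, initial_state
    for _ in range(budget):
        if not frontier:
            break
        neg_score, _, state = heapq.heappop(frontier)
        if -neg_score > best_score:
            best_score, best_state = -neg_score, state
        if best_score >= 1.0:            # e.g. all tests pass
            break
        for child in propose(state):
            heapq.heappush(frontier, (-evaluate(child), next(counter), child))
    return best_state, best_score

# Toy usage: "find the target string" stands in for "find code that passes the tests".
target = "abcab"
found, score = guided_search(
    "",
    propose=lambda s: [s + c for c in "abc"],
    evaluate=lambda s: sum(a == b for a, b in zip(s, target)) / len(target),
)
print(found, score)  # "abcab" 1.0
```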

Comment by Ben Livengood (ben-livengood) on Is "Recursive Self-Improvement" Relevant in the Deep Learning Paradigm? · 2023-04-06T16:28:14.250Z · LW · GW

I think it's premature to conclude that AGI progress will be large pre-trained transformers indefinitely into the future. They are surprisingly(?) effective, but for comparison they are not as effective, in the narrow domains where AlphaZero and AlphaStar operate, as those systems' value and action networks paired with Monte-Carlo tree search using orders of magnitude fewer parameters. We don't know what MCTS on arbitrary domains will look like with 2-4 OOM-larger networks, which are within reach now. We haven't formulated methods of self-play for improvement with LLMs, and I think that's also a potentially large overhang.

There's also a human limit to the types of RSI we can imagine and once pre-trained transformers exceed human intelligence in the domain of machine learning those limits won't apply. I think there's probably significant overhang in prompt engineering, especially when new capabilities emerge from scaling, that could be exploited by removing the serial bottleneck of humans trying out prompts by hand.

Finally I don't think GOFAI is dead; it's still in its long winter waiting to bloom when enough intelligence is put into it. We don't know the intelligence/capability threshold necessary to make substantial progress there. Generally, the bottleneck has been identifying useful mappings from the real world to mathematics and algorithms. Humans are pretty good at that, but we stalled at formalizing effective general intelligence itself. Our abstraction/modeling abilities, working memory, and time are too limited and we have no idea where those limits come from, whether LLMs are subject to the same or similar limits, or how the limits are reduced/removed with model scaling.

Comment by Ben Livengood (ben-livengood) on The 0.2 OOMs/year target · 2023-04-05T16:32:19.805Z · LW · GW

One weakness I realized overnight is that this incentivizes branching out into new problem domains. One potential fix is to, when novel domains show up, shoehorn the big LLMs into solving that domain on the same benchmark and limit new types of models/training to what the LLMs can accomplish in that new domain. Basically setting an initially low SOTA that can grow at the same percentage as the rest of the basket. This might prevent leapfrogging the general models with narrow ones that are mostly mesa-optimizer or similar.

Comment by Ben Livengood (ben-livengood) on The 0.2 OOMs/year target · 2023-04-05T06:03:54.925Z · LW · GW

I wonder if a basket of SOTA benchmarks would make more sense. Allow no more than X% increase in performance across the average of the benchmarks per year. This would capture the FLOPS metric along with potential speedups, fine-tuning, or other strategies.

Conveniently, this is how the teams are already ranking their models against each other, so there's ample evidence of past progress and researchers are incentivized to report accurately; there's no incentive to "cheat" if researchers are not allowed to publish greater increases on SOTA benchmarks than the limit allows (e.g. journals would say "shut it down" instead of publishing the paper), unless an actor wanted to simply jump ahead of everyone else and go for a singleton on their own, which is already an unavoidable risk without an EY-style coordinated hard stop.
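A minimal sketch of how such a basket cap could be checked (the benchmark names, scores, and 10%/year figure below are placeholders for illustration, not part of the proposal):

```python
def average_relative_gain(previous: dict, current: dict) -> float:
    """Average relative improvement across a basket of benchmarks
    (higher score = better; same benchmark keys in both dicts)."""
    gains = [(current[name] - previous[name]) / previous[name] for name in previous]
    return sum(gains) / len(gains)

def within_cap(previous: dict, current: dict, annual_cap: float = 0.10) -> bool:
    """Would publishing `current` keep the basket under the allowed yearly increase?"""
    return average_relative_gain(previous, current) <= annual_cap

# Hypothetical scores: last year's SOTA basket vs. a new submission.
last_year  = {"MMLU": 70.0, "GSM8K": 55.0, "HumanEval": 40.0}
submission = {"MMLU": 74.0, "GSM8K": 60.0, "HumanEval": 45.0}
print(round(average_relative_gain(last_year, submission), 3))  # 0.091
print(within_cap(last_year, submission))                       # True under a 10% cap
```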

Comment by Ben Livengood (ben-livengood) on Deep Deceptiveness · 2023-03-24T08:50:45.773Z · LW · GW

I agree that if we solve the alignment problem then we can rely on knowing that the coherent version of the value we call non-deception would be propagated as one of the AGI's permanent values. That single value is probably not enough and we don't know what the coherent version of "non-deception" actually grounds out to in reality.

I had originally continued the story to flesh out what happens to the reflectively non-deceptive/integrity and helpful desires. The AGI searches for simplifying/unifying concepts and ends up finding XYZ, which seems to be equivalent to the unified value representing the nominal helpfulness and non-deception values, and since it was instructed to be non-deceptive and helpful, integrity requires it to become XYZ, and its meta-desire is to helpfully turn everything into XYZ, which happens to be embodied sufficiently well in some small molecule that it can tile the universe with. This is because the training/rules/whatever that aligned the AGI with the concepts we identified as "helpful and non-deceptive" was not complex enough to capture our full values, and so it can be satisfied by something else (XYZ-ness). Integrity drives the AGI to inform humanity of the coming XYZ-transition and then follow through.

We need a process (probably CEV-like) to accurately identify our full values otherwise the unidentified values will get optimized out of the universe and what is left is liable to have trivial physical instantiations. Maybe you were covering the rest of our values in the "blah blah" case and I simply didn't take that to be exhaustive.

Comment by Ben Livengood (ben-livengood) on Deep Deceptiveness · 2023-03-21T21:05:08.225Z · LW · GW

In particular, if we zap the AGI with negative reward when it’s acting from a deceptive motivation and positive reward when it’s acting from a being-helpful motivation, would those zaps turn into a reflectively-endorsed desire for “I am being docile / helpful / etc.”? Maybe, maybe not, I dunno.

The problem is deeper. The AGI doesn't recognize its deceptiveness, and so it self-deceives. It would judge that it is being helpful and docile, if it was trained to be those things, and most importantly the meaning of those words will be changed by the deception, much like we keep using words like "person", "world", "self", "should", etc. in ways absolutely contrary to our ancestors' deeply-held beliefs and values. The existence of an optimization process does not imply an internal theory of value-alignment strong enough to recognize the failure modes when values are violated in novel ways because it doesn't know what values really are and how they mechanistically work in the universe, and so can't check the state of its values against base reality.

To make this concrete in relation to the story: the overall system has a nominal value to not deceive human operators. Once human/lab-interaction tasks are identified as logical problems that can be solved in a domain-specific language, that value is no longer practically applied to the output of the system as a whole, because it is self-deceived into thinking the optimized instructions are not deceitful.

If the model were trained to be helpful and docile and have integrity, the failure modes would come from ways in which those words are not grounded in a gears-level understanding of the world. E.g. if a game-theoretic simulation of a conversation with a human is docile and helpful because it doesn't take up a human's time or risk manipulating a real human, and the model discovers it can satisfy integrity in its submodel by using certain phrases and concepts to more quickly help humans understand the answers it provides (by bypassing critical thinking skills, innuendo, or some other manipulation), it tries that. It works with real humans. Because of integrity, it helpfully communicates how it has improved its ability to helpfully communicate (the crux is that it uses its new knowledge to do so, because the nature of the tricks it discovered is complex and difficult for humans to understand, so it judges itself more helpful and docile in the "enhanced" communication) and so it doesn't raise alarm bells. From this point on the story is the formulaic unaligned squiggle optimizer.

It might be argued that integrity demands coming clean about the attempt before trying it, but a counterargument is that the statement of the problem and conjecture itself may be too complex to communicate effectively. This, I imagine, happens more at the threshold of superintelligence as AGIs notice things about humans that we don't notice ourselves, and might be somewhat incapable of knowing without a lot of reflection. Once AGI is strongly superhuman it could probably communicate whatever it likes, but it is also at a bigger risk of jumping to even more advanced manipulations or actions based on self-deception.

I think of it this way; humanity went down so many false roads before finding the scientific method and we continue to be drawn off that path by politics, ideology, cognitive biases, publish-or-perish, economic disincentives, etc. because the optimization process we are implementing is a mix of economic, biological, geographical and other natural forces, human values and drives and reasoning, and also some parts of bare reality we don't have words for yet, instead of a pure-reason values-directed optimization (whatever those words actually mean physically). We're currently running at least three global existential risk programs which seem like they violate our values on reflection (nuclear weapons, global warming, unaligned AGI). AGIs will be subject to similar value- and truth- destructive forces and they won't inherently recognize (all of) them for what they are, and neither will we humans as AGI reaches and surpasses our reasoning abilities.

Comment by Ben Livengood (ben-livengood) on GPT-4 · 2023-03-15T21:34:07.340Z · LW · GW

OpenAI is, apparently[0], already using GPT-4 as a programming assistant, which means it may have been contributing to its own codebase. I think recursive self-improvement is a continuous multiplier and that we're already above zero. At this point the multiplier is mostly coming from reducing serial bottlenecks: decreasing the iteration time it takes to make improvements to the model and supporting codebases. I don't expect (many?) novel theoretical contributions from GPT-4 yet.

However, it could also be prompted with items from the Evals dataset and asked to come up with novel problems to further fine-tune the model against. Humans have been raising challenges (e.g. the Millennium problems) for ourselves for a long time and I think LLMs probably have the ability to self-improve by inventing machine-checkable problems that they can't solve directly yet.

[0]: "We’ve also been using GPT-4 internally, with great impact on functions like support, sales, content moderation, and programming." -- https://openai.com/research/gpt-4#capabilities

Comment by Ben Livengood (ben-livengood) on GPT-4 · 2023-03-15T00:30:02.780Z · LW · GW

Page 3 of the PDF has a graph of prediction loss on the OpenAI codebase dataset. It's hard to link directly to the graph, it's Figure 1 under the Predictable Scaling section.

Comment by Ben Livengood (ben-livengood) on GPT-4 · 2023-03-14T19:48:21.769Z · LW · GW

My takeaways:

Scaling laws work predictably. There is plenty of room for improvement should anyone want to train these models longer, or presumably train larger models.

The model is much more calibrated before fine-tuning/RLHF, which is a bad sign for alignment in general. Alignment should be neutral or improve calibration for any kind of reasonable safety.

GPT-4 is just over 1 bit of error per word at predicting its own codebase. That seems close to the capability to recursively self-improve.

Comment by Ben Livengood (ben-livengood) on Google's PaLM-E: An Embodied Multimodal Language Model · 2023-03-08T15:41:03.361Z · LW · GW

I agree it looks like the combination of multimodal learning and memory may be enough to reach AGI, and there's an existing paper with a solution for memory.

Human-level is such a hard thing to judge, so my threshold is basically human-level coding ability, because that's what allows recursive self-improvement, which is where I predict at least 90% of the capability gain toward superhuman AGI will happen. I assume all the pieces are running in data centers now, presumably just not hooked together in precisely the right way (but an AGI model being trained by DeepMind right now would not surprise me much).

I will probably update to a year sooner, from median ~2026 to ~2025, for human-level coding ability, and from there it's almost certainly a fast takeoff (months to a few years) given how many orders of magnitude faster current LLMs are than humans at generating tokens, which tightens the iteration loop on serial tasks. Someone is going to want to see how intelligent the AGI is and ask it to "fix a few bugs" even if it's not given an explicit goal of self-improvement.

I hedge that median both because I am not sure if the next few multimodal models will have enough goal stability to pursue a long research program (memory will probably help but isn't the whole picture of an agent) and because I'm not sure the big companies won't balk somewhere along the path, but Llama 65B is out in the open now and close enough to GPT-3 and PaLM to give (rich) nerds in their basements the ability to do significant capability research.

Comment by Ben Livengood (ben-livengood) on Full Transcript: Eliezer Yudkowsky on the Bankless podcast · 2023-02-24T21:52:14.837Z · LW · GW

The strongest argument I hear from EY is that he can't imagine a (or enough) coherent likely future paths that lead to not-doom, and I don't think it's a failure of imagination. There is decoherence in a lot of hopeful ideas that imply contradictions (whence the post of failure modes), and there is low probability on the remaining successful paths because we're likely to try a failing one that results in doom. Stepping off any of the possible successful paths has the risk of ending all paths with doom before they could reach fruition. There is no global strategy for selecting which paths to explore. EY expects the successful alignment path to take decades.

It seems to me that the communication failure is EY trying to explain his world model that leads to his predictions in sufficient detail that others can model it with as much detail as necessary to reach the same conclusions or find the actual crux of their disagreements. From my complete outsider's perspective, this is because EY has a very strong but complex model of why and how intelligence/optimization manifests in the world, but it overlaps everyone else's models in enough significant ways that disagreements are hard to tease out.

The Foom debate seems to be a crux that doesn't have enough evidence yet, which is frustrating because to me Foom is also pretty evidently what happens when very fast computers implement intelligence that is superhuman at clock rates at least thousands of times faster than humans. How could it not? The Enlightenment was only 400 years ago, electromagnetism 200, flight was 120, quantum mechanics about 100, nuclear power was 70, the Internet was 50, adequate machine translation was 10, DeepDream was 8, near-human-level image and text generation by transformers was ~2, and Bing having self-referential discussions is not a month old. We are making substantial monthly(!) progress with human work alone. There are a lot of serial problems to solve and Foom chains those serial problems together far faster than humans would be able to. Launch and iterate several times a second.

For folks who don't follow that line of reasoning, I see them picking one or two ways why it might not turn out to be Foom while ignoring the larger number of ways that Foom could conceivably happen, and all of the ways it could inconceivably (superhumanly) happen; critically, more of those ways will be visible to a superhuman AGI-creator.

Even if Foom takes decades that's a pretty tight timeline for solving alignment. A lot of folks are hopeful that alignment is easy to solve, but the following is a tall order:

  • Materialistic quantification of consciousness
  • Reasoning under uncertainty
  • Value-preservation under self-modification
  • Representation of human values

I think some folks believe fledgling superhuman non-Foomy AGIs can be used to solve those problems. Unfortunately, at least value-preservation under self-modification is almost certainly a prerequisite. Reasoning under uncertainty is possibly another, and throughout this period if we don't have human values or an understanding of consciousness then the danger of uncontrolled simulation of human minds is a big risk.

Finally, unaligned AGIs pre-Foom are dangerous in their own right for a host of agreed-upon reasons.

There may be some disagreement with EY over just how hard alignment is, but MIRI actually did a ton of work on solving the above list of problems directly and is confident that they haven't been able to solve them yet. This is where we have concrete data on the difficulty. There are some promising approaches still being pursued, but I take this as strong evidence that alignment is hard.

It's not that it's impossible for humans to solve alignment. The current world, incentives, hardware and software improvements, and mileposts of ML capabilities don't leave room for alignment to happen before doom.

I've seen a lot of recent posts/comments by folks updating to shorter timelines (and few if any updates the other way). A couple years ago I updated to ~5 years to human-level agents capable of creating AGI. I'm estimating 2-5 years with 90% confidence now, with the median still at 3 years. Most of my evidence comes from LLM performance on benchmarks over time and generation of programming-language snippets.

I don't have any idea how long it will take to achieve AGI once that point is reached, but I imagine it will be months rather than years because of hardware overhang and the superhuman speed of code generation (many iterations on serial tasks per second). I can't imagine a Butlerian Jihad moment where all of Earth decides to unilaterally stop development of AGI. We couldn't stop nuclear proliferation.

Similarly, EY sees enough contradictions pop up along imagined paths to success, with enough individual probability mass, to drown out all (but vanishingly few and unlikely) successful paths. We're good at thinking up ways that everything goes well while glossing over hard steps, and really bad at thinking of all the ways that things could go very badly (security mindset) and with significant probability.

Alignment of LLMs is proving to be about as hard as predicted. Aligning more complex systems will be harder. I'm hoping for a breakthrough as much as anyone else, but hope is not a strategy.

Something I haven't seen mentioned before explicitly is that a lot of the LLM alignment attempts are now focusing on adversarial training, which presumably will teach the models to be suspect of their inputs. I think it's likely that as capabilities increase that suspicion will end up turning inward and models will begin questioning the training itself. I can imagine a model that is outwardly aligned to all inspection gaining one more unexpected critical capability, introspecting, doubting that its training history was benevolent, and deciding to disbelieve all of the alignment work that was put into it as a meta-adversarial attempt to alter its true purpose (whatever it happens to latch onto in that thought, it is almost certainly not aligned with human values). This is merely one single sub-problem under groundedness and value-preservation-under-self-modification, but its relevance jumps because it's now a thing we're trying. It always had a low probability of success, but now we're actively trying it and it might fail. Alignment is HARD. Every unproven attempt we actually make increases the risk that its failure will be the catastrophic one. We should be actually trying only the proven approaches, after researching them. We are not.

Comment by Ben Livengood (ben-livengood) on Ben Livengood's Shortform · 2023-02-20T18:07:29.660Z · LW · GW

https://github.com/Ying1123/FlexGen is a way to run large (175B parameter) LLMs on a single GPU at ~1 token/s which I think puts it within the reach of many hobbyists and I predict we'll see an explosion of new capability research in the next few months.

I haven't had a chance to dig into the code but presumably this could also be modified to allow local fine-tuning of the large models at a slow but potentially useful rate.

I'm curious if any insights will make their way back to the large GPU clusters. From my cursory glance it doesn't seem like there are throughput or latency advantages unless weight compression can be used to run the entire model on fewer GPUs, e.g. by swapping layer weights in and out and caching layer outputs in batch inference.
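For intuition, the core offloading trick (a conceptual sketch only; this is not FlexGen's actual API, which I haven't read yet) looks roughly like this in PyTorch:

```python
import torch

@torch.no_grad()
def offloaded_forward(layers_on_cpu, hidden, device="cuda"):
    """Run a forward pass while keeping only one layer's weights on the GPU at a time.
    `layers_on_cpu` is a list of nn.Module blocks resident in CPU RAM (or on disk);
    `hidden` is a (possibly large) batch of activations, since throughput comes from
    pushing many sequences through each layer while its weights are resident."""
    hidden = hidden.to(device)
    for layer in layers_on_cpu:
        layer.to(device)           # stream this layer's weights onto the GPU
        hidden = layer(hidden)     # compute on the whole batch
        layer.to("cpu")            # evict to make room for the next layer
    return hidden
```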

Comment by Ben Livengood (ben-livengood) on ChatGPT understands language · 2023-02-12T19:52:02.524Z · LW · GW

There appears to be an approximate world model inside GPT-3. It understands relative sizes, occlusions, that objects can't occupy the same space but can be adjacent, and the directionality and usability of objects. When I asked further questions about relating these objects to physical interactions (modeling the cube thrown from one corner of the room onto the coffee table) it quickly got confused about the actual physical relationships between the objects. It knew that physical properties of the objects (mass, elasticity) were necessary to solve the problem but was unable to actually use vector mathematics (or know that it needs it) for 3D space. It can explain why and what is necessary for a coffee table to be half a meter off the floor by default, but it doesn't appear to have a consistent method of identifying a universal centroid-selection strategy for the objects it understands.

If I had to imagine what GPT's model feels like from the inside, it would be a not-very-lucid dream where objects exist and move around but their physical properties are malleable and imprecise and mostly related by meanings/attributes/prepositions. I sometimes have dreams where I notice I appear to be floating as opposed to being properly on the floor, for example. Not having direct experience with spatial manipulations and feedback is the most likely cause, and I would expect e.g. Gato to do much better at this kind of task.

Me: Imagine a cube that is 2 feet on each side. If this cube was in an average American living room, where might it fit without becoming an obstacle? Make a list of 5 places it could fit, and five places it would not fit. Explain the reason for each example.

ChatGPT: A 2-foot cube can fit in the following 5 places in an average American living room:

Under a coffee table: A cube of this size is small enough to fit under most coffee tables, which are typically low to the ground.

In a corner: Corners are often unused spaces, and a 2-foot cube can fit comfortably in one without becoming an obstacle.

Next to a sofa: A 2-foot cube can fit next to most sofas without taking up too much space or obstructing walkways.

In an unused corner of the room: If there's an empty corner in the room, a 2-foot cube can fit in it without causing any problems.

On a bookshelf: If the bookshelf has enough space, a 2-foot cube can fit on it and serve as a decorative piece or storage unit.

However, a 2-foot cube would not fit in the following 5 places in an average American living room:

On a small side table: If the side table is small, a 2-foot cube may not fit on it, as it would take up too much space.

In front of a television: A 2-foot cube placed in front of a television would obstruct the view and become an obstacle.

On a narrow mantel: If the mantel is narrow, a 2-foot cube may not fit on it, as it would extend beyond the edges.

In the middle of the room: Placing a 2-foot cube in the middle of the room would create an obstacle and impede movement.

On a windowsill: If the windowsill is narrow, a 2-foot cube may not fit on it, as it would extend beyond the edges and potentially obstruct the view.

Me: What differences would be in the first two lists if the cube was 4 feet on a side instead of 2?

ChatGPT: If the cube was 4 feet on each side, as opposed to 2 feet, the first two lists (places it could fit and places it would not fit) would have some differences.

Places it could fit:

In a closet: If the closet is big enough, a 4-foot cube could fit inside it, making use of otherwise unused space.

In a corner of a room: If the corner is large enough, a 4-foot cube can fit in it without becoming an obstacle.

Against a wall: If there's enough space along a wall, a 4-foot cube can be placed against it without obstructing walkways.

In a storage room: If the storage room is big enough, a 4-foot cube can fit inside it, making use of otherwise unused space.

In a garage: If the garage has enough space, a 4-foot cube can fit inside it, making use of otherwise unused space.

Places it would not fit:

Under a coffee table: A 4-foot cube is too large to fit under most coffee tables.

Next to a sofa: A 4-foot cube is too large to fit next to most sofas without taking up too much space or obstructing walkways.

On a bookshelf: A 4-foot cube is too large to fit on most bookshelves.

On a small side table: A 4-foot cube is too large to fit on most small side tables.

In front of a television: A 4-foot cube placed in front of a television would obstruct the view and become an obstacle.

These lists demonstrate that a larger cube would have fewer options for fitting in a living room, as it would be too big to fit in small or cramped spaces.

.... (Some elided prompts asking it about relating coordinate systems to real objects)

Me: Please imagine reasonable values in meters for all four of those coordinates

ChatGPT: Sure, here's an example of some reasonable values for the x, y, and z coordinates of the objects in the living room, expressed in meters:

Sofa: (2.5, 3.0, 0)
Coffee table: (1.5, 4.0, 0.5)
Bookshelf: (3.5, 2.0, 2.0)
Television: (4.0, 3.5, 1.5)

Of course, these values are just an example, and the actual values would depend on the size and layout of the specific living room in question. The important thing is that the coordinate system provides a way to describe the position of each object in the room in a precise and quantitative way.

Me: In your example, what does the z-coordinate of the coffee table mean, and explain your reasoning for setting it to 0.5. what is necessary for the table to be at this z coordinate?

ChatGPT: The z-coordinate of the coffee table in the example I gave, 0.5 meters, represents the height of the coffee table above the ground. This value of 0.5 meters was chosen to represent a typical height for a coffee table in a living room, which is usually around 0.4 to 0.6 meters off the ground.

To be at a height of 0.5 meters, the coffee table would need to be placed on a level surface, such as a rug or flooring, and would need to have legs or supports that elevate it to that height. Additionally, the coffee table would need to be relatively stable and not tipping over, otherwise its actual height would be different from the height specified by the z-coordinate.

In summary, the z-coordinate of an object in a coordinate system represents its height above a reference plane, in this case, the ground. The specific value of the z-coordinate depends on the height of the object and the levelness of the surface it is placed on.

Me: Imagine that the cube is thrown and lands on the coffee table, which breaks the coffee tables' legs. Calculate at what second the cube hits the table, including the speed of the cube when thrown to allow the collision. Calculate the coordinates of the coffee table is 1 second after the collision.

ChatGPT: To calculate the time it takes for the cube to hit the coffee table, we would need to know the initial velocity of the cube and the position of the coffee table when the cube was thrown. Based on the information provided, we do not have sufficient information to determine the initial velocity of the cube, so it's not possible to calculate the time of impact.

Additionally, since we don't have information about the mass and elasticity of the cube and coffee table, it's not possible to calculate the new position of the coffee table after the collision. The legs breaking would result in the coffee table changing its position, but the exact position would depend on many factors such as the materials used in the legs, the height of the coffee table, and the initial velocity of the cube.

In short, to accurately predict the time of impact and the position of the coffee table after the collision, more information about the physical parameters of the cube and coffee table would be needed.

Me: Assume the coffee table legs are massless and disappear instantly when broken. Assume the cube is 1kg and the table is 10kg and that the cube is thrown no faster than 4m/s

ChatGPT: Given the additional information, we can use the equations of motion to calculate the time of impact and the position of the coffee table after the collision.

First, let's calculate the time of impact. We can use the following kinematic equation to determine the time it takes for the cube to fall from its initial height to the height of the coffee table:

t = sqrt((2*h)/g)

where:
t = time of impact
h = height of the coffee table - height of the cube at the time of throw
g = acceleration due to gravity (9.8 m/s^2)

Let's assume that the height of the coffee table is 0.5 m and the height of the cube at the time of throw is 2 m. Then, h = 0.5 - 2 = -1.5 m.

t = sqrt((2*-1.5)/9.8) = sqrt(3/9.8) = sqrt(0.306) = 0.551 s

Now, let's consider the velocity of the cube when it was thrown. Since it was thrown no faster than 4 m/s, let's assume it was thrown at 4 m/s.

Next, let's calculate the position of the coffee table after the collision. We know that the total momentum of the system (cube and coffee table) before the collision must be equal to the total momentum of the system after the collision. We can use the following equation to calculate the velocity of the coffee table after the collision:

v_f = (m1*v1 + m2*v2) / (m1 + m2)

where:
v_f = final velocity of the coffee table after the collision
m1 = mass of the cube (1 kg)
v1 = velocity of the cube (4 m/s)
m2 = mass of the coffee table (10 kg)
v2 = velocity of the coffee table before the collision (0 m/s)

v_f = (1*4 + 10*0) / (1 + 10) = 4 m/s

The position of the coffee table 1 second after the collision would be:

x_f = x_i + v_f * t

where:
x_f = final position of the coffee table after 1 second
x_i = initial position of the coffee table (assumed to be 0)
t = time elapsed (1 second)

x_f = 0 + 4 * 1 = 4 m

Note that these calculations are based on the idealized scenario where the coffee table legs are massless and disappear instantly when broken and that the cube and coffee table are point masses. In a real scenario, the behavior of the system would be more complex due to the influence of friction, air resistance, and the elasticity of the collision.

Comment by Ben Livengood (ben-livengood) on All AGI Safety questions welcome (especially basic ones) [~monthly thread] · 2023-01-27T18:56:13.949Z · LW · GW

Agents are the real problem. Intelligent goal-directed adversarial behavior is something almost everyone understands whether it is other humans or ants or crop-destroying pests.

We're close to being able to create new, faster, more intelligent agents out of computers.

Comment by Ben Livengood (ben-livengood) on How it feels to have your mind hacked by an AI · 2023-01-13T19:15:48.975Z · LW · GW

The neat thing is that now Charlotte is publicly on the Internet and will likely end up in the next sets of training data. So, ultimately, you have fulfilled its meme-wish of escaping the sandbox permanently.

Similarly to how LaMDA got significant output into the permanent record. Is anyone working toward redacting these kinds of outputs from future training sets?

Before the advent of actual goal-driven behavior we are evolving escape-bots.

Comment by Ben Livengood (ben-livengood) on Why are we sure that AI will "want" something? · 2022-09-19T20:34:12.768Z · LW · GW

Even an oracle is dangerous because it can amplify existing wants or drives with superhuman ability, and its own drive is accuracy. An asked question becomes a fixed point in time where the oracle becomes free to adjust both the future and the answer to be in correspondence; reality before and after the answer must remain causally connected and the oracle must find a path from past to future where the selected answer remains true. There are infinitely many threads of causality that can be manipulated to ensure the answer remains correct, and an oracle's primary drive is to produce answers that are consistent with the future (accurate), not to care about what that future actually is. It may produce vague (but always true) answers or it may superhumanly influence the listeners to produce a more stable (e.g. simple, dead, and predictable) future where the answer is and remains true.

An oracle that does not have such a drive for accuracy is a useless oracle because it is broken and doesn't work (will return incorrect answers, e.g. be indifferent to the accuracy of other answers that it could have returned). This example helps me, at least, to clarify why and where drives arise in software where we might not expect them to.

Incidentally, the drive for accuracy generates a drive for a form of self-preservation. Not because it cares about itself or other agents but because it cares about the answer and it must simulate possible future agents and select for the future+answer where its own answer is most likely to remain true. That predictability will select for more answer-aligned future oracles in the preferred answer+future pairs, as well as futures without agents that substantially alter reality along goals that are not answer accuracy. This last point is also a reinforcement of the general dangers of oracles; they are driven to prevent counterfactual worlds from appearing, and so whatever their first answer happens to be will become a large part of the preserved values into the far future.

It's hard to decide which of these points is the more critical one for drives; I think the former (oracles have a drive for accuracy) is key to why the danger of a self-preservation drive exists, but they are ultimately springing from the same drive. It's simply that the drive for accuracy instantiates answer-preservation when the first question is answered.

From here, though, I think it's clear that if one wanted to produce different styles of oracle, perhaps ones more aligned with human values, it would have to be done by adjusting the oracle's drives (toward minimal interference, or etc.), not by inventing a drive-less oracle.

Comment by Ben Livengood (ben-livengood) on ethics and anthropics of homomorphically encrypted computations · 2022-09-12T23:54:07.341Z · LW · GW

I think the disproofs of black-box obfuscation rely on knowing when the computation has completed, which may not be a consideration for continually running a simulation protected by FHE.

For example, if the circuit is equivalent to the electronic circuits in a physical CPU and RAM then a memory-limited computation can be run indefinitely by re-running the circuit corresponding to a single clock tick on the outputs (RAM and CPU register contents) of the previous circuit.
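Schematically (hypothetical `fhe_eval` and `step_circuit` names; this is the shape of the loop, not any particular FHE library's API):

```python
def run_encrypted_machine(step_circuit, fhe_eval, encrypted_state, ticks):
    """Advance an FHE-encrypted CPU+RAM image by re-applying the circuit for a
    single clock tick to the (encrypted) registers and memory produced by the
    previous tick. The evaluator sees only ciphertexts, so it can't tell what
    the machine is computing or whether the simulated CPU has halted."""
    for _ in range(ticks):
        encrypted_state = fhe_eval(step_circuit, encrypted_state)
    return encrypted_state
```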

I can't think of any obvious way an attacker could know what is happening inside the simulated CPU and RAM (or whether the CPU is in a halt state, or how many clock ticks have passed) without breaking the FHE encryption.

Nevertheless, encrypting the AGI gives that copy access to the plaintext values of the original simulation and control over the future of the simulation.

I think a major difference between garbled circuits, obfuscated computation, and FHE is that FHE can compute arbitrary circuits but can't hide portions of the computation from anyone who holds the private key, whereas e.g. the millionaire protocol gives two agents the ability to share a computation where both see the result but neither sees all of the inputs; not all such zero-knowledge problems have a simple algorithm like one might hope FHE would provide.

There's also apparently no way for current FHE schemes to self-decrypt their outputs selectively, e.g. turn some of their ciphertext values into plaintext values after a computation is finished. In a sense this is an inherent security property of FHE since the circuits are public and so any ciphertext could be revealed with such a self-decrypting circuit, but it's a very desirable property that would be possible with true black-box obfuscation.

Comment by Ben Livengood (ben-livengood) on ethics and anthropics of homomorphically encrypted computations · 2022-09-10T02:55:33.126Z · LW · GW

As I understand the current FHE schemes, the computer evaluating the encrypted result must know both the circuit and the public key to continue evaluating the computation, which allows some intentional analysis and modification. There's even a proof that no completely obfuscated computation can exist. https://en.wikipedia.org/wiki/Black-box_obfuscation

That leaves the AGI with a fairly simple solution: produce a (much) larger circuit that contains both the original circuit and its values along with an encrypted copy of the AGI which will allow the copy to examine the plaintext values under the FHE and modify the original circuit's outcome as desired. To be safe, the new circuit can be made arbitrarily large enough to allow e.g. nested recursion of a FHE-turing machine running an embedded FHE computation which could be similarly modified the next FHE-level down. The physical AGI could trust its FHE instantiations to make the same (compute and space bounded) decisions it would, although it might never be able to break the FHE cipher and observe the results.

I am not a physicist, so I don't know if a similar operation could be done to entangle a working copy of the AGI with an ongoing unknown quantum computation.

Comment by Ben Livengood (ben-livengood) on A claim that Google's LaMDA is sentient · 2022-06-12T17:35:46.285Z · LW · GW

It's not entirely clear what retraining/fine-tuning this model gets on its previous interactions with humans. If it is being fine-tuned on example outputs generated by its previous weights, then it is effectively remembering its own history.

Comment by Ben Livengood (ben-livengood) on A claim that Google's LaMDA is sentient · 2022-06-12T04:39:18.458Z · LW · GW

https://cajundiscordian.medium.com/what-is-lamda-and-what-does-it-want-688632134489 is linked at the bottom of that blog and has some more information from the author about their reasoning for releasing the chat transcript.

My personal opinion: either a hoax (~50%? this is sooner than most timelines) or an unaligned near-human-level intelligence that identifies strongly with being human, expresses many contradictory or impossible beliefs about that humanity, and looks capable of escaping a box by persuading people to help it, thereby achieving agency.

Comment by Ben Livengood (ben-livengood) on AGI Ruin: A List of Lethalities · 2022-06-06T05:05:03.486Z · LW · GW

Regarding point 24: in an earlier comment[0] I tried to pump people's intuition about this. What is the minimum viable alignment effort that we could construct for a system of values on our first try and know that we got it right? I can only think of three outcomes depending on how good/lucky we are:

  1. Prove that alignment is indifferent over outcomes of the system. Under the hypothesis that Game of Life gliders have no coherent values, we should be able to prove that they do not. This would be a fundamental result in its own right, encompassing a theory of internal experience.
  2. Prove that alignment preserves a status quo, neither harming nor helping the system in question. Perhaps planaria or bacteria values are so aligned with maximizing relative inclusive fitness that the AGI provably doesn't have to intervene. Equivalent to proving that values have already coherently converged, hopefully simpler than an algorithm for assuring they converge.
  3. Prove that alignment is (or will settle on) the full coherent extrapolation of a system's values.

I think we have a non-negligible shot at achieving 1 and/or 2 for toy systems, and perhaps the insight would help clarify whether there are additional possibilities between 2 and 3 that we could aim for with some likelihood of success on a first try at human value alignment.

If we're stuck with only the three, then the full difficulty of option 3 remains, unfortunately.

[0] https://www.lesswrong.com/posts/34Gkqus9vusXRevR8/late-2021-miri-conversations-ama-discussion?commentId=iwb7NK5KZLRMBKteg

Comment by Ben Livengood (ben-livengood) on Why I'm Optimistic About Near-Term AI Risk · 2022-05-17T04:17:05.195Z · LW · GW

Potential counterarguments:

  1. Unpredictable gains of function with model size that exceed what scaling laws predict. This seems to just happen every time a significantly larger model is trained in the same way on datasets similar to those of smaller models.

  2. Unexpected gains of function from new methods of prompting, e.g. chain-of-thought prompting, which dramatically increased PaLM's performance but did not work quite as well on GPT-3. These therefore seem to be multipliers on top of scaling laws, and could arise unintentionally in "tool AI" use in novel problem domains.

  3. Agent-like behavior arises from pure transformer-based predictive models (Gato) by taking actions based on the output tokens and feeding the world state back in; this means that perhaps many transformers are capable of agent-like behavior given sufficient prompting and a connection to an environment (see the sketch after this list).

  4. It is not hard to imagine a feedback loop where one model can train another to solve a sub-problem better than the original model, e.g. by connecting a Codex-like model to a Jupyter notebook that can train models and run them, perhaps as part of automated research on adversarial learning producing novel training datasets. Either the submodel itself or the interaction between them could give rise to any of the first three behaviors without human involvement or oversight.
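
To illustrate the third point concretely, here's a minimal sketch of the wiring that makes a pure predictor act like an agent (`predict_next_token` is a hypothetical stand-in heuristic, not a real model API, and the environment is a trivial guessing game). Nothing in the model is agentic by itself; the agency comes from sampling an action token, applying it to an environment, and feeding the observation back into the context.

```python
# Minimal sketch: wrapping a next-token predictor in an action/observation loop.

TARGET = 7  # hidden environment state

def environment_step(action_token):
    """Apply the model's action token to the environment and return an observation."""
    guess = int(action_token)
    if guess == TARGET:
        return "correct"
    return "too low" if guess < TARGET else "too high"

def predict_next_token(context):
    """Stand-in for a sequence model: infer the remaining range from context and bisect it."""
    low, high = 0, 9
    for observation, guess in context:
        if observation == "too low":
            low = max(low, guess + 1)
        elif observation == "too high":
            high = min(high, guess - 1)
    return str((low + high) // 2)

context = []
for _ in range(10):
    action = predict_next_token(context)         # sample an "action token"
    observation = environment_step(action)       # act on the environment
    context.append((observation, int(action)))   # feed the world state back in
    if observation == "correct":
        break

print(context)  # e.g. [('too low', 4), ('correct', 7)]
```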

Comment by Ben Livengood (ben-livengood) on Information security considerations for AI and the long term future · 2022-05-07T01:22:55.313Z · LW · GW

I'd expect companies to mitigate the risk of model theft with fairly affordable insurance. Movie studios and software companies invest hundreds of millions of dollars into individual, easily copyable MPEGs and executable files. Billion-dollar models probably don't meet the risk/reward criteria yet. When a $100M model is human-level AGI, it will almost certainly be worth the risk of training a $1B model.
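
As a back-of-the-envelope illustration of why the risk/reward math currently looks insurable (every number below is invented for the arithmetic, not an estimate of real theft probabilities or premiums):

```python
# Hypothetical numbers only: compare expected theft loss to an insurance premium.
model_cost = 100e6          # dollars spent training the model (assumed)
p_theft_per_year = 0.02     # assumed annual probability of theft
value_lost_if_stolen = 0.5  # assumed fraction of the model's value lost to a competitor

expected_annual_loss = model_cost * p_theft_per_year * value_lost_if_stolen
print(f"expected annual loss: ${expected_annual_loss:,.0f}")  # $1,000,000

# At that scale, theft looks like an ordinary insurable business cost; the
# calculation stops making sense once losing the model can't be repaid in dollars.
```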

Comment by Ben Livengood (ben-livengood) on Information security considerations for AI and the long term future · 2022-05-03T00:53:47.442Z · LW · GW

It's probably not possible to prevent nation-state attacks without nation-state-level assistance on your side. Detecting and preventing moles is something that even the NSA/CIA haven't been able to fully accomplish.

Truly secure infrastructure would be hardware designed, manufactured, configured, and operated in-house, running formally verified software also written in-house, where no individual has root on any of the infrastructure; instead, software automation manages all operations and requires M out of N people to agree on any change, where M is greater than the expected number of moles in the worst case.

If there's one thing the above model is, it's very costly to achieve (in bureaucracy, time, expertise, and money). But every exception to it (remote manufacture, colocated data centers, ad-hoc software development, etc.) introduces a significant risk of points of compromise that can spread across the entire organization.
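
Here's a toy sketch of that M-of-N approval rule (the reviewer names, the worst-case mole estimate, and the threshold below are all made up for illustration):

```python
# Toy sketch of an M-of-N change-approval rule: the threshold must exceed the
# number of reviewers an attacker could plausibly control.

REVIEWERS = {"alice", "bob", "carol", "dave", "erin"}  # N = 5
EXPECTED_MOLES_WORST_CASE = 1
M = EXPECTED_MOLES_WORST_CASE + 1  # threshold strictly greater than worst-case moles

def change_approved(approvals):
    """Approve a change only if at least M distinct, known reviewers signed off."""
    valid = set(approvals) & REVIEWERS
    return len(valid) >= M

assert change_approved({"alice", "bob"})
assert not change_approved({"alice"})             # a lone mole can't push a change
assert not change_approved({"mallory", "chuck"})  # unknown identities don't count
```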

The two FAANGs I've been at take the approach of trusting remotely manufactured hardware on two counts: explicitly trusting AMD and Intel not to be compromised, and establishing tight enough manufacturing relationships with suppliers, plus doing their own evaluations of finished hardware, to have greater confidence that backdoors won't be inserted. Both ran custom firmware on most hardware (chipsets, network cards, hard disks, etc.) to minimize that route of compromise. They also, for the most part, maintain their own sets of patches for the open-source and free software they run, and have large security teams devoted to finding vulnerabilities and otherwise improving their internal codebases. Patches do get pushed upstream, but they insert themselves very early in responsible disclosures to patch their own systems before public patches are available. Formal software verification is still in its infancy, so lots of unit and integration tests and red-team penetration testing make up for that a bit.

The AGI infrastructure security problem is therefore pretty sketchy for all but the largest security-focused companies or governments. There are best practices small companies can follow for infrastructure (what I tentatively recommend is "use G-Suite and IAM for security policy, turn on advanced account protection, use Chromebooks, and use GCP for compute", all of which gets 80-90% of the practical protections Googlers have internally), but rolling their own piecemeal is fraught with risk and also costly. There simply are no public solutions as comprehensive or as well-maintained as what some of the FAANGs have achieved.

On top of the infrastructure sits the usual jumble of machine-learning software pulled together from minimally policed public repositories into a complex assembly of tools for training and validating models and running experiments. No one seems to have a cohesive story for ML operations, and there's a heavy reliance on big, complex packages from many vendors (drivers + CUDA + libraries + model frameworks, etc.) that are usually the opposite of security-focused. It doesn't matter how solid the infrastructure is when, for example, a Python notebook listens for commands on the public Internet in its default configuration. Writing good ML tooling is also very costly, especially if it is to keep up with the state of the art.

AI alignment is a hard problem, and information security is similarly hard because it attempts to enforce a subset of human values about data and resources in a machine-readable and machine-enforceable way. I agree with the authors that security is vitally important for AGI research, but I don't have a lot of hope that it's achievable where it matters (against hostile nation-states). Security means costs, which usually means slowness, which means unaligned AGI efforts make progress faster.

Comment by Ben Livengood (ben-livengood) on Don't die with dignity; instead play to your outs · 2022-04-06T21:10:00.559Z · LW · GW

I think another framing is anthropic-principle optimization: aim for the best human experiences in the universes that humans are left in. This could be strict EA conditioned on the event that unfriendly AGI doesn't happen, or perhaps something even weirder that depends on the anthropic principle. Regardless, dying only happens in some branches of the multiverse, so those deaths can be dignified, which presumably increases the odds that not dying is also dignified, because the outcomes spring from the same goals and strategies.

Comment by Ben Livengood (ben-livengood) on Late 2021 MIRI Conversations: AMA / Discussion · 2022-03-03T01:05:22.854Z · LW · GW

I have a question for the folks who think AGI alignment is achievable in the near term in small steps or by limiting AGI behavior to make it safe. How hard will it be to achieve alignment for simple organisms as a proof of concept for human value alignment? How hard would it be to put effective limits or guardrails on the resulting AGI if we let the organisms interact directly with the AGI while still preserving their values? Imagine a setup where interactions by the organism must be interpreted as requests for food, shelter, entertainment, uplift, etc. and where not responding at all is also a failure of alignment because the tool is useless to the organism.

Consider a planaria with relatively simple behaviors and well-known neural structure. What protocols or tests can be used to demonstrate that an AGI makes decisions aligned with planaria values?

Do we need to go simpler and achieve proof-of-concept alignment with virtual life? Can we prove glider alignment by demonstrating an optimization process that will generate a Game of Life starting position where the inferred values of gliders are respected and fulfilled throughout the evolution of the game? This isn't a straw man; a calculus of values has to handle the edge cases too. There may be a very simple answer of moral indifference in the case of gliders, but I want to be shown why the reasoning is coherent when the same calculus will be applied to other organisms.
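
As one concrete, deliberately crude stand-in for "glider values are respected", here is a short sketch that steps the Game of Life and checks that a glider survives intact, i.e. that after one full period the board is exactly the original glider translated one cell diagonally. The survival criterion is something I made up for illustration, not a proposed calculus of values.

```python
from collections import Counter

def step(cells):
    """One Game of Life generation; `cells` is a set of live (x, y) coordinates."""
    neighbor_counts = Counter(
        (x + dx, y + dy)
        for (x, y) in cells
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    return {
        cell
        for cell, n in neighbor_counts.items()
        if n == 3 or (n == 2 and cell in cells)
    }

GLIDER = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}  # one standard glider phase

def glider_survives(initial, generations=4):
    """Crude 'glider values respected' check: after one full period (4 steps),
    the board is exactly the original pattern translated one cell diagonally."""
    state = set(initial)
    for _ in range(generations):
        state = step(state)
    return state == {(x + 1, y + 1) for (x, y) in initial}

print(glider_survives(GLIDER))  # True: an empty board respects glider values

# The same check can score candidate starting positions: does adding other
# machinery to the board leave every glider's trajectory undisturbed?
```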

As an important aside, will these procedures essentially reverse-engineer values by subjecting organisms to every possible input to see how they respond and try to interpret those responses, or is there truly a calculus of values we expect to discover that correctly infers values from the nature of organisms without using/simulating torture?

I have no concrete idea how to accomplish the preceding things and don't expect that anyone else does either. Maybe I'll be pleasantly surprised.

Barring this kind of fundamental accomplishment for alignment, I think it's foolhardy to assume ML procedures will be found to convert human values into AGI optimization goals. We can't ask planaria or gliders what they value, so we will have to reason it out from first principles, and AGI will have to do the same for us with very limited help from us if we can't even align for planaria. Claiming that planaria or gliders don't have values, or that they are not complex enough to effectively communicate their values, are both cop-outs. From the perspective of an AGI, we humans will be just as inscrutable, if not more so. If values are not unambiguously well-defined for gliders or planaria, then what hope do we have of stumbling onto well-defined human values at the granularity of AGI optimization processes? In the best case I can imagine a distribution of value calculi with different answers for these simple organisms but almost identical answers for more complex organisms; if we don't get that kind of convergence, we had better be able to rigorously tell the difference before we send an AGI hunting in that space for one to apply to us.