Posts

Why do so many think deception in AI is important? 2024-01-13T08:14:58.671Z
Back to the Past to the Future 2023-10-18T16:51:51.814Z
Why aren't more people in AIS familiar with PDP? 2023-09-01T15:27:03.562Z
Why Is No One Trying To Align Profit Incentives With Alignment Research? 2023-08-23T13:16:41.927Z
Slaying the Hydra: toward a new game board for AI 2023-06-23T17:04:39.119Z
Lightning Post: Things people in AI Safety should stop talking about 2023-06-20T15:00:24.479Z
Aligned Objectives Prize Competition 2023-06-15T12:42:21.682Z
Prometheus's Shortform 2023-06-13T23:21:43.206Z
Using Consensus Mechanisms as an approach to Alignment 2023-06-10T23:38:22.106Z
Humans are not prepared to operate outside their moral training distribution 2023-04-10T21:44:23.419Z
Widening Overton Window - Open Thread 2023-03-31T10:03:59.517Z
4 Key Assumptions in AI Safety 2022-11-07T10:50:40.211Z
Five Areas I Wish EAs Gave More Focus 2022-10-27T06:13:41.970Z
The Twins 2020-12-28T01:26:23.447Z

Comments

Comment by Prometheus on Prometheus's Shortform · 2024-04-16T22:35:37.049Z · LW · GW

Going to the moon

Say you’re really, really worried about humans going to the moon. Don’t ask why, but you view it as an existential catastrophe. And you notice people building bigger and bigger airplanes, and warn that one day, someone will build an airplane that’s so big, and so fast, that it veers off course and lands on the moon, spelling doom. Some argue that going to the moon takes intentionality. That you can’t accidentally create something capable of going to the moon. But you say “Look at how big those planes are getting! We've gone from small fighter planes, to bombers, to jets in a short amount of time. We’re on a double exponential of plane tech, and it's just a matter of time before one of them will land on the moon!”

 

Contra Scheming AIs

There is a lot of attention on mesa-optimizers, deceptive alignment, and inner misalignment. I think a lot of this can fall under the umbrella of "scheming AIs": AIs that either become dangerous during training and escape, or else play nice until humans make the mistake of deploying them. Many have spoken about the lack of any indication that there's a "homunculus-in-a-box", and this is usually met with the argument that we wouldn't see such things manifest until AIs reach a certain level of capability, at which point it might be too late, with comparisons to owl eggs or baby dragons. My perception is that getting something like a "scheming AI" or "homunculus-in-a-box" isn't impossible, and we could (and might) develop the means to do so in the future, but that it's a very, very different kind of thing from current models (even at superhuman level), and that it would take a degree of intentionality.

Comment by Prometheus on Vernor Vinge, who coined the term "Technological Singularity", dies at 79 · 2024-03-29T05:30:16.391Z · LW · GW

"To the best of my knowledge, Vernor did not get cryopreserved. He has no chance to see the future he envisioned so boldly and imaginatively. The near-future world of Rainbows End is very nearly here... Part of me is upset with myself for not pushing him to make cryonics arrangements. However, he knew about it and made his choice."

https://maxmore.substack.com/p/remembering-vernor-vinge 

Comment by Prometheus on What if Alignment is Not Enough? · 2024-03-12T04:13:26.399Z · LW · GW

I agree that consequentialist reasoning is an assumption, and am divided about how consequentialist an ASI might be. Training a non-consequentialist ASI seems easier, and the way we train them seems to actually be optimizing against deep consequentialism (they're rewarded for getting better with each incremental step, not for something that might only be better 100 steps in advance). But, on the other hand, humans don't seem to have been heavily optimized for this either*, yet we're capable of forming multi-decade plans (even if sometimes poorly).

*Actually, under the Machiavellian Intelligence Hypothesis, social selection does seem to have optimized for consequentialist reasoning (if I attack Person A, how will Person B react, etc.)

Comment by Prometheus on What if Alignment is Not Enough? · 2024-03-11T21:50:31.312Z · LW · GW

This is the kind of political reasoning that I've seen poisoning LW discourse lately, and it gets in the way of having actual discussions. Will posits essentially an impossibility proof (or, in its more humble form, a plausibility proof). I humor this being true, and state why the implications, even then, might not be what Will posits. The premise is based on alignment not being enough, so I operate on the premise of an aligned ASI, since the central claim is that "even if we align ASI it may still go wrong". The premise grants that the duration of time it is aligned is long enough for the ASI to act in the world (it seems mostly timescale agnostic), so I operate on that premise. My points are not about what is most likely to actually happen, the possibility of less-than-perfect alignment being dangerous, the AI having other goals it might seek over the wellbeing of humans, or how we should act based on the information we have.

Comment by Prometheus on What if Alignment is Not Enough? · 2024-03-11T19:02:11.632Z · LW · GW

I'm not sure who you are debating here, but it doesn't seem to be me.

First, I mentioned that this was an analogy, and mentioned that I dislike even using them, which I hope implied I was not making any kind of assertion of truth. Second, "works to protect" was not intended to mean "control all relevant outcomes of". I'm not sure why you would get that idea, but that certainly isn't what I think of first if someone says a person is "working to protect" something or someone. Soldiers defending a city from raiders are not violating control theory or the laws of physics. Third, the post is based on the premise that "even if we created an aligned ASI", so I was working with the premise that the ASI could be aligned in a way that it deeply cared about humans. Fourth, I did not assert that it would stay aligned over time... the story was all about the ASI not remaining aligned. Fifth, I really don't think control theory is relevant here. Killing yourself to save a village does not break any laws of physics, and is well within most humans' control.

My ultimate point, in case it was lost, was that if we as human intelligences could figure out an ASI would not stay aligned, an ASI could also figure it out. If we, as humans, would not want this (and the ASI was aligned with what we want), then the ASI presumably would also not want this. If we would want to shut down an ASI before it became misaligned, the ASI (if it wants what we want) would also want this.

None of this requires disassembling black holes, breaking the laws of physics, or doing anything outside of that entity's control.

Comment by Prometheus on [deleted post] 2024-03-09T22:29:15.170Z

I've heard of many such cases from EA Funds (including my own). My impression is that they only had one person working full-time managing all three funds (no idea if this has changed since I applied or not).

Comment by Prometheus on What if Alignment is Not Enough? · 2024-03-09T18:47:18.238Z · LW · GW

An incapable man would kill himself to save the village. A more capable man would kill himself to save the village AND ensure no future werewolves are able to bite villagers again.

Comment by Prometheus on What if Alignment is Not Enough? · 2024-03-08T23:34:52.699Z · LW · GW

Though I tend to dislike analogies, I'll use one, supposing it actually is impossible for an ASI to remain aligned. Suppose a villager cares a whole lot about the people in his village, and routinely works to protect them. Then, one day, he is bitten by a werewolf. He goes to the Shaman and is told that when the Full Moon rises again, he will turn into a monster and kill everyone in the village. His friends, his family, everyone. And that he will no longer know himself. He is told there is no cure, and that the villagers would be unable to fight him off. He will grow too strong to be caged, and cannot be subdued or controlled once he transforms. What do you think he would do?

Comment by Prometheus on MIRI 2024 Mission and Strategy Update · 2024-03-02T20:43:20.925Z · LW · GW

MIRI "giving up" on solving the problem was probably a net negative to the community, since it severely demoralized many young, motivated individuals who might have worked toward actually solving the problem. An excellent way to prevent pathways to victory is by convincing people those pathways are not attainable. A positive, I suppose, is that many have stopped looking to Yudkowsky and MIRI for the solutions, since it's obvious they have none.

Comment by Prometheus on The commenting restrictions on LessWrong seem bad · 2024-02-25T17:54:31.088Z · LW · GW

I don't think this is the case. For a while, the post with the highest karma was Paul Christiano explaining all the reasons he thinks Yudkowsky is wrong.

Comment by Prometheus on A Case for the Least Forgiving Take On Alignment · 2024-02-23T19:15:02.198Z · LW · GW

Fair. What would you call a "mainstream ML theory of cognition", though? Last I checked, they were doing purely empirical tinkering with no overarching theory to speak of (beyond the scaling hypothesis)

It tends not to get talked about much today, but there was the PDP (connectionist) camp of cognition vs. the camp of "everything else" (including ideas such as symbolic reasoning, etc.). The connectionist camp created a rough model of how they thought cognition worked; a lot of cognitive scientists scoffed at it; Hinton tried putting it into actual practice, and it took several decades to demonstrate that it actually works. I think a lot of people were confused by why the "stack more layers" approach kept working, but under the model of connectionism, this is expected. Connectionism is kind of too general to make great predictions, but it doesn't seem to allow for FOOM-type scenarios. It also seems to favor agents as local-optima satisficers, instead of greedy utility maximizers.

Comment by Prometheus on A Case for the Least Forgiving Take On Alignment · 2024-02-23T04:23:48.611Z · LW · GW

"My position is that there are many widespread phenomena in human cognition that are expected according to my model, and which can only be explained by the more mainstream ML models either if said models are contorted into weird shapes, or if they engage in denialism of said phenomena."

 

Such as? I wouldn't call Shard Theory mainstream, and I'm not saying mainstream models are correct either. On humans trying to be consistent decision-makers, I have some theories about that (some of which are probably wrong). But judging by how bad humans are at it, and how much they struggle to do it, they probably weren't optimized too strongly biologically for it. Memetically, though, developing ideas for consistent decision-making was probably useful, so we have software that makes use of our processing power to be better at this, even if the hardware is very stubborn at times. But even that isn't optimized too hard toward coherence. Someone might prefer pizza to hot dogs, but they probably won't always choose pizza over any other food just because they want their preference ordering of food to be consistent. And, sure, maybe what they "truly" value is something like health, but I imagine even if they didn't, they still wouldn't do this.

 

But all of this is still just one piece on the Jenga tower. And we could debate every piece in the tower, and even get 90% confidence that every piece is correct... but if there are more than 10 pieces in the tower, the whole thing is still probably going to come crashing down. (This is the part where I feel obligated to say, even though I shouldn't have to, that your tower being wrong doesn't mean "everything will be fine and we'll be safe", since the "everything will be fine" towers are looking pretty Jenga-ish too. I'm not saying we should just shrug our shoulders and embrace uncertainty. What I want is to build non-Jenga-ish towers.)

Comment by Prometheus on A Case for the Least Forgiving Take On Alignment · 2024-02-22T18:21:09.478Z · LW · GW

This isn't what I mean. It doesn't mean you're not using real things to construct your argument, but that doesn't mean the structure of the argument reflects something real. Like, I kind of imagine it looking something like a rationalist Jenga tower, where if one piece gets moved, it all crashes down. Except, by referencing other blog posts, it becomes a kind of Meta-Jenga: a Jenga tower composed of other Jenga towers. Like "Coherent decisions imply consistent utilities". This alone I view to be its own mini Jenga tower. This is where I think String Theorists went wrong. It's not that humans can't, in theory, form good reasoning based on other reasoning based on other reasoning and actually arrive at the correct answer, it's just that we tend to be really, really bad at it.

 

The sort of thing that would change my mind: there's some widespread phenomenon in machine learning that perplexes most, but is expected according to your model, and any other model either doesn't predict it as accurately, or is more complex than yours.

Comment by Prometheus on Current AIs Provide Nearly No Data Relevant to AGI Alignment · 2024-02-21T23:50:09.635Z · LW · GW

I dislike the overuse of analogies in the AI space, but to use your analogy, I guess it's like you keep assigning a team of engineers to build a car, and two possible things happen. Possibility One: the engineers are actually building car engines, which gives us a lot of relevant information for how to build safe cars (torque, acceleration, speed, other car things), even if we don't know all the details of how to build a car yet. Possibility Two: they are actually just building soapbox racers, which doesn't give us much information for building safe cars, but also means that just tweaking how the engineers work won't suddenly give us real race cars.

Comment by Prometheus on Prometheus's Shortform · 2024-02-21T23:31:28.375Z · LW · GW

If progress in AI is continuous, we should expect record levels of employment. Not the opposite.

 

My mentality is that if progress in AI doesn't have a sudden, foom-level jump, and if we don't all die, most fears of human unemployment are unfounded... at least for a while. Say we get AIs that can replace 90% of the workforce. The productivity surge from this should dramatically boost the economy, creating more companies, more trading, and more jobs. Since AIs can be copied, they would be cheap, abundant labor. This means anything a human can do that an AI still can't becomes a scarce, highly valued resource. Companies with thousands or millions of AI instances working for them would likely compete for human labor, because making more humans takes much longer than making more AIs. Then say, after a few years, AIs are able to automate 90% of the remaining 10%. That creates even more productivity, more economic growth, and even more jobs. This could continue for even a few decades. Eventually, humans will be rendered completely obsolete, but by that point most of them might be so filthy rich that they won't especially care.

 

This doesn't mean it'll all be smooth sailing or that humans will be totally happy with this shift. Some people probably won't enjoy having to switch to a new career, only for that new career to be automated away after a few years, and then having to switch again. This will probably be especially true for people who are older, have families, want a stable and certain future, etc. None of this will be made easier by the fact that it'll probably be hard to tell when true human obsolescence is on the horizon, so some might be in a state of perpetual anxiety, and others will be in constant denial.

Comment by Prometheus on A Case for the Least Forgiving Take On Alignment · 2024-02-21T21:03:34.624Z · LW · GW

I think my main problem with this is that it isn't based on anything. Countless times, you just reference other blog posts, which reference other blog posts, which reference nothing. I fear a whole lot of people thinking about alignment are starting to decouple themselves from reality. It's starting to turn into the AI version of String Theory. You could be correct, but given the enormous number of assumptions your ideas are stacked on (and that even a few of those assumptions being wrong leads to completely different conclusions), the odds of you even being in the ballpark of correct seem unlikely.

Comment by Prometheus on Current AIs Provide Nearly No Data Relevant to AGI Alignment · 2024-02-21T16:42:00.148Z · LW · GW

At first I strong-upvoted this, because I thought it made a good point. However, upon reflection, that point is making less and less sense to me. You start by claiming current AIs provide nearly no data for alignment, that they are in a completely different reference class from human-like systems... and then you claim we can get such systems with just a few tweaks? I don't see how you can go from a system that, you claim, provides almost no data for studying how an AGI would behave, to suddenly having a homunculus-in-a-box that becomes superintelligent and kills everyone. Homunculi seem really, really hard to build. By your characterization of how different actual AGI is from current models, it seems it would have to be fundamentally architecturally different from anything we've built so far, not the kind of thing that would be created by near-accident.

Comment by Prometheus on Prometheus's Shortform · 2024-02-18T19:11:10.605Z · LW · GW

Contra One Critical Try:  AIs are all cursed

 

I don't feel like making this a whole blog post, but my biggest source of optimism about why we won't need to one-shot an aligned superintelligence is that anyone who's trained AI models knows that AIs are unbelievably cursed. What do I mean by this? I mean even the first quasi-superintelligent AI we get will have so many problems and so many exploits that taking over the world will simply not be possible. Take a "superintelligence" that only had to beat humans at the very constrained game of Go, which is far simpler than the real world. Everyone talked about how such systems were unbeatable by humans, until some humans used a much "dumber" AI to find glaring holes in Leela Zero's strategy. I expect that, in the far more complex "real world", a superintelligence will have even more holes and even more exploits: a kind of "Swiss cheese superintelligence". You can say "but that's not REAL superintelligence", and I don't care, and the AIs won't care. But it's likely the thing we'll get first. Patching all of those holes, and finding ways to make such an ASI sufficiently not cursed, will also probably mean better understanding of how to stop it from wanting to kill us, if it wanted to kill us in the first place. I think we can probably get AIs that are sufficiently powerful in a lot of human domains, and can probably even self-improve, and are still cursed. The same way we have AIs with natural language understanding, something once thought to be a core component of human intelligence, that are still cursed. A cursed ASI is a danger for exploitation, but it's also an opportunity.

Comment by Prometheus on Sam Altman’s Chip Ambitions Undercut OpenAI’s Safety Strategy · 2024-02-14T01:06:06.155Z · LW · GW

I'm kind of surprised this has almost 200 karma. This feels much more like a blog post on substack, and much less like the thoughtful, insightful new takes on rationality that used to get this level of attention on the forum.

Comment by Prometheus on Why do so many think deception in AI is important? · 2024-01-13T12:06:09.891Z · LW · GW

Why would it matter if they notice or not? What are they gonna do? EMP the whole world?

Comment by Prometheus on Prometheus's Shortform · 2023-10-27T16:06:26.687Z · LW · GW

I think you're missing the point. If we could establish that all important information had been extracted from the original, would you expect humans to then destroy the original or allow it to be destroyed?

 

My guess is that they wouldn't. Which I think means practicality is not the central reason why humans do this.

Comment by Prometheus on Prometheus's Shortform · 2023-10-25T18:23:10.109Z · LW · GW

If we could somehow establish that all the information from the original had been extracted, do you expect humans to then destroy the original or allow it to be destroyed?

Comment by Prometheus on Prometheus's Shortform · 2023-10-23T14:45:05.826Z · LW · GW

Can humans become Sacred?

On 12 September 1940, the entrance to the Lascaux Cave was discovered on the La Rochefoucauld-Montbel lands by 18-year-old Marcel Ravidat when his dog, Robot, investigated a hole left by an uprooted tree (Ravidat would embellish the story in later retellings, saying Robot had fallen into the cave.)[8][9] Ravidat returned to the scene with three friends, Jacques Marsal, Georges Agnel, and Simon Coencas. They entered the cave through a 15-metre-deep (50-foot) shaft that they believed might be a legendary secret passage to the nearby Lascaux Manor.[9][10][11] The teenagers discovered that the cave walls were covered with depictions of animals.[12][13] Galleries that suggest continuity, context or simply represent a cavern were given names. Those include the Hall of the Bulls, the Passageway, the Shaft, the Nave, the Apse, and the Chamber of Felines. They returned along with the Abbé Henri Breuil on 21 September 1940; Breuil would make many sketches of the cave, some of which are used as study material today due to the extreme degradation of many of the paintings. Breuil was accompanied by Denis Peyrony, curator of the Prehistory Museum at Les Eyzies, Jean Bouyssonie and Dr Cheynier.

The cave complex was opened to the public on 14 July 1948, and initial archaeological investigations began a year later, focusing on the Shaft. By 1955, carbon dioxide, heat, humidity, and other contaminants produced by 1,200 visitors per day had visibly damaged the paintings. As air conditions deteriorated, fungi and lichen increasingly infested the walls. Consequently, the cave was closed to the public in 1963, the paintings were restored to their original state, and a daily monitoring system was introduced.

Lascaux II, an exact copy of the Great Hall of the Bulls and the Painted Gallery, was displayed at the Grand Palais in Paris, before being displayed from 1983 in the cave's vicinity (about 200 m or 660 ft away from the original cave), a compromise and attempt to present an impression of the paintings' scale and composition for the public without harming the originals.[10][13] A full range of Lascaux's parietal art is presented a few kilometres from the site at the Centre of Prehistoric Art, Le Parc du Thot, where there are also live animals representing ice-age fauna.[14]

The paintings for this site were duplicated with the same type of materials (such as iron oxide, charcoal, and ochre) which were believed to be used 19,000 years ago.[9][15][16][17] Other facsimiles of Lascaux have also been produced over the years.

They have also created additional copies, Lascaux III, Lascaux IV, and Lascaux V.

 

Consequently, the cave was closed to the public in 1963, the paintings were restored to their original state, and a daily monitoring system was introduced.

“I actually find it overwhelmingly hopeful, that four teenagers and a dog named Robot discovered a cave with 17,000-year-old handprints, that the cave was so overwhelmingly beautiful that two of those teenagers devoted themselves to its protection. And that when we humans became a danger to that cave's beauty, we agreed to stop going. Lascaux is there. You cannot visit.”

-John Green



 

People preserve the remains of Lucy, work hard to preserve old books, the Mona Lisa is protected under bullet-proof glass and is not up for sale.

What is the mechanistic reason for this? There are perfect copies of these things, yet humans go to great lengths to preserve the original. Why is there the Sacred?

They have created copies of Lascaux, yet still work hard to preserve the original. Humans cannot enter. They get no experience of joy from visiting. It is not for sale. Yet they strongly desire to protect it, because it is the original, and no other reason.

Robin Hanson gave a list of Sacred characteristics, some of which I find promising:

Sacred things are highly (or lowly) valued. We revere, respect, & prioritize them.

Sacred is big, powerful, extraordinary. We fear, submit, & see it as larger than ourselves.

We want the sacred “for itself”, rather than as a means to get other things.

Sacred makes us feel less big, distinct, independent, in control, competitive, entitled.

Sacred quiets feelings of: doubts, anxiety, ego, self-criticism, status-consciousness.

We get emotionally attached to the sacred; our stance re it is oft part of our identity.

We desire to connect with the sacred, and to be more associated with it.

Sacred things are sharply set apart and distinguished from the ordinary, mundane.

Re sacred, we fear a slippery slope, so that any compromise leads to losing it all.

 

If we can understand the sacred, it seems like a concept that probably wouldn’t fall into a simple utility function, something that wouldn’t break out-of-distribution. A kind of Sacred Human Value Shard, something that protects our part of the manifold.


 

Comment by Prometheus on FAQ: What the heck is goal agnosticism? · 2023-10-08T11:19:53.539Z · LW · GW

Suppose you've got a strong goal agnostic system design, but a bunch of competing or bad actors get access to it. How does goal agnosticism stop misuse?

 

This was the question I was waiting to be answered (since I'm already basically on board with the rest of it), but I was disappointed you didn't have a more detailed answer. Keeping this out of incompetent/evil hands perpetually seems close to impossible. It seems this goes back to needing a maximizer-type force in order to prevent such misuse from occurring, and then we're back to square one with the classic alignment problem of hitting a narrow target for a maximizing agent.

 

Overall, a very good read, well-researched and well-reasoned.

Comment by Prometheus on There should be more AI safety orgs · 2023-09-25T02:57:18.936Z · LW · GW

I created a simple Google Doc for anyone interested in joining/creating a new org to put down their names, contact info, what research they're interested in pursuing, and what skills they currently have. Over time, I think a network can be fostered, where relevant people start forming their own research, and then begin building their own orgs/getting funding. https://docs.google.com/document/d/1MdECuhLLq5_lffC45uO17bhI3gqe3OzCqO_59BMMbKE/edit?usp=sharing

Comment by Prometheus on Why aren't more people in AIS familiar with PDP? · 2023-09-01T20:41:11.860Z · LW · GW

But it's also an entire School of Thought in Cognitive Science. I feel like DL is the method, but without the understanding that these methods are based on well-thought-out, mechanistic rules for how cognition fundamentally works, building potentially toward a unified theory of cognition and behaviour.

Comment by Prometheus on Why aren't more people in AIS familiar with PDP? · 2023-09-01T19:06:48.057Z · LW · GW

I don't have an adequate answer for this, since these models are incomplete. But the way I see it is that these people had a certain way of mathematically reasoning about cognition (Hinton, Rumelhart, McClelland, Smolensky), and that reasoning created most of the breakthroughs we see today in AI (backprop, multi-layer models, etc.). It seems trying to utilize that model of cognition could give rise to new insights about the questions you're asking, attack the problem from a different angle, or help create a grounded paradigm for alignment research to build on.

Comment by Prometheus on Why aren't more people in AIS familiar with PDP? · 2023-09-01T18:58:14.341Z · LW · GW

My answer is a bit vague, but I would say that the current DL curriculum tells you how these things work, but it doesn't go into the reasoning about cognition that allowed these ideas to exist in the first place.

Comment by Prometheus on Why aren't more people in AIS familiar with PDP? · 2023-09-01T18:49:57.417Z · LW · GW

You could say it "predicted" everything post-AlexNet, but it's more that it created the fundamental understanding for everything post-AlexNet to exist in the first place. It's the mathematical model of cognition that all of modern AI is built on. This is how we got backpropagation, "hidden" layers, etc.
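To make the "backpropagation plus hidden layers" point concrete, here is a minimal toy sketch (my own illustration, not anything from the PDP volumes, and all the sizes are arbitrary): a network with one hidden layer, trained with hand-written backpropagation, learns XOR, a function no network without a hidden layer can represent.

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR is not linearly separable, so a "hidden" layer is required.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(0, 1, (2, 4))   # input -> hidden weights
b1 = np.zeros((1, 4))
W2 = rng.normal(0, 1, (4, 1))   # hidden -> output weights
b2 = np.zeros((1, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 1.0
for step in range(5000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)      # hidden activations
    out = sigmoid(h @ W2 + b2)    # prediction

    # Backward pass: propagate the error signal layer by layer
    # (squared-error loss, sigmoid units)
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)

    W2 -= lr * (h.T @ d_out)
    b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * (X.T @ d_h)
    b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(np.round(out, 2))  # should approach [[0], [1], [1], [0]]
```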

Comment by Prometheus on Why Is No One Trying To Align Profit Incentives With Alignment Research? · 2023-08-25T17:05:56.510Z · LW · GW

If you want to try doing this, or know someone who does, let me know. I've noticed a lot of things in AIS that people say they'd like to see, but then nothing happens.

Comment by Prometheus on AGI-Automated Interpretability is Suicide · 2023-08-20T18:28:24.018Z · LW · GW

I guess my biggest doubt is that a DL-based AI could run interpretability on itself. Large NNs seem to "simulate" a larger network to represent more features, which results in most of the weights occupying a superposition. I don't see how a network could reflect on itself, since it seems that would require an even greater network (which would then require an even greater network, and so on). I don't see how it could eat its own tail, since only interpreting parts of the network would not be enough. It would have to interpret the whole.
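To illustrate what I mean by superposition, here is a toy numerical sketch (my own illustration, assuming the usual "features as directions" framing; the widths and counts are arbitrary): a layer of width 64 can hold hundreds of nearly orthogonal feature directions, but only at the cost of interference between them, which is part of why reading a network back out of itself is hard.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 64      # width of the layer (number of "neurons")
n_features = 512  # far more features than dimensions

# Random unit vectors as feature directions.
F = rng.normal(size=(n_features, d_model))
F /= np.linalg.norm(F, axis=1, keepdims=True)

# Pairwise interference: how non-orthogonal the feature directions are.
overlaps = F @ F.T
np.fill_diagonal(overlaps, 0.0)
print("max |cos| between distinct features:", np.abs(overlaps).max())
print("mean |cos| between distinct features:", np.abs(overlaps).mean())

# Reading one feature back out when a few sparse features are active:
active = rng.choice(n_features, size=5, replace=False)
x = F[active].sum(axis=0)     # superposed representation of 5 active features
readout = F[active[0]] @ x    # ~1 plus small interference terms
print("readout of an active feature:", readout)
```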

Comment by Prometheus on Prometheus's Shortform · 2023-06-29T19:18:25.806Z · LW · GW

The following is a conversation between myself in 2022, and a newer version of myself earlier this year.
 

On AI Governance and Public Policy

2022 Me: I think we will have to tread extremely lightly with this, or, if possible, avoid it completely. One particular concern is the idea of gaining public support. Many countries have an interest in pleasing their constituents, so if executed well, this could be extremely beneficial. However, it runs a high risk of doing far more damage. One major concern is the different mindset needed to conceptualize the problem. Alerting people to the dangers of Nuclear War is easier: nukes have been detonated, the visual image of incineration is easy to imagine and can be described in detail, and they or their parents have likely lived through nuclear drills in school. This is closer to trying to explain to someone the dangers of nuclear war before Hiroshima, before the Manhattan Project, and before even TNT was developed. They have to conceptualize what an explosion even is, not simply imagine an explosion at greater scale. Most people will simply not have the time or the will to try to grasp this problem, so this runs the risk of having people calling for action on a problem they do not understand, which will likely lead to dismissal by AI Researchers, and possibly short-sighted policies that don’t actually tackle the problem, or even make the problem worse by having the guise of accomplishment. To make matters worse, there is the risk of polarization. Almost any concern with political implications that has gained widespread public attention runs a high risk of becoming polarized. We are still dealing with the ramifications of well-intentioned, but misguided, early advocates in the Climate Change movement two decades ago, who set the seeds for making climate policy part of one’s political identity. This could be even more detrimental than a merely uninformed electorate, as it might push people who had no previous opinion on AI to advocate strongly in favor of capabilities acceleration, and to be staunchly against any form of safety policy. Even if executed using the utmost caution, this does not stop other players from using their own power or influence to hijack the movement and lead it astray.

2023 Me: Ah, Me’22, the things you don’t know! Many of Me’22’s concerns I think are still valid, but we’re experiencing what chess players might call a “forced move”. People are starting to become alarmed, regardless of what we say or do, so steering that in a direction we want is necessary. The fire alarm is being pulled regardless, and if we don’t try to show some leadership in that regard, we risk less informed voices and blanket solutions winning out. The good news is that “serious” people are going on “serious” platforms and actually talking about x-risk. Other good news is that, from current polls, people are very receptive to concerns over x-risk, and it has not currently fallen along divisive lines (roughly the same percentage express concern across different demographics). This is still a difficult minefield to navigate. Polarization could still happen, especially with an election year in the US looming. I’ve also been talking to a lot of young people who feel frustrated not having anything actionable to do, and if those in AI Safety don’t show leadership, we might risk (and indeed are already risking) many frustrated youth taking political and social action into their own hands. We need to be aware that EA/LW might have an Ivory Tower problem, and that, even though a pragmatic, strategic, and careful course of action might be better, this might make many feel “shut out” and attempt to steer their own course. Finding a way to make those outside EA/LW/AIS feel included, with steps to help guide and inform them, might be critical to avoiding movement hijacking.

On Capabilities vs. Alignment Research:

2022 Me: While I strongly agree that not increasing capabilities is a high priority right now, I also question whether we risk creating a state of inertia. Within the realm of safety research, there are very few domains that do not risk increasing capabilities. And, while capabilities continue to progress every day, we might risk failing to keep up the pace of safety progress simply because every action risks an increase in capabilities. Rather than a “do no harm” principle, I think counterfactuals need to be examined in these situations, where we must consider whether there is a greater risk if we *don’t* do research in a certain domain.

2023 Me: Oh, oh, oh! I think Me’22 was actually ahead of the curve on this one. This might still be controversial, but I think many got the “capabilities space” wrong. Many AIS-inspired theories that could increase capabilities are for systems that would be safer, more interpretable, and easier to monitor by default. And by not working on such systems we instead got the much more inscrutable, dangerous models by default, because the more dangerous models are easier. To quote the vape commercials, “safer != safe”, but I still quit smoking in favor of electronics because safer is still at least safer. This is probably a moot point now, though, since I think it’s likely too late to create an entirely new paradigm in AI architectures. Hopefully Me’24 will be happy to tell me we found a new, 100% safe and effective paradigm that everyone’s hopping on. Or maybe he’ll invent it.


 

Comment by Prometheus on Slaying the Hydra: toward a new game board for AI · 2023-06-24T21:02:02.103Z · LW · GW

[crossposting my reply]

Thank you for taking the time to read and critique this idea. I think this is very important, and I appreciate your thoughtful response.

Regarding how to get current systems to implement/agree to it, I don't think that will be relevant long-term. I don't think the mechanisms current institutions use for control can keep up with AI proliferation. I imagine most existing institutions will still exist, but they won't have the capacity to do much once AI really takes off. My guess is, if AI kills us, it will happen after a slow-motion coup. Not any kind of intentional coup by AIs, but from humans just coup'ing themselves because AIs will just be more useful. My idea wouldn't involve removing or replacing any institutions; they just wouldn't be very relevant to it. Some governments might try to actively ban use of it, but these bans would probably be fleeting if the network actually were superior in collective intelligence to any individual AI. If it made work economically more useful for them, they would want to use it. It doesn't involve removing institutions, or doing much to directly interfere with the things they are doing. Think of it this way: recommendation algorithms on social media have an enormous influence on society, institutions, etc. Some try to ban or control them, but most people can still access them if they want to, and no entity really controls them. But no one incorporates the "will of Twitter" into their constitution.

The game board isn't any of the things you mention. I don't think any of the things you mention have the capacity to do much to change the board. The current board is fundamentally adversarial, where interacting with it increases the power of other players. We've seen this with OpenAI, Anthropic, etc. The new board would be cooperative, at least at a higher level. How do we make the new board more useful than the current one? My best guess would be the economic advantage of decentralized compute. We've seen how fast the open-source community has been able to make progress. And we've seen how a huge amount of compute gets used doing things like mining Bitcoin, even though the compute is wasted on solving math puzzles. Contributing decentralized compute to a collective network could actually have economic value, and I imagine this will happen one way or another, but my concern is it'll end up being for the worse if people aren't actively trying to create a better system. A decentralized network with no safeguards would probably be much worse than anything a major AI company could create.

"But wouldn't the market be distorted by the fact that if everyone ends up dead, there is nobody left alive to collect their prediction-market winnings?"

This seems to be going back to the "one critical shot" approach, which I think is a terrible idea that won't possibly work in the real world under any circumstances. This would be a progression over time, not a case where an AI goes supernova overnight. This might require slower takeoffs, or at least no foom scenarios. Making a new board that isn't adversarial might mitigate the potential for foom. What I proposed was my first naive approach, and I've since thought that maybe it's the collective intelligence of the system that should be increasing, not a singleton AI being trained at the center. Most members in that collective intelligence would initially be humans, and slowly AIs would become a more and more powerful part of the system. I'm not sure here, though. Maybe there's some third option where there's a foundational model at the lowest layer of the network, but it isn't a singular AI in the normal sense. I imagine a singular AI at the center could give rise to agency, and probably break the whole thing.

"It seems to me that having a prediction market for different alignment approaches would be helpful, but would be VERY far from actually having a good plan to solve alignment."

I agree here. They'd only be good at maybe predicting the next iteration of progress, not a fully scalable solution.

"I feel like we share many of the same sentiments -- the idea that we could improve the general level of societal / governmental decision-making using innovative ideas like better forms of voting, quadratic voting & funding, prediction markets, etc"

This would be great, but my guess is it would progress too slowly to be useful. I don't think mechanism design that has to work through currently existing institutions will happen quickly enough. Technically enforced design might.

I love the idea of shovel-ready strategies, and think we need to be prepared in the event of a crisis. My issue is even most good strategies seem to just deal with large companies, and don't know how to deal with the likelihood that such power will fall into more and more actors.

Comment by Prometheus on Lightning Post: Things people in AI Safety should stop talking about · 2023-06-21T15:16:40.745Z · LW · GW

"If the boxed superintelligence with the ability to plan usage of weapons when authorized by humans, and other boxed superintelligences able to control robotics in manufacturing cells are on humans side, the advantage for humans could be overwhelming"

As I said, I do not expect boxed AIs to be something most people will build. We haven't seen it, and I don't expect to see it, because unboxed AIs are superior. This isn't how the people in control are approaching the situation, and I don't expect that to change.

Comment by Prometheus on Lightning Post: Things people in AI Safety should stop talking about · 2023-06-21T15:11:02.025Z · LW · GW

"keep it relegated to "tool" status, then it might be possible to use such an AI to combat unboxed, rogue AI"

I don't think this is a realistic scenario. You seem to be seeing it as an island of rogue, agentic, "unboxed" AIs in a sea of tool AIs. I think it's much, much more realistic that it'll be the opposite. Most AIs will be unboxed agents because they are superior. 

"For example, give it a snapshot of the internet from a day ago, and ask it to find the physical location of rogue AI servers, which you promptly bomb."

This seems to be approaching it from a perspective where people in AIS have taken global control, or where normal people somehow start thinking the way they do. This is not realistic. This is not the world we live in. This is not how the people in control think.

"You should keep talking about this because if an AI needs military access to dominate the world, then the number of potentially dangerous AI goes from the hundreds of thousands or millions to a few dozen, run by large countries that could theoretically be kept in line with international treaties."

This is a topic that I debated putting on the list but resolved not to. Regardless of treaties, I don't think humans have any real control at that point. I don't even expect a rogue AI to have to forcefully coup humans. I expect us to coup ourselves. We might have figureheads occupying official positions, such as "President"/"CEO"/etc., but I don't think humans will have much control over their own destiny by that point. Large-scale coordination I don't think will be possible by then. I did remove it, because it seems more uncertain than the others listed.

"You should keep talking about this because it changes how many AI's you'd have to monitor as active threats."

Who is doing this monitoring? What is their power to act on such threats? Despite recent interest in AI Risk from "serious people", I don't think it's at all realistic that we'll see anything like this. 

"If 99.9% of AGI are dangerous, there might be warning shots that can be used to pre-emptively ban AGI research in general."

Probability distributions of how many AIs are dangerous is probably useful. I don't think specific AIs being dangerous/non-dangerous will be, because I expect widespread proliferation. In terms of political ways out of the problem, I agree that some kind of crisis or "warning shot" is the most realistic situation where that might happen. But there have to actually be warning shots. Explaining thought experiments probably won't matter.  And, if that happens, I don't think it would be a good idea to debate which specific AIs might kill you, and instead just call for a sweeping ban on all AI.

Comment by Prometheus on EigenKarma: trust at scale · 2023-06-20T12:45:07.171Z · LW · GW

Are you familiar with Constellation's Proof of Reputable Observation? This seems very similar.

Comment by Prometheus on Prometheus's Shortform · 2023-06-13T23:21:43.282Z · LW · GW

The following is a conversation between myself in 2022, and a newer version of me earlier this year.

On the Nature of Intelligence and its "True Name":

2022 Me: This has become less obvious to me as I’ve tried to gain a better understanding of what general intelligence is. Until recently, I always assumed that intelligence and agency were the same thing. But General Intelligence, or G, might not be agentic. Agents that behave like reinforcement learners may only be narrow forms of intelligence, without generalizability. G might be something closer to a simulator. From my very naive perception of neuroscience, it could be that our intelligence is not agentic, but just simulates agents. In this situation, the prefrontal cortex not only runs simulations to predict its next sensory input, but might also run simulations to predict inputs from other parts of the brain. In this scenario, “desire” or “goals” might be simulations to better predict narrowly intelligent agentic optimizers. Though the simulator might be myopic, I think this prediction model allows for non-myopic behavior, in a similar way to how GPT has non-myopic behavior despite only trying to predict the next token (it has an understanding of where a future word “should” be within the context of a sentence, paragraph, or story). I think this model of G allows for the appearance of intelligent goal-seeking behavior, long-term planning, and self-awareness. I have yet to find another model for G that allows for all three. The True Name of G might be Algorithm Optimized To Reduce Predictive Loss.

2023 Me: interesting, me’22, but let me ask you something: you seem to think this majestic ‘G’ is something humans have, but other species do not, and then name the True Name of ‘G’ to be Algorithm Optimized To Reduce Predictive Loss. Do you *really* think other animals don’t do this? How long is a cat going to survive if it can’t predict where it’s going to land? Or where the mouse’s path trajectory is heading? Did you think it was all somehow hardcoded in? But cats can jump up on tables, and those weren’t in the ancestry environment, there’s clearly some kind of generalized form of prediction occurring. Try formulating that answer again, but taboo “intelligence”, “G”, “agent”, “desire”, and “goal”. I think the coherence of it breaks down.

Now, what does me’23 think? Well, I’m going to take a leaf from my own book, and try to explain what I think without the words mentioned above. There are predictive mechanisms in the Universe that can run models of what things in the Universe might do in future states. Some of these predictive mechanisms are more computationally efficient than others. Some will be more effective than others. A more effective and efficient predictive mechanism, with a large input of information about the Universe, could be a very powerful tool. If taken to the theoretical (not physical) extreme, that predictive mechanism would hold models of all possible future states. It could then, by accident or intention, guide outcomes toward certain future states over others.

2022 Me: according to this model, humans dream because the simulator is now making predictions without sensory input, gradually creating a bigger and bigger gap from reality. Evidence to support this comes from sensory-deprivation tanks, where humans, despite being awake, have dream-like states. I also find it interesting that people who exhibit Schizophrenia, which involves hallucinations (like dreams do), can tickle themselves. Most people can be tickled by others, but not themselves. But normal people on LSD can do this, and can also have hallucinations. My harebrained theory is that something is going wrong when initializing new tokens for the simulator, which results in hallucinations from the lack of correction from sensory input, and a less strong sense of self because of a lack of correction from RL agents in other parts of the brain.

2023 Me: I don’t want to endorse crackpot theories from Me’22, so I’m just going to speak from feelings and fuzzy intuitions here. I will say hallucinations from chatbots are interesting. When getting one to hallucinate, it seems to be kind of “making up reality as it goes along”. You say it’s a Unicorn, and it will start coming up with explanations for why it’s a Unicorn. You say it told you something you never told it, and it will start acting as though it did. I have to admit it does have a strange resemblance to dreams. I find myself in New York, but remember that I had been in Thailand that morning, and I hallucinate a memory of boarding a plane. I wonder where I got the plane ticket, and I hallucinate another memory of buying one. These are not well-reasoned arguments, though, so I hope Me’24 won’t beat me up too much about them.

2022 Me: I have been searching for how to test this theory. One interest of mine has been mirrors. 

2023 Me: Don’t listen to Me’22 on this one. He thought he understood something, but he didn’t. Yes, the mirror thing is interesting in animals, but it’s probably a whole different thing, not the same thing.

Comment by Prometheus on Using Consensus Mechanisms as an approach to Alignment · 2023-06-12T19:41:23.165Z · LW · GW

Thanks, finding others who are working on similar things is very useful. Do you know if the reading group is still active, or if they are working on anything new?

Comment by Prometheus on What games are using the concept of a Schelling point? · 2023-04-12T21:25:15.738Z · LW · GW

Given that I don't know when Schelling Day is, I doubt its existence.

Comment by Prometheus on Agentized LLMs will change the alignment landscape · 2023-04-10T16:33:31.125Z · LW · GW

If we're being realistic, this kind of thing would only get criminalized after something bad actually happened. Until then, too many people will think "omg, it's just a Chatbot". Any politician calling for it would get made fun of on every Late Night show.

Comment by Prometheus on Agentized LLMs will change the alignment landscape · 2023-04-10T16:31:05.182Z · LW · GW

Yeah, all the questions over the years of "why would the AI want to kill us" could be answered with "because some idiot thought it would be funny to train an AI to kill everyone, and it got out of hand". Unfortunately, stopping everyone on the internet from doing things isn't realistic. It's much better to never let the genie out of the bottle in the first place.

Comment by Prometheus on Widening Overton Window - Open Thread · 2023-03-31T22:02:38.117Z · LW · GW

I'm currently thinking that if there are any political or PR resources available to orgs (AI-related or EA) now is the time to use them. Public interest is fickle, and currently most people don't seem to know what to think, and are looking for serious-seeming people to tell them whether or not to see this as a threat. If we fail to act, someone else will likely hijack the narrative, and push it in a useless or even negative direction. I don't know how far we can go, or how likely it is, but we can't assume we'll get another chance before the public falls back asleep or gets distracted (the US has an election next year, so most discourse will then likely become poisoned). This is especially important for those in the community who are viewed as "serious people" or "serious organizations" (lots of academic credentials, etc.)

Comment by Prometheus on Widening Overton Window - Open Thread · 2023-03-31T21:41:49.189Z · LW · GW

Yeah, since the public currently doesn't have much of an opinion on it, trying to get the correct information out seems critical. I fear some absolutely useless legislation will get passed, and everyone will just forget about it once the shock-value of GPT wears off.

Comment by Prometheus on Language Ex Machina · 2023-01-29T12:17:53.911Z · LW · GW

Unfortunately, he could probably get this published in various journals, with only minor edits being made.

Comment by Prometheus on Language Ex Machina · 2023-01-29T12:11:27.805Z · LW · GW

I stopped reading about 1/3 of the way into it, because the prose was driving me mad, and went to the spoiler. Anyone who has ever had to read an academic article that attempts to sound more intelligent than it actually is understands my frustration. I was suspicious, since I had read some of your other work and this clearly didn't match it, but was still relieved to know your brain hasn't yet completely melted.

Comment by Prometheus on Distinguishing test from training · 2022-12-01T14:10:27.705Z · LW · GW

What if you kept building more and more advanced adversarial networks designed to fool the AI about reality? Or what if you implemented patterns in deployment to make it appear as though it’s still a simulation?

Comment by Prometheus on 4 Key Assumptions in AI Safety · 2022-11-08T02:38:14.460Z · LW · GW

Perhaps it's better to avoid the word intelligence, then. Semantics isn't really important. What is important is that I can imagine a non-agentic simulator or some other entity having severely transformative effects, some of which could be catastrophic.

Comment by Prometheus on The Tails Coming Apart As Metaphor For Life · 2022-11-01T11:51:21.788Z · LW · GW

I vote for the rats on heroin too. Just because it would be hilarious to think of some budding civilization in a distant galaxy peering into the heavens, and seeing the Universe being slowly consumed by a horde of drug-addicted rodents.

Comment by Prometheus on The Tails Coming Apart As Metaphor For Life · 2022-11-01T11:47:52.018Z · LW · GW

I guess this raises the question: do we actually want to pursue a certain moral system as a collective goal?

Comment by Prometheus on Five Areas I Wish EAs Gave More Focus · 2022-10-28T04:41:46.993Z · LW · GW

I meant the focus on biotech in terms of the prevention/mitigation of bioweapons, rather than the positive side of biotech. I'll change the wording to avoid confusion.