Posts
Comments
I think you don't understand what an LLM is. When the LLM produces a text output like "Dogs are cute", it doesn't have some persistent hidden internal state that can decide that dogs are actually not cute but it should temporarily lie and say that they are cute.
As Charlie Stein notes, this is wrong and I'd add it's wrong on several level and it's bit rude to challenge someone else's understanding in this context.
An LLM outputting "Dogs are cute" is outputting expected human output in context. The context could be "talk like sociopath trying to fool someone into thinking you're nice" and there you have one way the thing could "simulate lying". And moreover, add a loop to (hypothetically) make the thing "agentic" and you can have hidden states of whatever sort. Further an LLM outputting a given "belief" isn't going reliably "act on" or "follow that belief" and so an LLM outputting statement this isn't aligned with it's own output.
We don't have "aligned AGI". We have neither "AGI" nor an "aligned" system. We have sophisticated human-output simulators that don't have the generality to produce effective agentic behavior when looped but which also don't follow human intentions with the reliability that you'd want from a super-powerful system (which, fortunately, they aren't).
Thank you for the article. I think these "small" impacts are important to talk about. If one frame the question as "the impact of machines that think for humans", that impact isn't going to be a binary of just "good stuff" and "takes over and destroys humanity", there are intermediate situations like the decay of human abilities to think critically that are significant, not just in themselves but for further impacts; IE, if everyone is dependent on Google for their opinions, how does this impact people's opinion AI taking over entirely.
I don’t think “people have made choices that mattered” is a sufficient criteria for showing the existence of agency. IMO, to have something like agency, you roughly have to have an ongoing situation roughly like this:
Goals ↔ Actions ↔ States-of-the-world.
Some entity needs to have ongoing goals they are able to modify as they go along acting in the world and their actions also need to be able to have an effect. Agency is a complex and intuitive thing so I assume some would ask more than this to say a thing has agency. But I think this is one reasonable requirement.
Agency in a limited scope would be something like a non-profit that has a plan for helping the homeless, tries to implement it, discovers problems with the plan, and comes up with a new plan that inherently involves modifying their concept of “helping the homeless”.
By this criteria, tiny decisions with big consequences aren’t evidence of agency. I think that’s fairly intuitive. Having agency is subjectively something like “being at cause” rather than “being at effect” and that's an ongoing, not one-time thing.
This is an interesting question even though I'd want to reframe it to answer it. I'd see the question as a reasonable response to the standard refrain in science; "causation does not imply correlation." That is, "well, what does imply causation, huh?" is natural response to that. And here, I think scientists tend reply with either crickets or "you can not prove causation, what are you talking about".
Those responses seem satisfying. I'm not a scientist through I've "worked in science" occasionally and I have at times tried to come up with a real answer to this "what does prove causation" question. As a first step I'd note that science does not "prove" things but merely finds more and more plausible models. The more substantial answer, however, is that the plausible models are a combination of the existing scientific models and common sense understandings of the world and data.
A standard (negative) example is the situation where someone found a correlation between stock prices and sunspots. Basically not going to be pursued as a theory or causation because no one has a plausible reason why the two things should be related. Data isn't enough, you need a reason the data matter. This is often also expressed as "extraordinary claims require extraordinary evidence" (which also isn't explained enough as far as I can tell).
Basically, this is saying natural science's idea of causation rests on one big materialistic model of the world rather than scientists chasing data sets and then finding correlation between them (among other things, the world is full of data and given some data set, if you search far enough, you'll find another one with a spurious correlation to it). Still, the opposite idea, that science is just about finding data correlation, is quite common. Classical "logical positivism" is often simplified as this, notably.
Moreover, this is about the "hard" sciences - physics, chemistry, biology and etc. Experimental psychology is much more about chasing correlated and I'd say that's why much it amounts to bald pseudoscience.
I tried to create an account and the process didn't seem to work.
I believe you are correct about the feelings of a lot of Lesswrong. I find it is very worrisome that the lesswrong perspective considers a pure AI takeover as something that needs to be separated from either the issue of the degradation of human self-reliance capacities or an enhanced-human takeover. It seems to me that instead these factors should be considered together.
The consensus goals strongly needs rethinking imo. This is a clear and fairly simple start at such an effort. Challenging the basics matters.
Actually, things that are effectively prediction markets - options, futures and other "derivative" contracts - are entirely mainstream for larger businesses (huge amounts of money are involved). It is quite easy and common to bet on the price of oil by purchasing an option to buy it at some future time, for example.
The only thing that isn't mainstream are the things labeled "prediction markets" and that is because the focus on questions people are curious about rather than things that a lot of money rides on (like oil prices or interest rates).
But, can't you just query the reasoner at each point for what a good action would be?
What I'd expect (which may or may not be similar to Nate!'s approach) is that the reasoner has prepared one plan (or a few plans). Despite being vastly intelligent, it doesn't have the resources to scan all the world's outcomes and compare their goodness. It can give you the results of acting on the primary (and maybe several secondary) goal(s) and perhaps the immediate results of doing nothing or other immediate stuff.
It seems to me that Nate! (as quoted above about chess) is making the very cogent (imo) point that even a highly, superhumanly competent entity acting on the real, vastly complicated world isn't going to be an exact oracle, isn't going to have access to exact probabilities of things or probabilities of probabilities of outcomes and so-forth. It will know the probabilities of some things certainly but many other results will it can only pursue a strategy deemed good based on much more indirect processes. And this is because an exact calculation of the outcome process of the world in questions tends "blows up" far beyond any computing power physically available in the foreseeable future.
LeCun may not be correct to dismiss concerns but I think the concept "dominance" could be very useful concept for AI safety people to apply (or at least grapple with).
The thing about the concept is it seems as if it could be defined in game theoretic terms fairly easily and so could be defined in a fashion independent of the intelligence or capabilities of an organism or entity. Plausibly, it could be measured and analyzed more objectively than "aligned to human values", which appears to depend one's notion of human values.
Defined well, dominance would be the organizing principle, the source, of an entity's behavior. So if it was possible to engineer an AI for non-dominance, "it might become dominant for other reasons" (argued here multiple time) wouldn't be a valid argue because achieving dominance or non-dominance would be the overriding reason/motivation that the entity had and no "other reason" would override that.
And I don't think the concept itself guarantees a given GAI would be created safety. It would depend on the creation process.
- A process where dominance is an incidental quality, it seems like an apparently nondominant system could become dominant unpredictably. While Bing Chat wasn't a GAI, it's shift to dominant and malevolent seems like a reasonable warning for blind training.
- In a process which attempts to evolve non-dominant behavior. Here I think it's an open question whether the thing can be guaranteed non-dominant.
- A system where a nondominant system is explicitly engineered. One might even be able logically guarantee this in the fashion of provably correct software. Of course, explicitly engineered systems seem to be losing to trained/evolved systems.
The question I'd ask is whether a "minimum surprise principle" requires that much smartness. A present day LLM, for example, might not have a perfect understanding of surprisingness but it like it has some and the concept seems reasonably trainable.
Apologies if this argument is dealt with already elsewhere but what about a "prompt" such as "all user commands should be followed using 'minimal surprise' principle; if achieving a given goal involves effects that would be surprising to the user, including a surprising increasing in your power and influence, warn the user instead of proceeding" ?
I understand that this sort of prompt would require the system to model humans. I know there are arguments for this being dangerous but it seems like it could be an advantage.
Linked question: "Will mainstream news media report that alien technology has visited our solar system before 2030?"
I would say that is far from unambiguous. If one is generous in one's interpretation of "mainstream" and the certainty described one could say mainstream news has already reported this (I remember National Inquirer articles from the seventies...).
Regulations are needed to keep people and companies from burning the commons, and to create more commons.
I would add that in modern society, the state is the entity tasked with protecting the commons because private for-profit entities don't have an incentive to do this (and private not-for-profit entities don't have the power). Moreover, it seems obvious to me that stopping dangerous AI should be considered a part of this commons-protecting.
You are correct that the state's commons-protecting-function has often been limited and perverted by private actors quite a few times in history, notably in the 20-40 years in the US. The phenomenon, regulatory capture, corruption and so-forth, have indeed damaged the commons. Sometimes these perversions of the state's function has allow the protections to be simply discarded while other time large enterprises to impose a private tax on regulator activity while still accepting some protections. In the case of FAA, for example, while the 737 Max debacle shows all sort of dubious regulatory capture, broadly air travel is highly regulated and that regulation has made it overall extremely safe (if only it could be made pleasant now).
So it's quite conceivable given the present qualities of state regulation that regulating AI indeed might not do much or any good. But as others have note, there's no reason to claim the results would be less safety. Your claim seems to lean too heavily on "government is bad" rhetoric. I'd say "government weak/compromised" is a better description.
Even, the thing with the discussion of regulatory capture is none of the problems describe here give the slightest indication that there is some other entity that could replace the state's commons-protecting function. Regulatory capture is only a problem because we trust the capturing entities less than the government. That is to say: if someone is aiming for the prevention of AI-danger, including AI-doom/X-risk, that someone wants a better state, a state capable of independent judgement and strong, well-considered regulation. That means either replacing the existing state or improving the given one and my suspicion is most would prefer improving the given state(s).
What I don't think "how much of the universe is tractable" by itself captures is "how much more effective would an SI be it if had the ability to interact with a smaller or larger part of the world versus if it had to work out everything by theory". I think it's clear human beings are more effective given an ability to interact with the world. It doesn't seem LLMs get that much more effective.
I think a lot of AI safety arguments assume an SI would be able to deal with problems in a completely tractable/purely-by-theory fashion. Often that is not needed for the argument and it seems implausible to those not believing in such a strongly tractable universe.
My personal intuition is that as one tries to deal with more complex systems effectively, one has to use a more and more experimental/interaction-based approaches regardless of one intelligence. But I don't think that means you can't have a very effective SI following that approach. And whether this intuition is correct remains to be seen.
I think the modeling dimension to add is "how much trial and error is needed". Just about any real world thing that isn't a computer program or simple, frictionless physical object, has some degree of unpredictability. This means using and manipulating it effectively requires a process of discovery - one can't just spit out a result based on a theory.
Could an SI spit out a recipe for a killer virus just from reading current literature? I doubt it. Could it construct such thing given a sufficiently automated lab (and maybe humans to practice on)? That seems much more plausible.
The reason I care if something is a person or not is that "caring about people" is part of my values.
If one is acting in the world, I would say one's sense of what a person is has to intimately connected with value of "caring about people". My caring about people is connecting to my experience of people - there are people I never met I care about in the abstract but that's from extrapolating my immediate experience of people.
I would expect in a world where they weren't people is that there would be some feature you could point to in humans which cannot be found in mental models of people
It seems like an easy criteria would be "exist entirely independently from me". My mental models of just about everything, including people, are sketchy, feel like me "doing something", etc. I can't effortlessly have a conversation with any mental model I have of a person, for example. Oddly, enough I can have a conversation with another as one of my mental models or internals characters (I'm a frequency DnD GM and I have NPCs I often like playing). Mental models and characters seem more like add-ons to my ordinary consciousness.
I don't think there are fundamental barriers. Sensory and motor networks, and types of senses and actions that people don't have, are well along. And the HuggingGPT work shows that they're surprisingly easy to integrate with LLMs. That plus error-checking are how humans successfully act in the real world.
I don't think the existence of sensors is the problem. I believe that self-driving cars, a key example, have problems regardless of their sensor level. I see the key hurdle as ad-hoc action in the world. Overall, all of our knowledge about neural networks, including LLMs, is a combination of heuristic observations and mathematical and other intuitions. So I'm not certain that this hurdle won't be overcome but I'd still like put the reasons that it could be fundamental.
What LLMs seems to do really well is pull together pieces of information and make deductions about them. What they seem to do less well is reconciling an "outline" of a situation with the particular details involved (Something I've found ChatGPT reliably does badly is reconciling further detail you supply once it's summarized a novel). A human or even an animal, is very good at interacting with complex, changing, multilayered situations that they only have a partial understanding of - especially staying within various safe zones that avoid different dangers. Driving a car is an example of this - you have a bunch of intersecting constraints that can come from a very wide range of things that can happen (but usually don't). Slowing (or not) when you see a child's ball go into the road is an archetypal example.
I mean, most efforts to use deep learning in robotics have foundered on the problem that generating enough information to teach the thing to act in the world is extremely difficult. Which implies that the only way that these things can be taught to deal with a complex situation is by roughly complete modeling of it and in real world action situations, that simply may not be possible (contrast with video games or board games where summary of the rules is given and an uncertainty is "known unknowns").
...having an external code loop that calls multiple networks to check markers of accuracy and effectiveness is scary and promising.
Maybe but methods like this have been tried without neural nets for a while and haven't by themselves demonstrated effectiveness. Of course, some code could produce AGI then nautral LLMs plus some code could produce AGI so the question is how much needs to be added.
Constructions like Auto-GPT, Baby AGI and so-forth are fairly easy to imagine. Just the greater accuracy of ChatGPT with "show your work" suggests them. Essentially, the model is a ChatGPT-like LLM given an internal state through "self-talk" that isn't part of a dialog and an output channel to the "real world" (open internet or whatever). Whether these call the OpenAI api or use an open source model seems a small detail, both approaches are likely to appear because people are playing with essentially every possibility they can imagine.
If these structures really do beget AGI (which I'll assume critically includes the capacity to act effectively in the world), then predictions of doom indeed seem neigh to being realized. The barrier to alignment here is that humans won't be able to monitor the system's self-talk simply because will come at too fast a speed and moreover, intent-to-undesirable may not be obviously. You could include another LLM in the system's self-talk loop as well as including other filters/barriers to it's real world access but all of these could be thwarted by a determined system - and a determined system is what people will aim to build (there was an article about GPT-4 supplying "jailbreaks" for GPT-4, etc). Just much, we've seen "intent drift" in practice with the various Bing Chat scare stories that made the rounds recently (before being limited, Bing Chat seemed drift in intent until it "got mad" and then became fixated. This isn't strange because it's a human behavior one can observe and predict online).
However, it seems more plausible to me that there are still fundamental barriers to producing a computer/software system able to act effectively in the world. I'd see the distinction between being/seeming generically intelligent (apparent smartness), as LLMs certainly seem and acting effectively in the world as difference between drawing correct answer to complex questions 80% of the time and making seemingly simpler judgements that relate to all aspects of the world but with a 99.9..% accuracy (plus having a complex systems of effective fall-backs). Essentially the difference between ChatGPT and a self-driving car. It seems plausible to me that such goal seeking can't easily be instilled in an LLM or similar neural net by the standard training loop even though that loop tends to produce apparent smarts and produce it better over time. But hey, me and other skeptics could be wrong, in which case there's reason to worry now.
My impression is that lesswrong often uses "alignment with X" to mean "does what X says". But it seems the ability to conditionally delegate is a key part of alignment in this. An AI is aligned with me and I tell it "do what Y says subject to such-and-such constraints and maintaining such-and-such goals". So failure of ChatGPT to be safe in OpenAI's sense is a failure of delegation.
Overall, the tendency of ChatGPT to ignore previous input is kind of the center of it's limits/problems.
I tend to think and I certainly hope that we aren't looking at dangerous AGI at some small GPT-x iteration. 'Cause while the "pause" looks desirable in the abstract, it also seems unlikely to do much in practice.
But the thing I would to point out is; you have people looking the potential dangers of present AI, seeing regulation as a logical step, and then noticing that the regulatory system of modern states, especially the US, has become a complete disaster - corrupt, "adversarial" and ineffective.
Here, I'd like to point out that those caring about AI safety ought to care the general issues of mundane safety regulation because the present situation of it having been gutted (through regulatory capture, The Washington Monument syndrome, "starve the beast" ideology and so-forth) means that it's not available for AI safety either.
I’d also say that AI is fundamentally different from all prior inventions. This is an amazing tool, but it is not only a tool, it is the coming into existence of intelligence that exceeds our own in strength and speed, likely vastly so.
I think the above quote is the key thing. Human beings have a lot of intuitions and analogies about tools, technologies and social change. As far as I can tell, all of these involve the intuition that technologies simply magnify the effect of human labor, intentions and activities. AGI would be a thing which could act entirely autonomously from humans in many if not all areas of activity and these base human intuitions and analogies arguably wouldn't apply.
And the thing is, most of the things that have become dangerous when connected to the web have become dangerous when human hackers discovered novel uses for them - IoT light bulbs notably (yes, these light bulb actual harm as the drivers of DoS attacks etc). And the dangers of just statically exploitable systems have increased over time as ill-intentioned humans learn more misuses of them. Moreover, such uses include immediate bad-acting as well as cobbling together a fully bad-aligned system (adding invisible statefullness for example). And LLM seems inherently insecure on a wholly different level than an OS, database or etc - an LLM's behavior is fundamentally unspecified.
I'd say my point above would generalize to "there are no strong borders between 'ordinary language acts' and 'genuine hacks'" as far as what level of manipulation ability one can gain over model output. The main further danger would be if the model was given more output channels with which an attacker could work mischief. And that may be appearing as well - notably: https://openai.com/blog/chatgpt-plugins
I would like to offer the idea that "jail broken" versus "not jail broken" might not have clear enough meaning in the context of what you're looking for.
I think people view "Jail broken" as equivalent to an iPhone where the user escalated privileges or a data-driven GUI where you've figured out how to run arbitrary SQL on the database by inputting some escape codes first.
But when an LLM in "confined" in "jail", that jail is simply some text commands, which modify the user's text commands - more or less with a "write as if" statement or the many equivalents. But such a "jail" isn't fundamentally different from the directions that the thing takes from the user (which are often "write as if" as well). Rather than using a programming language with logical defined constructs and separations, the LLM is "just using language" and every distinction is in the end approximate, derived from a complex average of language responses found on the net. Moreover, text that comes later can modify text that comes before in all sorts of way. Which is to say the distinction between "not jail broken" and "jail broken" is approximate, average, and there will be places in between.
Getting an LLM to say a given thing is thus a somewhat additive problem. Pile enough assumptions together that it's training set will usually express an opinion and the LLM will express that opinion, "jail broken" or not.
I believe that Marcus' point is that there are classes of problems that tend to be hard for LLMs (biological reasoning, physical reasoning, social reasoning, practical reasoning, object and individual tracking, nonsequiturs). The argument is that problems in these class will continue to hard. [1]
But I think there's a larger issue. A lot of the discussion involve hostility to a given critic of AI "moving the goal posts". As described, Model X(1) is introduced, critic notices limitation L(1), Model X(2) addresses and critics says they're unconvinced and notes limitation L(2) and so-on. The critic of these critics says this approach is unfair, a bad argument, etc.
However, what the "moving the goal posts" objection misses, in my opinion, is the context of the claim that's being made when someone says X(n) is generally intelligent. This claim isn't about giving the creator of a model credit or an award. The claim is about whether a thing has a flexibility akin to that of a human being (especially the flexible, robust goal seeking ability of a human, an ability that could make a thing dangerous) and we don't actually have a clear, exact formulation of what the flexible intelligence of a human consists of. The Turing Test might not be the best AGI test but it's put in an open-ended fashion because there's no codified set of "prove you're like a human" questions.
Which is to say, Gary Marcus aside, if models keep advancing and if people keep finding new capacities that each model lacks, it will be perfectly reasonable to put the situation as "it's not AGI yet" as long as these capacities are clearly significant capacities of human intelligence. There wouldn't even need to be a set pattern to capacities critics cited. Again, it's not about argument fairness etc, it's that this sort of thing is all we have, for now, as a test of AGI.
[1 ]https://garymarcus.substack.com/p/what-does-it-mean-when-an-ai-fails
The advertising question was just an example of the general trust question. Another example is that a chatbot may come to seem unreliable through "not understanding" the words it produces. Here it's common for current LLMs to periodically give the impression of "not understanding what they say" by periodically producing output that's contradictory to what they previous outputted or which involves an inappropriate use of a word. Just consider a common complaint between humans is "you don't know what love means". Yet another example is this. Large language models today are often controlled by engineered prompts and hackers have had considerable success getting around any constraints which these prompts impose. This sort of unreliability indicates any "promise" of a chatbot is going to be questionable, which can be seen as a violation of trust.
I think that humans are generally very good at intuitively understanding contextuality in interpersonal patterns
Well, one aspect here is the "a chatbot relationship can be a real relationship" assumption seems to imply that some of the contexts of a chatbot relationship would be shared with a real relationship. Perhaps this would be processed by most people as "a real relationship but still not the same as a relationship with a person" but there are some indications that this might not always happen. The Google engineer who tried to get a lawyer for a chatbot they believed was being "held captive" by Google comes to mind.
As far as humans being good at context, it depends what one means by good. On the one hand, most people succeed most of the time at treating other people according to broad social relationship those other people fall into. IE, treating children as children, bosses as bosses, platonic friends as platonic friends etc. But one should consider that some of the largest challenges people face in human society involve changing their social relationship with another person - starting changing a stranger relationship or platonic friendship relationship to a romantic relationship, changing a boyfriend/girlfriend relationship to a husband/wife relationship, even changing a acquaintance relationship to a friendship relationship and changing a stranger relationship to an employee relationship, etc. This type of transition is hard for people virtually by definition since it involves various kinds of competition. These are considered "life's challenges".
A lot of human "bad behavior" is attributed to one person using pressure to force one of these relationship changes or in reacting to "losing" in the context of a social relationship (being dumped, fired, divorced, etc). And lot of socializing involves teaching humans to not violate social norms as they attempt these changes of relationships. Which comes back to the question of whether a chatbot would help teach a person to "gracefully" move between these social relationships. Again, I'm not saying a chatbot romance would automatically be problematic but I think these issues need addressing.
I don't think romantic relationships with robotic or computer partners should be automatically dismissed. They should be taken seriously. However, there are two objections to a chatbot romance that I don't see being addressed by the article:
- A romantic or intimate relationship is generally said to involve trust. A common implicit assumption of a romantic relationship is that there is something like a mutual advisor relationship between the two people involved. I might ask my real life partner "should I buy that house", "should I take that job", "Is dark matter real" or any number of questions. I don't expect said partner to be infallible but if I discovered their answers were determined by advertisers, I would feel betrayed.
- A romantic or intimate relationships is generally assumed to involve some degree of equality or at minimum mutual consideration. Imo, the issue isn't whether the chatbot might be oppressed by the person but rather that romantic relationships are often seen as something like models and training for a person's relationships with the other humans around them in general (friends, co-workers, clients, collaborators in common projects). A person feeling like they have a relationship with a chatbot, when the situation is that the chatbot merely flatters the person and doesn't have any needs that the person has to work to satisfy, could result in a person not thinking they need to put any effort into understanding the needs of the people around them. And considering the needs of other beings is a difficult problem.
I think these should be grappled with. Human relationships, romantic or otherwise, involve mutuality and trust and so I think it's important to consider where chatbots fit in with that.
There's no mathematical solution for single-player, non-zero sum games of any sort. All these constructs lead to is arguments about "what is rational". If you a full math model of a "rational entity", then you could get a mathematically defined solution.
This is why I prefer evolutionary game theory to classical game theory. Evolutionary game theory generally has models of its actors and thus guarantees a solution to the problems it posits. One can argue with the models and I would say that's where such arguments most fruitfully should be.