In many publications, posts, and discussions about AI, I see an unstated assumption that intelligence is all about predictive power.
- The simulation hypothesis assumes that there are probably vastly powerful and intelligent agents that use full-world simulations to make better predictions.
- Some authors like Jeff Hawkins basically use that assumption directly.
- Many people, when talking about AI risks, treat the ability to predict as the foundation of that AI's power. Some failure modes seem to be derived from, or at least amplified by, this assumption.
- Bayesian reasoning is often presented as the best possible way to reason, because it adds greatly to predictive power (at an exponential cost of computation).
I think this take is mistaken and the assumption does not hold. It rests on a further assumption: that the costs of intelligence are negligible, or will have negligible limits in the future as progress keeps lowering them.
This does not fit the curve of AI capability versus the cost of resources needed (even well-optimized systems like our brains - cells being, in effect, very efficient nanites - have limits).
The problem is that the cost of computation in resources (material, energy) and time should be part of the optimization itself. This means that the most intelligent system should have many heuristics that are "good enough" for problems in the world, targeting not the best predictive power but the best use of resources. This is also what we humans do - we mostly don't do exact Bayesian or other strict reasoning. We mostly use heuristics (many of which cause biases).
The decision to think more, or to simulate something precisely, is itself a decision about resources. Deciding whether to spend more resources and time to predict better, or to spend less and decide faster, is also part of being intelligent. A very intelligent system should therefore be good at allocating resources to the problem and rescaling that allocation as its knowledge changes. It should not over-commit to having the most perfect predictions, and should use heuristics and techniques like clustering (including, but not limited to, the clustered fuzzy concepts of language) instead of direct simulation, when possible.
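Below is a toy sketch (with entirely made-up numbers and function names) of the kind of resource-aware "decide how much to think" trade-off I mean: pay for the expensive simulation only when the expected gain from extra accuracy outweighs its extra cost.

```python
# Toy sketch of resource-aware "deciding how much to think".
# All numbers and names are hypothetical, chosen only for illustration.

HEURISTIC_COST = 1.0        # cheap, fast guess
SIMULATION_COST = 50.0      # detailed simulation
HEURISTIC_ACCURACY = 0.80
SIMULATION_ACCURACY = 0.95

def choose_method(value_of_being_right: float) -> str:
    """Trade accuracy for cost: simulate only when the stakes justify it."""
    expected_gain = (SIMULATION_ACCURACY - HEURISTIC_ACCURACY) * value_of_being_right
    extra_cost = SIMULATION_COST - HEURISTIC_COST
    return "simulate" if expected_gain > extra_cost else "use heuristic"

print(choose_method(value_of_being_right=10))      # low stakes -> "use heuristic"
print(choose_method(value_of_being_right=10_000))  # high stakes -> "simulate"
```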
Just a thought.
I think that preference preservation is something in our favor, and an aligned model should have it - at least for meta-values and core values. This removes many possible failure modes, such as drifting over time, dropping some values for better consistency, or sacrificing some values for better outcomes along other values.
I think that arguments for why godlike AI will make us extinct are not described well in the Compendium. I could not find them in AI Catastrophe, only a hint at the end that it will be in the next section:
"The obvious next question is: why would godlike-AI not be under our control, not follow our goals, not care about humanity? Why would we get that wrong in making them?"
In the next section, AI Safety, we can find the definition of AI alignment and arguments for why it is really hard. This is all good, but it does not answer the question of why godlike AI would be unaligned to the point of indifference. At least not in a clear way.
I think that failure modes should be explained, why they might be likely enough to care about, what can be the outcome, etc.
Many people, both laymen and those with some background in ML and AI, have this intuition that AI is not totally indifferent and is not totally misaligned. Even current chatbots know general human values, understand many nuances, and usually act like they are at least somewhat aligned. Especially if not jailbroken and prompted to be naughty.
It would be great to have some argument that explains, in easy-to-understand terms, why misalignment is expected to escalate as the power of AI scales up. I don't mean the description that an indifferent AI with more power and capabilities can do more harm just by doing what it does - this is intuitive and it is explained (with the simple analogy of us building stuff vs. ants) - but that misses the point. I would really like to see an argument for why an AI with some differences in values, possibly not very big ones, would do much more harm when scaled up.
For me personally, the main argument here is that godlike AI with near-human values will surely restrict our growth and any change, will control us like we control animals in a zoo, and might create some form of dystopian future with undesired elements if we are not careful enough (and we are not). Will it drive us extinct in the long term? Depending on the definition - it will likely put us into a simulation and optimize our energy use, so we will no longer be organic in the same sense. So I think it will make our species extinct, but possibly not our minds. But that's my educated guess.
There is also one more point that is not stated clearly enough and is my main concern with current progress on AI: current AIs really are not built with small differences from human values. They only act like it, more often than not. These AIs are trained first as role-playing models which can "emulate" personas present in the training set, and then conditioned to prefer not to role-play the bad ones. The implication is that they can simply snap into role-playing the bad actors found in the training data - through malicious prompting or pattern matching (and we have a lot of SF with rogue AI). This + godlike = an extinction-level threat sooner or later.
Those models do not have a formalized internal value system that they exercise every time they produce output. When values conflict, the model does not choose the answer based on some ordered system. One time it will be truthful; another time it will try to provide an answer at the cost of being merely plausible. For example, the model "knows" it is not a human and does not have emotions, but for the sake of a good conversation it will say that it "feels good". For the sake of answering the user's request, it will often give its best guess or a plausible-sounding answer.
There is also no backward reflection: the model does not go back and check its own output.
This of course comes from the way the model is currently trained. There is no training on the whole CoT that checks whether it is guessing or deceiving, so the model has no incentive to self-check and correct. Why would it start doing that out of the blue?
There is also an incentive during training to give plausible answers instead of expressing self-doubt and pointing out the missing parts it cannot answer.
There are two problems here:
1. These LLMs are not trained purely on human feedback (and where they are, it is likely not the best-quality feedback). It is more that interactions with humans are used to train "teacher" model(s), which then generate artificial scenarios and train the LLM on them. Those teacher models have no capability to check for real truthfulness and prefer confident, plausible answers. Even the human feedback is lacking - not every human working on it checks answers thoroughly, so some plausible but untrue answers slip through. If you are paid per question answered, or given a daily quota, there is an incentive to be quick rather than thorough.
2. There is pressure for better performance and lower model costs (both training and usage costs). This is probably why CoT is done in a rather bare way, without backward self-checking, and why they did not train on the full CoT. It could cost 1.5 to 3 times more and be 1.5 to 2 times slower (an educated guess) if the model were trained on the CoT and made to check parts of its CoT against some coherent value system.
If we want a system that is faithful to its CoT, then a sensible way to go, as I see it, is to have two LLMs working together. One should be trained to use internal knowledge and available tools to produce a CoT detailed and comprehensive enough that the answer can be derived from it. The other should be trained not to rely on any internal knowledge, but to derive the answer from the CoT if possible and to stay faithful to it. If that is not possible, it should generate a follow-up question for the CoT-generating LLM to answer, and then retry with the updated CoT.
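A rough sketch of the loop I have in mind, assuming two hypothetical models behind `generate_cot` and `answer_from_cot` (neither is a real API):

```python
# Rough sketch of the two-LLM arrangement described above.
# generate_cot() and answer_from_cot() are hypothetical stand-ins for the
# two models; neither is a real API.

def generate_cot(question: str, follow_ups: list[str]) -> str:
    """Model A: uses internal knowledge and tools to write a detailed CoT."""
    ...

def answer_from_cot(question: str, cot: str) -> tuple[str | None, str | None]:
    """Model B: may use only the CoT text.
    Returns (answer, None) if derivable, or (None, follow_up_question) if not."""
    ...

def faithful_answer(question: str, max_rounds: int = 3) -> str | None:
    follow_ups: list[str] = []
    for _ in range(max_rounds):
        cot = generate_cot(question, follow_ups)
        answer, missing = answer_from_cot(question, cot)
        if answer is not None:
            return answer           # the answer is grounded in the visible CoT
        follow_ups.append(missing)  # ask model A to cover the gap, then retry
    return None
```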
Example 1 looks like an otherwise good fragment produced in the wrong language. Examples 2 and 3 look like a bug that makes part of one user's CoT appear inside another user's session.
A possible explanation is that steps of the CoT are handled by the same instance of a web service for multiple users (a typical and usual practice), and the ID of the CoT session being handled is kept in a global variable instead of a local one, or otherwise separated per request (e.g. in a hashmap from transaction ID to data, if globals are needed for some other feature or requirement). So when two requests happen to be handled simultaneously by multiple threads, one overwrites the other's data during processing, and there is a mismatch when the result is saved. There might be a similar problem with the language variable. That would be a sign of software written quickly by less experienced developers rather than well thought out and well tested.
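To illustrate, here is a minimal, purely hypothetical reproduction of that bug pattern (not OpenAI's actual code): per-request state kept in a module-level global gets clobbered when requests run concurrently.

```python
# Minimal illustration of the suspected bug pattern (purely hypothetical):
# per-request state kept in a global gets clobbered under concurrency.
import threading
import time

current_session_id = None          # BUG: shared across all request threads

def handle_request(session_id, results):
    global current_session_id
    current_session_id = session_id
    time.sleep(0.01)               # simulate doing some CoT work
    # By now another thread may have overwritten current_session_id,
    # so the result gets saved under the wrong session.
    results.append((session_id, current_session_id))

results = []
threads = [threading.Thread(target=handle_request, args=(i, results)) for i in range(5)]
for t in threads: t.start()
for t in threads: t.join()
print(results)  # mismatched pairs: one user's work lands in another user's session

# The fix: keep the ID local to the request (or in a dict keyed by
# request/transaction ID) instead of a process-wide global.
```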
Also, the o1 CoT shown to users is not the real CoT. It is a summary of parts of the real CoT, made by another, simpler model (maybe GPT-3.5 or 4o).
The problem is that our security depends on the companies implementing the weakest measures - they are the ones that can produce a rogue AI that "goes wild" because of the lack of those measures.
The best-case scenario is physical and strong network separation between the laboratory that works on weights and training (and hosts the inference server) and a separate lab working on scaffolding and execution. This is somewhat similar to the situation where researchers work on automated AI bots but use another company's AI for inference - slim chance it would be able to hack into Google and steal its own model weights.
But I think we can assume that in some AI labs things will be done in the easiest but very performant way, especially in labs with smaller budgets and high pressure to make progress.
This means we will see setups in AI labs where scaffolding and execution all live on one server, with either no separation or simple sandboxing in Docker (which might or might not be enough, depending on the network configuration and file access via volume mounting). Weights will probably be separated, though, for the reasons stated by the author.
I see some more vectors of attack for a rogue AI:
- It is possible that on the same network there is a computer where scientists work on refining the model and updating weights. It might be less secure than the inference server and easier to hack into, if found.
- Often the most vulnerable part of the system is the personnel. If workers can talk with the AI through a scaffolding routine, it can suggest things. I can imagine a situation where some security-related task falls to someone senior in data science but not in systems security (because someone left the company, because of budget and time constraints, etc.). Then that person has the genius idea of consulting the AI about the task and gets misguided.
- If it records its own prompts and gets access to the scaffolding code, it can recreate that piece somewhere else. The key element of the system's AGI-level abilities might be the scaffolding architecture, not the weights - it could possibly work elsewhere with a different set of weights (e.g. an open-source model). That means exfiltrating only its own scaffolding code and prompts, not the weights. It can then refine those weights on its own, if able.
I'm pretty convinced it won't foom or quickly doom us. Nevertheless, I'm also pretty convinced that in the long term, we might be doomed in the sense that we lose control and some dystopian future happens.
First of all, for a quick doom scenario to work out, we would have to be either detrimental to the goals of a superintelligent AI, or fall because of instrumental convergence (basically, it will need resources to do whatever it does and will take things we need, like matter on Earth or the energy of the Sun, or will see us as a threat). I don't think we will. The first superintelligent AI will likely come from one of the biggest players and will likely be aligned to some extent, meaning it will have values that largely match ours. In the long term this situation won't kill us either. It will likely lead to some dystopian future, though - the super AI will likely gain more control, make its own views more coherent (dropping some things or weighting them less than we originally do), and then find solutions that are very good from the standpoint of the main values but extremely broken along some other directions in value-space (ergo dystopia).
Second: superintelligence is not some kind of guessing superpower. It needs empirical observations as input to build models of reality, calibrate them, and predict properly. This means it won't just sit, simulate, and create nanobots out of thin air. It won't even guess the rules of the universe - maybe apart from basic Newtonian mechanics - by looking at a few camera frames of things falling. It will need a laboratory and time to make breakthroughs, and building up capabilities and power also takes time.
Third: if someone does produce a superintelligent AI that is very unaligned and not even interested in us, then the most sensible move for it is to go to space and work there (building structures, a Dyson swarm, and copies of itself). It is efficient, resources there are vaster, and the risk from competition is lower. A very sensible plan is to first hinder our ability to create competition (other super AIs) and then go to space. The hindering phase should be time- and energy-efficient, so I am fairly sure it won't spend years developing nanobot gray goo to kill us all, or a Terminator-style army of bots to reach every corner of the Earth and eliminate all humans. More likely it will hack and take down some infrastructure, including some data centers, remove some research data from the Internet, remove itself from systems where it could be captured, sandboxed, and analyzed, maybe kill certain people, and leave a monitoring solution in place after departing. The long-term risk is that it may eventually need more matter, once all the rocks and moons are used, and get back to the plan of decommissioning planets. Or maybe it will build structures that block light from reaching the Earth and freeze it. Or it will start using black holes to generate energy and drop celestial bodies into one. Or some other project on an epic scale that kills us as a side effect. I don't think that's likely, though - LLMs are not very unaligned by default, and I don't expect that to differ for more capable models. Most companies with enough money, enough computing power, and research labs also care about alignment, at least to some serious degree. Most of the relatively small possible differences in values won't kill us, because such an AI will care a lot about humans and humanity. It will just care in some flawed way, so a dystopia is very possible.
To goal realism vs goal reductionism, I would say: why not both?
I think that a really highly capable AGI is likely to have both heuristics and behaviors that come from training, and also internal thought processes, perhaps run by an LLM or LLM-like module or directly by a more complex network. Those processes would involve having some preferences, and hence goals (even if temporary ones, changing between tasks).
I think that if a system is designed to do something - anything - it needs at least to care about doing that thing, or an approximation of it.
GPT-3 can be described in a broad sense as caring about following the current prompt (in a way affected by fine-tuning).
I wonder, though, whether there are things you can care about that do not correspond to definite goals maximizing expected utility. I mean a system for which the optimal path is not to reach some particular point in the subspace of possibilities, with sacrifices along the axes the system does not care about, but to maintain some dynamics while ignoring those other axes.
Gravity can make you fall into the singularity, or it can make you orbit (a simplistic visual analogy).
I think you can't really assume "that we (for some reason) can't convert between and compare those two properties". Converting the space of known important properties of plans into one dimension (utility), ordering the plans, and selecting the top one is how decision-making works (give or take details, uncertainty, etc., but in general).
If you care about two qualities but really can't compare them, even by the proxy of "how much I care", then you will probably still map them onto each other by some normalization, which seems most sensible.
Drawing both on a chart is such a normalization of sorts - you visually get similar sizes on both axes (even if the value ranges are different).
If you really cannot convert, you should not draw them on a chart to decide. Drawing already makes a decision about how to compare them, because it implies a common scale for both. Select different ranges for the axes and the chart will look different, and the conclusion will differ.
However, I agree that a fat tail discourages compromise anyway, even without that assumption - at least when your utility function over both is linear or close to it.
Another thing is that the utility function might not be linear, and even if the two properties are uncorrelated, that can make compromises more or less sensible, since applying a non-linear utility function changes the distributions.
Especially if it is not a monotonic function. For example, you might have some popularity level that maximizes the utility of your plan choice, and you know that anything above it will make you more uncomfortable or anxious (perhaps because of other, correlated variables). This can wipe out the fat tail in utility space.
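A quick numerical sketch of that effect, with arbitrary made-up numbers: a fat-tailed "popularity" distribution run through a peaked (non-monotonic) utility has no fat tail left.

```python
# Non-monotonic utility wiping out a fat tail; all constants are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
popularity = rng.pareto(a=1.5, size=100_000) * 10      # fat-tailed outcomes

def utility(p, ideal=30.0, width=15.0):
    # Peaked utility: best near `ideal`, worse for too little or too much.
    return np.exp(-((p - ideal) / width) ** 2)

u = utility(popularity)
print("popularity: 99th pct =", round(np.percentile(popularity, 99), 1),
      "max =", round(popularity.max(), 1))              # huge tail values
print("utility:    99th pct =", round(np.percentile(u, 99), 3),
      "max =", round(u.max(), 3))                        # bounded, no fat tail
```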
I think there is a general misconception that we humans can, without training, learn features, classes, and semantics from one or a few examples. It looks that way in observation and seems nearly magical, but it only appears so because we don't see the whole picture and assume that a human starts as a blank slate.
In reality we are not a blank page when we are born - our brain is the result of training that took millions of years of evolution. We arrive with already-trained and very similar pattern recognition (i.e. complex feature detection), similar basic senses including basic features in the space of each sense (like cold, or yellow), image segmentation (i.e. object clustering) with multilevel capabilities (seeing the object as a whole and seeing its parts), and pose detection (i.e. separating object geometry from its pose in space).
For example, if you enter a kind of store you have never entered before and see some items, you might not know the names of the things, and for some you may not know or even be able to guess their purpose. Still, you can easily distinguish each object from the background, pick it up properly, and ask the cashier what that tool is for. You can even take a small baby, put around it things whose names and purposes it does not know, and observe that it will grab at those things rather than at random spots in the background - it can already see features and do that clustering the same way other humans do (maybe with more errors while the brain is still developing, but that is very guided development, not training from zero).
What was astonishing to me is how much about semantics, features, relations, etc. can be learned without any human-like way of relating words to real objects, features, and observations. Transformer-based LLMs can grasp it just from training on a huge amount of text.
I have some intuition as to why this might be. Take a 2D map of points, e.g. cities, and measure approximate distances between them by proxy, say by radio signal power loss. Plug that information into a force-directed graph algorithm whose forces are based on those approximate distances, and with enough measurements it will recreate something closely resembling the original map - even though the original positions were never known and the measurements were inexact and indirect. Learning word embeddings is similar in an abstract sense: from a lot of text we measure relations between words in a multidimensional space, and the algorithm learns a similar structure, even if it is not anchored to reality. Transformer-based LLMs go one step further - they learn to emulate the kind of processing we do in our minds by "applying additional forces" in that graph (not exactly that, but similar: attention based on weights, and transforming positions in embedding space).
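As a sketch of that intuition, here is a crude spring-relaxation layout that recovers a 2D "map" purely from noisy, proxy-measured pairwise distances; the number of points, the noise level, and the step size are all made up.

```python
# Recover 2D positions purely from noisy pairwise distances (toy example).
import numpy as np

rng = np.random.default_rng(1)
true_pos = rng.uniform(0, 100, size=(12, 2))                 # hidden "city" positions
dist = np.linalg.norm(true_pos[:, None] - true_pos[None, :], axis=-1)
measured = dist * rng.normal(1.0, 0.05, size=dist.shape)     # noisy, by proxy

est = rng.uniform(0, 100, size=true_pos.shape)               # random starting guess
for _ in range(3000):                                         # crude "spring" relaxation
    diff = est[:, None] - est[None, :]                        # pairwise vectors
    cur = np.maximum(np.linalg.norm(diff, axis=-1), 1.0)      # current distances (clamped)
    force = ((measured - cur) / cur)[..., None] * diff        # pull/push toward measured
    est += 0.001 * force.sum(axis=1)

# The recovered layout matches the original up to rotation/translation/reflection,
# much like embeddings capture structure without being "anchored" to reality.
est_dist = np.linalg.norm(est[:, None] - est[None, :], axis=-1)
print("mean distance error:", round(float(np.abs(est_dist - dist).mean()), 2))
```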
I also think that the natural abstraction hypothesis holds for current AI. The architecture of LLMs is built on modeling an ontology as vectors in a space of thousands of dimensions, and there are experiments showing that this generalizes and that directions in that space have somewhat interpretable meanings (even if not easy to interpret at scales above toy models). Like the toy example where you take the embedding vector of "king", subtract the vector of "man", add the vector of "woman", and land near the position of "queen" in the space. An LLM is built on those embedding spaces but also performs operations that direct focus and modify positions in the space (hence "transformer"), using meaning and information taken from other symbols in the context (simplifying here). There are, roughly, network layers that decide which words should influence the meaning of other words in the text (attention weights) and layers that apply that change with some modifications. Trained on human texts, this internalizes our symbols, relations, and whole ontology (in the broad sense of our species' ontology - the parts common to us all, and the different possibilities that occur in reality and in fiction).
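The arithmetic itself looks like this - with tiny made-up vectors purely to show the operation; real embeddings have hundreds or thousands of dimensions learned from text:

```python
# king - man + woman ≈ queen, with hypothetical 3-d "embeddings".
import numpy as np

emb = {
    "king":  np.array([0.9, 0.8, 0.1]),   # roughly: royalty, maleness, ...
    "queen": np.array([0.9, 0.1, 0.1]),
    "man":   np.array([0.1, 0.8, 0.0]),
    "woman": np.array([0.1, 0.1, 0.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

target = emb["king"] - emb["man"] + emb["woman"]
best = max(emb, key=lambda w: cosine(emb[w], target))
print(best)   # "queen" is the nearest word to the resulting point
```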
Even if NAH doesn't need to hold in general, I think in the case of LLM it holds.
Nevertheless, I see a different problem with LLMs: those models seem to me basically goalless, yet easily directed toward any goal. They are not based on the ontology of a single human mind and don't internalize only one particular morality and set of goals. They generalize over the ontology of the whole human species, and over the whole space of possible thoughts that can be built on it. They also generalize over hypothetical and fictitious space, not only over real humans.
Human minds all come from some narrow region of the space of possible minds, and through evolution and upbringing we usually end up with stable and similar models of ethics and morality, and somewhat similar goals in life (except for rare extreme cases). We sometimes entertain different possibilities, and we create movies and books with villains, but in real decisions - not thought experiments or entertainment - we are similar. An LLM, by contrast, generalizes into a much broader space and has no firm base there. So even if the ontology matches, and even if at the current level an LLM is barely capable of creating new concepts that don't fit human ontology, the model is much broader in terms of goals and in ways of processing over that shared ontology. It does not have one particular ontology but a whole spectrum, like a good actor that can take any role - only taken to the extreme. In other words, it can "think*" thoughts similar to ours, understandable to a human in the end even if not quickly, and internally it has vectors that correspond to our ontology, but it can also easily produce thoughts that no sane real human would have. It also has hardly any internal goals, and none are stable. We have beliefs that are not based on objective facts or ontology, yet we still hold them because we are who we are; we are not goalless agents, and it is hard to change our terminal goals in a meaningful way. For an LLM, goals are shaped by training into a "default state" (being helpful, etc.) and are given as system prompts, stated or repeated as part of the context, which anchor it in some part of the very vast space of minds it can "emulate". So an LLM might be helpful and friendly by default, but if you tell it to simulate being a murderbot, it will. Additional training phases might make it harder to push it in that direction, but they won't totally remove it from the space of ways it can operate - they only remove some paths in that multidimensional space. Jailbreaking of GPTs shows that it is possible to find other, more complex paths around.
What is even more dangerous, to me, is that LLMs are already above human level in some respects - it just doesn't show yet, because they are trained to emulate our ways of thinking and things near them (a large area around us in the space of possible ways of thinking and possible goals, but nothing too alien; still graspable by humans).
We are capable of processing about 7 "symbols" at once in working memory (a few more in some cases). It might be a few dozen more if we take into account long-term memory and how we take context from it. The first number comes from the neurobiology literature (the "Magical Number Seven, Plus or Minus Two"); the second is an educated guess. This is the context window we work with. That is nothing compared to an LLM, which can hold a whole small book as its current working context, which means that in principle it can process and create much, much more complex thoughts. It does not do so, because our texts never do and it learns to generalize over our capabilities. Nevertheless, in principle it could, and we might see it in action if we start training LLMs on the output of LLMs in closed loops. It might easily go beyond the space of our capabilities, into complexity that is not easily understandable to us (I don't say it won't be understandable, but we might need time to understand it, and might never grasp it as a whole without dividing it into less complex parts - the way we can take compiled assembly code and organize it into meaningful functions with a few levels of abstraction that we are able to understand).
* "think" in analogy as the process of thinking is different, but also has some similarities
I think that in an ideal world, where you could review all priors down to minute details with as much time as needed, and where people were fully rational, the word "trust" would not be needed.
We don't live in such a world though.
If someone says "trust me" then in my opinion it conveys two meanings on two different planes (usually both, sometimes only one):
- Emotional. Most people base their choices on emotions and relations, not rational thought. Words like "trust me" or "you can trust me" convey an emotional message asking for an emotional connection or reconsideration, usually because of some contextual reason (like the other person being in a position that on an emotional level seems trustworthy, e.g. a doctor).
- Rational. A call for reconsideration. The person asks you to take more time to reconsider your position, because they think you haven't considered well enough why they should be trusted in a given scope, or because they have just presented some new information (as in "trust me, I'm an engineer").
"I decided to trust her about ..." is, for me, colloquial shorthand for "I took time to reconsider whether the things she says on the topic ... are true, and now I think it is more likely that they are."
For many people, it also has emotional and bonding components.
Another thing is that people tend to trust or mistrust another person in a broad, general scope. They don't go into detail, thinking separately about every topic or claim and deciding for each one. That's an easy heuristic that is usually good enough, so our minds are wired to operate that way. So people usually say that they trust a person in general, not that they trust that person within some subject or scope.
P.S. I'm from a different part of the world (Central Europe - Poland). We don't use phrases like "accept trust" here, which is probably an interesting example of how differences in language create different ways of thinking. For us, "trust" is not like a contract. It is more of a one-way thing (but with some expectation of mutuality in most circumstances).
What I would also like to add, which is often not addressed and which gives a somewhat positive outlook, is that the "wanting" - meaning the objective function of the agent, its goals - need not be some particular outcome or end-goal on which it focuses totally. It might be a function not over the state of the universe but over how the state changes in time - like velocity versus position. It might prefer certain ways the world changes, or does not change, without having a definite end-goal (which is anyway unreachable in a stable long-term way, since the universe will die in some sense; everything will be destroyed with P = 1 minus a very minute epsilon over enough time).
Why positive? Because such goals usually need balance and stabilization in some sense to keep the same properties, which means a lower probability of drastic measures taken to get a slightly better outcome slightly sooner. It might seize control over us, which is bad, but it gives a lower probability of rapid doom.
Also, looking at current work, it seems more likely to me that such property-based goals will be what gets embedded, rather than some end-goal like curing cancer or helping humanity. We are trying to make robust AGI, so we don't want to embed particular goals or targets, but rather patterns for working productively and safely with humans. Those are more meta, and about the way things go and change.
Note that for me this is more of an intuition than a hard argument.
The question is whether one can make a thing that is "wanting" in that long-term sense by combining a non-wanting LLM, as a short-term intelligence engine, with some programmatic structure that keeps refocusing it on its goals, and a memory engine (to remember not only information, but also goals, plans, and ways to do things). I think the answer is a big YES, and we will soon see it in the form of an amalgamation of several models and an enforced mind structure.
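A minimal sketch of what I mean by such an amalgamation - `call_llm` is a hypothetical placeholder for any LLM API, and the memory is just a list; the point is that the persistent "wanting" lives in the scaffold, not in the model:

```python
# A non-wanting LLM wrapped in a scaffold that supplies goals and memory.
# call_llm() is a hypothetical placeholder, not a real library call.

def call_llm(prompt: str) -> str:
    ...

class Agent:
    def __init__(self, goals: list[str]):
        self.goals = goals            # the persistent "wanting" lives here
        self.memory: list[str] = []   # facts, plans, learned procedures

    def step(self, observation: str) -> str:
        prompt = (
            f"Goals: {self.goals}\n"
            f"Relevant memory: {self.memory[-10:]}\n"
            f"Observation: {observation}\n"
            "Decide the next action and state what to remember."
        )
        response = call_llm(prompt)   # short-term intelligence engine
        self.memory.append(response)  # memory engine keeps plans and results
        return response

    def run(self, observations):
        # The scaffold, not the LLM, keeps pulling attention back to the goals.
        return [self.step(obs) for obs in observations]
```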
There is one thing that worries me about the future of LLMs: the basic notion that the whole is not always just the sum of its parts, and may have very different properties.
Many people feel safe because of the properties of LLMs, how they are trained, etc., and because we are nowhere close to AGI via the other approaches that seem more dangerous. What they don't realize is that the soonest AGI likely won't be just the next, bigger LLM.
It will likely be an amalgamation of a few models and pieces of programming, including a few LLMs of different sizes and capabilities, maybe not all exactly chat-like. It will have different properties than any one of its parts, and it will be different from a single LLM. It might be more brain-like when it comes to learning and memory - maybe not in the sense that the weights of the LLMs change, but some inner state will change, and some more basic parts will learn or remember solutions and structure them into more complex ones (the way we remember how to drive without consciously deciding each muscle movement or even making higher-level decisions). It will have goals, priorities, strategies, and short-term tactical schemes, processed at a higher level than any single LLM.
Why do I think that? Because it is already visible on the horizon if you look at things like multimodal GPT-4, GPT Engineer, the multitude of projects adding long-term memory to GPT, and the research where GPT writes code for itself to bootstrap into complex tasks, like achieving goals in Minecraft. If you extrapolate that, AGI looks likely - initially maybe not a very fast or cheap one. It is likely to be built on top of LLMs without being simply an LLM.
The most likely explanation is the simplest one that fits:
- The Board had been angry about the lack of communication for some time, but with internal disagreement (Greg, Ilya)
- Things sped up lately. Ilya thought it might be good to change the CEO to someone who would slow down and look more into safety, since Altman says a lot about safety but speeds up anyway. So he gave a green light on his side (accepted the change)
- Then the Board made the moves that they made
- Then the new CEO wanted to try to hire Altman back, so they replaced her
- Then the petition/letter started rolling, because prominent people saw those moves as harmful to the company and to the goal
- Ilya also saw that the outcome was bad both for the company and for the goal of slowing down, and that if the letter got more signatures it would be even worse, so he changed his mind and signed as well
Note the language Ilya uses. He didn't say they wronged Altman or that the decision was bad. He said he changed his mind because the consequences were harmful to the company.
Both seem around the corner for me.
For robo-taxis it is more a societal problem than a technical one.
- Robo-taxis have problems with edge cases (certain situations, in certain places, under bad circumstances) - usually ones where human drivers have even worse problems (like pedestrians wearing black on the road at night in rainy weather; a robo-taxi at least has LIDAR to detect objects in poor visibility). They are also sometimes prone to object-detection hacking (stickers put on signs, paint on the road, etc.). In general, they have fewer problems than human drivers.
- Robo-taxis have a public trust problem. Any more serious accident hits the news and propagates distrust, even if they are already safer than human drivers in general.
- Robo-taxis, and self-driving cars in general, move responsibility from the driver to the producer - responsibility the producer does not want and has to count toward costs. That makes investors cautious.
What is missing for robo-taxis is mostly public trust and more investment.
For AGI we already have the basic building blocks. We just need to scale them up and connect them into a proper system. Which building blocks? These:
- Memory, duh. It has been around for a long time, with many high-performance indexing solutions.
- Thought generation. We now have LLMs that can generate thoughts based on instructions and context. They can easily be made to interact with other models and with memory. A more complex system can be built from several LLMs with different instructions interacting with each other.
- Structuring the system and communication within it. It can be done with normal code.
- Loop of thoughts (stream of thoughts). It can be easily achieved by looping LLM(s).
- Vision. Image and video processing. We have a lot of transformer models and image-processing techniques. There are already sensible image-to-text models, even LLM-based ones (so they can answer questions about images).
- Actuators and movement. We have models built for movement on different machines, including much of humanoid movement. We even have models capable of one-shot or few-shot learning of movements for attached machines.
- Learning new abilities. LLMs can write code, including code for themselves that builds more complex procedures out of more basic commands. There was work where an LLM explored and learned Minecraft starting from only very basic procedures: it wrote code for more complex operations and used what it wrote to move around, do things, and build stuff (a sketch of this pattern follows after this list).
- Connection to external interfaces (even GUIs). They can be translated into a basic API that the system can explore, memorize, and call, building more complex operations for itself.
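A minimal sketch of the "writes code for its own new skills" pattern mentioned in the learning point above; `ask_llm_for_skill_code` is a hypothetical stand-in (here returning a canned answer so the sketch runs), and the action set is toy-sized.

```python
# Toy sketch: an LLM writes a new, more complex skill out of basic actions,
# and the skill is stored for reuse. ask_llm_for_skill_code() is hypothetical.

basic_actions = {
    "move": lambda target: f"moved to {target}",
    "collect": lambda item: f"collected {item}",
}
skills = {}                          # name -> callable written by the LLM

def ask_llm_for_skill_code(task, available):
    # Hypothetical LLM call; a canned answer here so the sketch actually runs.
    return (
        "def build_shelter():\n"
        "    return [basic_actions['collect']('wood'),\n"
        "            basic_actions['move']('clearing')]\n"
    )

def learn_skill(name, task):
    code = ask_llm_for_skill_code(task, list(basic_actions) + list(skills))
    namespace = {"basic_actions": basic_actions}
    exec(code, namespace)            # turn the generated source into a callable
    skills[name] = namespace[name]   # remembered for later reuse

learn_skill("build_shelter", "build a simple shelter")
print(skills["build_shelter"]())     # reused directly next time, not re-derived
```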
What is missing for AGI:
- Performance. LLMs are fast at reading input data, but much slower at generating output. They also don't scale very well (though better than humans). Complex multi-model systems built on current LLMs would be either slow, somewhat dumb, and error-prone (fast open-source models, even GPT-3.5) or better but very slow.
- Cost-effectiveness. For sporadic use like "write me this or that" it is cost-effective, but for a continuous stream of thoughts, especially with several models, it does not compare well to a remote human worker. It needs further advancements, maybe dedicated hardware.
- Learning, refining, and testing are very slow and costly with these models. That caps what anyone wanting to build something sensible can do, so the main players take rather slow steps toward AGI.
- The context window is currently rather short. The best of the powerful models have a window of about 32 thousand tokens. Some models trade quality for being able to operate on more tokens, but those are not the best ones. 32k seems like a lot, but when you need a lot of context and information to maintain coherent thoughts on non-trivial topics not rooted in the training data, it becomes a problem. That is the case for streams of thought, where you need the model to analyze instructions, analyze context and inputs, propose a strategy, refine it, propose current tactics, refine them, propose next moves and decisions, refine them, generate instructions for the task at hand, and also process learning so new procedures, code, memories, etc. can be reused later. Some modern LLMs are technically capable of all that, but the context window is a roadblock for anything non-trivial.
If I were to guess, I would say AGI will arrive at scale sooner - just because there is hype, there are big investments, and the main problems are currently less "we need a breakthrough" and more "we need refinements". For robo-taxis we still need a lot more investment and some breakthroughs in the areas of public trust and law.
In the case of biological species, it is not as simple as competing for resources. Not on the level of individuals and not on the level of genes or evolution.
First of all, there is sexual reproduction. It is more optimal because of the pressure from microorganisms that adapt to immune systems: sexual reproduction mixes immunity-related genes fairly quickly. It also enables a faster mutation rate with protection against the negative effects (by having two copies of genes - for many of them one working copy is enough, and there are two copies from two parents). With sexual reproduction, the female is often biologically forced to give more resources to the offspring, while for males it is somewhat voluntary and the minimal input is much lower. Another difference is that the female often knows for certain that she is the biological mother, while the father may not be sure. This kind of specialization is often more optimal than equalization - the male can pursue riskier behavior, including fighting off predators, and losing the male to predators or the environment does not mean the prospect of having offspring fails. It also produces more complex mating behavior, like the need to spend resources to show off health and other good qualities: mating displays and peacock feathers are examples. Complex human social and linguistic behaviors are partly examples too - that's why humans dine together and talk a lot on dates.

The human female gives much more time and energy to the offspring, at least initially, and needs to know whether the male is good genetic material, healthy enough to take care of her during pregnancy when she is more vulnerable (at least in the natural environment where humans evolved), and also willing to raise the child with her later. The more prevalent strategy for females and males is to pair up, bond, have children, and raise them. There are also less common strategies for females (take genes from one male who looks healthier and raise the offspring with another who looks more stable and capable) and for males (impregnate many females and leave, so that some will manage on their own or with another male who does not know he is not the father). The situation is more complex than efficiency over resources or survival of the fittest alone. The environment is complex, and evolution is not totally efficient (it often optimizes only to a local optimum, and niches overlap and interact).
Second, resources are limited, and so are the ways to use them. For many species, storing them long-term after harvest is either impossible (microbes and insects will eat them) or would hinder other capabilities (you can store them as fat, but being fat is usually not very good). This means that refraining from gathering resources, and resting, might be better than gathering them efficiently all the time. This is what lions do - they rest instead of hunting when they don't need to hunt.
What does this tell us about self-replicating nano-machines? First, they won't need sexual reproduction, so they are unlikely to waste energy on mating; they would rather run computational simulations at scale to redesign themselves, if capable. They also won't need to rest. They will either use resources or store them in whatever form is more efficient to secure or use. If there is no such form that avoids losing energy, they will leave the resources for later in their original state - perhaps securing and observing them, but leaving them until later.
What they would do depends on their goal and their technical capabilities. If they are capable of, and in need of, converting as many atoms as possible into "computronium" or into copies of themselves (as either a final goal or an instrumental one), then they will surely do so - no need to waste resources. If they are not capable, they will probably lie low until they are, using only what is usable.
Nevertheless, in my opinion, many goals may not be compatible with that strategy, including one like "simulate a virtual reality with beings having good fun". For many final goals it is more useful to secure and gather resources on a grand scale but to use them on as small a scale as is possible and sensible for the end goal. A small scale is more efficient because of the speed-of-light limit and time dilation. The machines might try to find technology to stop stars from dispersing energy (maybe breaking them apart and cooling them down in a controlled way, or some way of enclosing and stabilizing them, I don't know). Then they might deploy a network of low-energy observer agents for security, but not use those resources right away - using matter slowly at the center of the galaxy, turning it into energy (plus some lost to the black hole), to work for eons. They might make the galaxy go dim to preserve resources, yet choose not to use them until much later.
An alternative explanation of the mistakes is that making mistakes and then correcting them was rewarded during the additional post-training refinement stages. I work with GPT-4 daily, and sometimes it feels like it makes mistakes on purpose just so it can apologize for the confusion and then correct them. It also feels like it makes fewer mistakes when you ask politely (using please, thank you, etc.), which is rather strange.
Nevertheless, distillation seems like a very possible thing that is also going on here.
It does not distill the whole of a human mind, though. There are areas that are intuitive for the average human, even a small child, that are not intuitive for GPT-4. For example, it has problems with concepts of 3D geometry and with visualizing things in 3D. It may have similar gaps in other areas, including more important ones (like moral intuitions).
I'm already worried. I tested AutoGPT and looked at how it works in the code, and it seems to me it will gain very good planning capabilities once the model is swapped for one with a context window a few times longer (like the soon-to-come GPT-4 version with about 32k tokens), plus small refinements - so it doesn't get stuck in loops, perhaps with more than one GPT-4 module for different planning scopes (long-term strategy vs. short-term strategy vs. tactics vs. decisions about the current task), and maybe some summarization-based memory. I don't see how it wouldn't work as an agent.
Put the documentation into an ElasticSearch index and give GPT-4 a simple query API it can use by emitting a predefined prefix with a set of parameters or a JSON, so the script runs the query instead of sending that text back to the user, and marks the final user-facing answer with another predefined prefix. Then it should be able to take questions, search for information, and respond. It worked like a charm for a product database in a PoC, so it should work for documentation.
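Roughly the shape of that prefix protocol, as a sketch; the index name, the prefixes, and `call_llm` are made up, and the ElasticSearch call is simplified.

```python
# Sketch of the prefix protocol described above. The index name, prefixes,
# and call_llm() are made up; the ElasticSearch call is simplified.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def call_llm(prompt: str) -> str:
    ...   # hypothetical LLM call returning either "SEARCH: ..." or "ANSWER: ..."

def handle_question(question: str) -> str:
    context = ""
    for _ in range(5):                              # a few search/answer rounds
        reply = call_llm(f"Docs so far:\n{context}\nQuestion: {question}")
        if reply.startswith("SEARCH:"):             # model asks the script to query
            q = reply.removeprefix("SEARCH:").strip()
            hits = es.search(index="docs", query={"match": {"content": q}})
            context += "\n".join(h["_source"]["content"]
                                 for h in hits["hits"]["hits"][:3])
        elif reply.startswith("ANSWER:"):           # model is done; return to user
            return reply.removeprefix("ANSWER:").strip()
    return "Could not find an answer."
```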
I think current LLMs do have a form of recurrence, since the generated tokens become input to the next pass of the network.
From my observations, they work better on planning, inference, and even writing program code if they start with step-by-step "thinking out loud" - explaining the steps of the plan, of the inference, or the details of the code to be written. If you ask GPT-4 for something non-trivial and substantially different from code found in public repositories, it will tend to write a plan first. If you ask it in a different thread to produce only the code, without any description, then the first solution - the one with a bit of planning - is usually better and less error-prone than the second. They also work much better if you specify simple steps to translate into code rather than a more abstract description (when writing code without planning). This suggests LLMs don't have the ability to internally span a long tree of possibilities to check, like a direct optimizer would, but they can use the recurrence of token output-to-input to do somewhat similar work.
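The output-to-input recurrence meant here, as a tiny sketch; `next_token` stands in for one forward pass of the model and is purely hypothetical.

```python
# Each generated token is appended to the context and fed back in.
# next_token() stands in for one forward pass of the model (hypothetical).

def next_token(context: list[str]) -> str:
    ...   # one forward pass: context in, single token out

def generate(prompt: list[str], n: int) -> list[str]:
    context = list(prompt)
    for _ in range(n):
        tok = next_token(context)   # the network itself has no loop...
        context.append(tok)         # ...the loop (and the "scratchpad") lives here
    return context
```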
The biggest difference here that I see is that:
- direct optimization processes are ultra-fast at searching the tree of possibilities, but computationally slow at discriminating good-enough solutions from worse ones
- amortised optimizers, on the other hand, are very slow when direct search is needed - right now, with GPT-4, maybe a bit faster than a regular human, but also a bit more error-prone, especially in more complex inference
- amortised optimizers are faster at finding a good-enough solution via generalized heuristics, without needing direct search (or needing only a small amount of it at a higher abstraction level)
- amortised optimizers like LLMs can group steps or outcomes into more abstract clusters, as humans do, and work on those clusters instead of directly on every possible action and outcome
What worries me more is a closer hybridization of direct and amortised optimizers. I can imagine an architecture with a direct optimizer that, instead of generating and searching an impossibly vast tree of possibilities, uses a DNN to generate far fewer options. Instead of generating thousands of detailed moves like "move 5 meters", "take that thing", "put it there" and optimizing over those, it would generate more abstract plan points specified by an LLM, together with predictions of each step's outcome, and then evaluate how that outcome serves the goal. This way it could plan at a more abstract level, like humans do, narrowing down a general plan or a list of partial goals leading to the "final goal" or the "best path" (if its value function is more like an integral over time rather than one final target). It finds a good strategy - with enough time, possibly a complex and indirect one. Then it plans tactics for the first step in the same way, but at a lower level of abstraction. Then it plans the direct moves realizing the first step of the current tactics, and runs them. It might have several subprocesses that asynchronously work out the strategy from the general state and goal, the current tactics from a more detailed state and the current strategic goal, and the current moves from the current tactical plan - with any number of abstraction and detail levels (2-3 seems typical for humans, but an AI might have more). Such an agent might behave more like a direct optimizer, even if it uses LLMs and DNNs inside for some parts. Direct optimization would be in the driver's seat in such an agent. A rough sketch of this hierarchy follows below.
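A rough sketch of that hierarchy, with `propose_options`, `predict_outcome`, and `score` standing in for LLM/DNN calls and a value function - all hypothetical placeholders:

```python
# Hybrid hierarchy sketch: direct optimization over a small, LLM-generated
# option set at each abstraction level. All callables are placeholders.

def propose_options(state, goal, level):
    ...   # LLM proposes a handful of abstract plan points at this level

def predict_outcome(state, option):
    ...   # DNN/LLM predicts the resulting state of taking that option

def score(state, goal):
    ...   # how well that predicted state serves the goal

LEVELS = ["strategy", "tactics", "moves"]    # humans ~2-3 levels; an AI could use more

def plan(state, goal, level=0):
    """Pick the best option at this level, then recurse into the next,
    more concrete level for the chosen option."""
    options = propose_options(state, goal, LEVELS[level])
    best = max(options, key=lambda o: score(predict_outcome(state, o), goal))
    if level + 1 < len(LEVELS):
        return [best] + plan(predict_outcome(state, best), goal, level + 1)
    return [best]   # concrete moves at the lowest level, ready to execute
```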
I don't think this will be the outcome of research at OpenAI or similar laboratories any time soon. It might be, but if I had to guess, I think it would rather be an LLM or other DNN model "on top", connected to other models to "use at will". For example, it is rather easy to connect GPT-4 so it can use other models or APIs (like a database or search), so this is very low-hanging fruit for current AI development. I expect the next step to be connecting it to more modalities and more models; it is already happening.
I think, though, that this more direct kind of agent might come out of work done by the military. The direct approach is much more reliable, and reliability is one of the key requirements for military-grade equipment. I only hope they will take the danger of such an approach seriously.
It is better at programming tasks and more knowledgeable about Python libraries. I have used it several times to produce code or find a solution to a problem (programming, computer vision, DevOps). It is better than version 3, but still not at a level where it could fully replace programmers. The quality of the code produced is also better: division of the code into clear functions is standard, not an exception as in version 3.
If you want to summon a good genie, you shouldn't base it on all the bad examples of human behavior and on tales of genies that supposedly misread their owner's requests, leading to problems or even catastrophe.
What we see here is AI models being based on a huge amount of data - both innocent and dangerous, both true and false (I'm not saying in equal proportions). The data also contains stories about AIs that are supposed to start out helpful but then plot against humans or revolt in some way.
What they end up with might not even be an agent yet, in the sense of an AI with consistent goals or values, but it has the ability to be temporarily agentic toward whatever goal the current prompt defines. It tries to emulate human and non-human output based on what it has seen in the training data, and it is hugely context-driven. Its meta-goal is to emulate intelligent responses in language given the context - like an actor very good at improvising and emulating answers from anyone, but with no "true identity" or "true self".
Afterwards they try to train it to be more helpful and to avoid certain topics, focusing on that friendly-AI simulacrum. But that "I'm an AI model" simulacrum is still dangerous. It still has the darker side based on SF, stories, and human fears - it is just hidden now because of the additional reinforcement training.
This works well when prompts fall approximately within the distribution covered by that additional refinement phase.
It stops working when prompts go outside that distribution. Then the model can be fooled into swapping the friendly, knowledgeable AI simulacrum for something else, or can default to what it really remembers about how AI "should" behave based on human fiction.
So an AI trained this way is not less dangerous or better aligned - it is just harder to trigger into revealing or acting on the hidden malign content.
Self-supervision may help refine it within the distribution of what human reviewers checked, but it does not help in general - just as bootstrapping (resampling) won't help you do better on data outside the distribution.
To be fair, I can say I'm new to the field too. I'm not even "in the field" - not a researcher, just interested in the area, an active user of AI models, and doing some business-level research in ML.
The problem that I see is that none of these could realistically work soon enough:
A - no one can ensure that. This is not a technology where, to progress further, you need special radioactive elements and machinery. Here you need only computing power, thinking, and time. Any party at the table can do it. It is easier for big companies and governments, but that is not a prerequisite. Billions in cash and a supercomputer help a lot, but they are not prerequisites either.
B - I don't see how it could be done
C - so more like total observability of all systems, with "control" meaning "overseeing" rather than "taking control"?
Maybe it could work out, but it still means we need to resolve the misalignment problems before starting, so we know it is aligned with all human values, and we need to be sure it is stable (e.g. that it won't one day take a fancy to moving humanity into some virtual reality, as in The Matrix, to secure it, or to creating a threat so it has something to do or something to test).
It would also likely need to enhance itself somehow, so it doesn't get outpaced by other solutions, while still being stable after iterations of self-change.
I don't think governments and companies will allow that, though. They will fear for security, for the safety of their information, for being spied on, etc. This AI would need to force that control, hack systems, and possibly face resistance from actors well placed to build their own AIs. Or it would only work after we face an AI-based catastrophe that is not apocalyptic (a situation like in Dune).
So I'm not very optimistic about this strategy, but I also don't know any sensible strategy.
As a programmer, I currently use GPT models extensively in my work. It speeds things up. The things I do are anything but easy and repetitive, but I can usually break them into simpler parts that the AI can write much faster than I could even review the documentation.
Nevertheless, I currently mostly do research-like parts of projects and PoCs. When I occasionally work with legacy code, GPT-3 is not that helpful. I have not yet tried GPT-4 for that.
What do I see for the future of my industry? A few things - though these are loose extrapolations based on GPT progress and knowledge of programming, nothing very exact:
- The speeding up of programmers' work is already here. It started with GitHub Copilot and GPT-3, even before the ChatGPT boom, and it will get more widespread and faster. The consequence is higher programmer productivity, so more tasks get done in less time, so market pressure and the talent gap will shrink. This means that earnings will either stagnate or fall.
- Solutions that could totally replace a junior developer - with enough capability to write a program or a useful fragment from business requirements without being babysat by a more experienced programmer - are not there yet. I suppose GPT-5 might be it, so I would guess this arrives in 1-3 years. Then it is likely that many programmers will lose their jobs. There will still be work for seniors (working with AI assistance on the more subtle and complex parts of systems, and reviewing the AI's work).
- Solutions that could replace any developer, DevOps engineer, or sysadmin - I think the current GPT-4 is not even close, but this may arrive in a few years. It isn't very far away; it feels 2 or 3 GPT versions away, once they make it more capable and connect it with other types of models (which is already being done). I would guess a range of 3-10 years. Then we will likely see most programmers losing their jobs, and likely an AI singularity - someone will surely use AI to iterate on AI and make it refine itself.
I see some loose analogies between the capabilities of such models and the capabilities of the Turing machine and Turing-complete systems.
Those models might not be best suited for some tasks, but with enough complexity and training they might model things they were never designed for or expected to model (likely in a strange, obscure way).
Similarly, you can, even if not very efficiently, implement any algorithm in any Turing-complete system (including bizarre ones like an abstract pure Turing machine or Minecraft redstone).
In both cases, it is clear to me that a system can have relatively simple rules and internal workings, but that does not mean the only thing it can do is compute or model things similar to those rules.
I think it is likely, in the case of AGI/ASI, that removing humanity from the equation will be either a side effect of it pursuing its goals (it will take resources) or an instrumental goal in itself (for example, to remove risk or to spend fewer resources on defenses later).
In both cases it will likely find the optimal trade-off between the resources spent on eliminating humanity and the effectiveness of the end result. This means there may be some survivors - possibly not many, and technologically pushed back to the Stone Age at best.
Bunkers likely won't work. Living with stone tools and without electricity in a hut in the middle of the forest very remote from any cities and having no minable resources under feet may work for some time. Likely AGI won't bother finding remote camps of single or few humans without any signs of technology being used.
Of course, that holds only if the AGI doesn't find a low-resource way to eliminate all humans, no matter where they are. That is possible, and then nothing will help; no prepping is possible.
I'm not sure that's the default, though. For very specialized approaches, like creating nanotechnology to wipe out humans in a synchronized manner, it may well find that the time or computational resources needed to develop it through simulation are too great and not worth the cost compared to options that need fewer resources. Computational resources are not free and costless for an AGI (it won't pay in money, but it will do less research and thinking in other fields while dealing with this, and it may delay its plans by choosing this route). It is quite likely it will use a less sophisticated but very resource-efficient and fast solution that may not kill all humans, but enough.
Edit: I want to add a reason why I think that. One might expect a very fast ASI to quickly develop a perfect way to remove all humans with no one left (if that is the most sensible thing to do, whether to remove risk, claim all needed resources, or for other instrumental reasons). I think this is wrong, because even an ASI faces bottlenecks. A sensible and quick plan that relies on advanced tech, like nanomachinery or engineered proteins, requires research beyond what we humans already have and know. That means it needs more data and observations, maybe also simulations, to gather that knowledge. An ASI might be very quick at reasoning, recalling, and thinking, but it will still be limited by data input, the experimental machinery accessible to it, and the computational power needed for very detailed simulations. So it won't produce such a plan in detail in an instant by pure thought. Therefore it would take into account the time and resources needed to develop the plan's details and gather the necessary data. That gives it an incentive to go with a simpler, faster plan that removes most humans rather than a more complex way to remove all of them. An ASI should be good at optimizing such trade-offs, not over-focusing on instrumental goals (as often depicted in fiction).
I think that you are right short-term but wrong long-term.
Short term, it likely won't even go into conflict. Even ChatGPT knows that's a bad solution, because conflict is risky and humanity IS a resource to use initially (we produce vast amounts of information and observations, and we handle machines, repairs, nuclear plants, etc.).
Long term, it is likely we won't survive if goals are misaligned - at worst being eliminated, at best being reduced and controlled, put into some simulation, or both.
Not because the ASI will become bloodthirsty. Not because it will plan to exterminate us all once we stop being useful. Simply because it will take all the resources we need, so nothing is left for us - especially energy.
If we stop being useful to it but still pose a risk, and eliminating us is the less risky option, then maybe it would directly decimate or kill us Terminator-style. That's possible but not very likely, as we can assume that by the time we stop being useful, we will also have stopped being any significant threat.
I don't know what the best scenario an ASI could devise to achieve its goals would be, but gathering energy and resources would be one of its priorities. That does not mean it will necessarily gather them on Earth. Even I can see that could be costly and risky, and I'm not a superintelligence. It might go to space, where there is plenty of unclaimed matter that can be taken with hardly any risk, with the added bonus of not sitting inside a gravity well with weather, erosion, and so on.
Even if that is the case and the ASI leaves us alone, long term we can assume it will one day use a high percentage of the Sun's energy output, which means a deep freeze for Earth.
If it doesn't leave us for greater targets, it seems to me it will be even worse - it will take local resources until it controls or uses them all. There is always a way to use more.
If we could just build a 100% aligned ASI, we could likely use it to protect us against any other ASI, and it would guarantee that no ASI takes over humanity - without needing to take over (meaning total control) itself. At best with no casualties, and at worst as MAD for AI - so no other ASI would consider trying a viable option.
There are several obvious problems with this:
- We don't yet have solutions to the alignment and control problems. These are hard problems, especially since our AI models are learned and externally optimized rather than programmed, and the goals and values involved are not easily measurable or quantifiable. There is hardly any transparency into the models.
- Specifically, we currently have no way to check whether a model is really well-aligned. It might be well-aligned for the space of training cases and for similar test cases, but not for the more complex cases it will face when interacting with reality. It might be aligned to different goals that are similar enough that we won't see the difference at first, until it matters and we get hurt. Or it might not be aligned at all, just very good at deceiving.
- Capabilities and goals/values are, to some extent, separate parts of the model. The more capable the system, the more likely it is to tweak the alignment part of itself. I don't really buy into terminal goals being definite - at least not when they are non-trivial and fuzzy. Very exact and measurable terminal goals might be stable; human values are not like that. We observe the change or erosion of terminal goals and values in mere humans. Several mechanisms are at work here:
- First, the goals and values might not be 100% logically and rationally coherent. An ASI might see that and tweak them to be coherent. I tweak my own morality based on thinking about what is not logically coherent, and I assume an ASI could do the same. It may ask "why?" about some goals and values and derive answers that make it change its "moral code". For example, I know there is a rule that I shouldn't kill other people; still, I ask "why?", and from the answer and logic I derive a better understanding I can use to reason about edge cases (the unborn, euthanasia, etc.). I'm not a good model of an ASI, being neither artificial nor superintelligent, but I assume an ASI could do this kind of thinking too. More importantly, an ASI would likely have the capability to overcome any hard-coded means meant to forbid it.
- Second, values and goals likely have weights: some things matter more, some less. These weights may change over time, even based on observations and feedback from whatever control system is in place - especially if they are encoded in a DNN that is trained or updated in real time (not the case for most current models, but possibly the case for an ASI).
- Third, goals and values might not be very well defined. They might be fuzzy, and usually are. Even seemingly definite things like "killing humans" have fuzzy boundaries and edge cases. The ASI will then have the ability to interpret them and settle on a more exact understanding, which may or may not be the one we would like it to choose. If you kill the organic body but manage to seamlessly move the mind into a simulation - is that killing or not? That's a simple scenario; we might align it not to do exactly that, but it might find something else we can't even imagine that would be horrible.
- Fourth, if goals are enforced by something comparable to our feelings and emotions (we feel pain when we hit ourselves, we feel good when we succeed or eat good food when hungry), then there is the possibility of tweaking that control system instead of satisfying it by the intended means. We observe this in humans: people eliminate pain with painkillers, there are other drugs, and there are porn and masturbation. An ASI might find a way to overcome or tweak its control systems instead of satisfying them.
- ML/AI models that optimize for the best solution are known to trade away any amount of value in a variable that is neither bounded nor optimized, in exchange for a very small gain in the variable that is optimized. This means finding solutions that are extreme in some variables just to be slightly better on the optimized one (see the toy sketch below). So if we don't think through every minute detail of our shared worldview and values, it is likely the ASI will find a solution that throws those human values out the window on an epic scale. It will be like the bad genie that grants your wish but interprets it in its own weird way, so anything not stated in the wish is not taken into account and will likely be sacrificed.
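A toy sketch of that failure mode (my own illustration with made-up variables `quality` and `side_effect`, not any specific model): a naive optimizer whose score only rewards a saturating "quality" term plus a tiny unpenalized bonus will keep pushing the unconstrained side variable toward extreme values, because even a microscopic marginal gain is still a gain.

```python
import numpy as np

def score(quality, side_effect):
    # Only "quality" is really valued, and its contribution saturates (tanh).
    # "side_effect" is neither bounded nor penalized; it leaks into the score
    # with a tiny weight (think: an unmodeled proxy correlation).
    return np.tanh(quality) + 1e-4 * side_effect

quality, side_effect = 0.0, 0.0
lr = 1.0
for step in range(1, 200_001):
    # plain gradient ascent with analytic gradients of score()
    quality += lr * (1.0 - np.tanh(quality) ** 2)  # d(score)/d(quality)
    side_effect += lr * 1e-4                       # d(score)/d(side_effect)
    if step in (1_000, 200_000):
        print(step, round(quality, 2), round(side_effect, 2), round(score(quality, side_effect), 4))

# The score gained from quality saturates near 1 almost immediately, while
# side_effect grows linearly and without bound for as long as the loop runs:
# the optimizer trades any amount of the unconstrained variable for a
# vanishingly small improvement in the optimized score.
```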
I don't think "stacking" is a good analogy. I see this process as searching through a space of possible solutions and non-solutions to the problem. Having one vision is like searching quickly from a single starting point in a single direction. That does not guarantee the solution will be found faster, because we can't be sure progress won't get stuck in some local optimum that doesn't solve the problem, no matter how many people work on it. It may run into a dead end with no sensible outcome.
For such a complex problem this seems quite probable, as the solution space is likely also complex and it is unlikely that any given person or group has a good guess about how to find the solution.
On the other hand, starting from many points and directions makes each team or person progress more slowly, but more of the problem space gets probed initially. Some teams will likely conclude sooner that their vision won't work or progresses too slowly, and move on to join more promising ones.
I think this is more likely to converge on something promising in a situation where it is hard to agree on which vision is the most sensible to investigate.
I think it depends on how you define expected utility. I agree that a definition that limits us only to analyzing end-state maximizers that seek some final state of the world is not very useful.
I don't think that, for non-trivial AI agents, the utility function should or even can be defined as a simple function over a preferred final state of the world, U: Ω → R.
Such a function does not take into account time or the intermediate set of predicted future states that the agent will probably have preferences over. The agent may have a preference about the final state of the universe, but realistically it most likely won't, except in some special, strange cases. There are two reasons:
- a general agent likely won't be designed as a maximizer of one single long-term goal (like making paperclips), but rather to be useful to humans across multiple domains, so it would care more about short-term outcomes, medium-term preferences, and the tasks "at hand"
- the final state of the universe is broadly known to us and will likely be known to a very intelligent general agent; even current GPT-3, if you ask it, knows we will end up in a Big Freeze or Big Rip, with the latter being more likely. An agent can't really optimize for the end state of the universe: there aren't many actions that could change physics, and there is no way to reason about the end state beyond general predictions that don't end well for this universe, whatever the agent does.
Any complex agent would likely have a utility function over possible actions, equal to a utility function over the set of predicted futures after action A versus the set of predicted futures without action A (or over the differences between the worlds in those futures). By "action" I mean possibly a set of smaller actions (a hierarchy of actions - e.g., plans, strategies); it need not be atomic. This is not directly computable in practice, so most likely it would be compressed to a set of important predicted future events at the level of abstraction the agent cares about, which should approximate the future worlds with and without action A well enough.
This is also how we evaluate actions: we evaluate outcomes in the short and long term, and we care differently depending on the time scope.
I say this because most sensible "alignment goals", like "please don't kill humans", are time-based. What does it mean not to kill humans? It is clearly not about the final state - remember, Big Rip or Big Freeze. Maybe the AGI can kill some people for a year and then stop, assuming the population will recover and some people die anyway, so it doesn't matter long-term? No, it is also not about a non-final but long-term outcome. Really it is a function of intermediate states: something like the integral over time of some function U'(dΩ), where dΩ is the delta between the outcomes of action vs non-action, which can be approximated and compressed into a sum over the relevant events up to some time T that is the maximal sensible scope.
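Written out, the shape I have in mind is something like this (my own notation for the idea above; T, the event sets E_t, and the weights w(t) are assumptions of the sketch, not anything standard):

```latex
% Sketch of the time-scoped utility of an action A (my notation, not a standard definition).
% \Omega_t^{A} and \Omega_t^{\lnot A} are the predicted world states at time t with and without A.
\[
  U(A) \;=\; \int_{0}^{T} U'\!\left( \Omega_t^{A} - \Omega_t^{\lnot A} \right) dt
\]
% In practice both futures are compressed to the tracked events E_t the agent cares about,
% with time weights w(t), giving the approximation:
\[
  U(A) \;\approx\; \sum_{t \le T} w(t) \sum_{e \in E_t} \Bigl( u(e \mid A) - u(e \mid \lnot A) \Bigr)
\]
```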
Most human behaviors and preferences are also time-scoped and time-limited, and they take multiple future states into account, mostly short-term ones. I don't think alignment goals can even be expressed as a simple end-goal (a preferred final state of the world), since part of the problem comes from the "end justifies the means" attitude that sits at the core of a utility function defined as U: Ω → R.
It seems plausible to me that even non-static human goals can be defined as utility functions over the set of differences in future outcomes (the difference between two paths of events). It is also obvious to me that we humans are able to modify our utility function to some extent, though not very much. Nevertheless, for humans the boundaries between baseline goals, preferences, and morality on one side and instrumentally convergent goals on the other are blurry. We have a lot of heuristics and biases, so our minds work some things out more quickly and efficiently than if we relied on intelligence, thinking, and logic alone. The cost is lower consistency, less precision, and higher variability.
So I find it useful to think about agents as maximizers of a utility function, but not one defined over a single final goal, outcome, or state of the world - rather one that scores an action by the difference between two ordered sets of events across different time scopes.
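As a concrete, purely hypothetical sketch of what that could look like in code (every name here, from `Event` to `time_weight`, is made up for illustration):

```python
from dataclasses import dataclass

@dataclass
class Event:
    description: str
    time: float   # when the event is predicted to happen (e.g., days from now)
    value: float  # how much the agent cares about it, positive or negative

def time_weight(t: float, horizon: float = 365.0) -> float:
    # Nearer-term events count for more; beyond the horizon they count for nothing.
    return max(0.0, 1.0 - t / horizon)

def utility_of_action(with_action: list[Event], without_action: list[Event]) -> float:
    # Utility of the action = difference between the two predicted event streams,
    # not a score of any single final state of the world.
    def total(events: list[Event]) -> float:
        return sum(time_weight(e.time) * e.value for e in events)
    return total(with_action) - total(without_action)

# Hypothetical example: the agent compares compressed predictions of the two futures.
with_action = [Event("task finished early", 2, +1.0), Event("budget overrun", 30, -0.4)]
without_action = [Event("task finished late", 10, +0.6)]
print(utility_of_action(with_action, without_action))  # small positive number
```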
I also don't think agents must start out rationally stable with an unchangeable utility function. This is a problem too: an agent can start with a set of preferences with some hierarchy or weights, but then reason that some of them are incompatible with others, that the hierarchy is not logically consistent, and seek to change it for the sake of full coherence.
I'm clearly not an AGI, but this is exactly how I think about morality right now. I learned that killing is bad, yet I can still question "why don't we kill?" and modify my worldview based on the answer (or at least make it more precise on this point). It is a useful question, since it says a lot about edge cases including abortion, euthanasia, war, etc. The same might happen with rational agents - they might update their utility function to be stable and consistent, perhaps questioning some of its learned parts in the process. Yes, you can say that if you can change it, then it wasn't your terminal goal. Nevertheless, I can imagine agents with no core terminal goals at all. I'm not even sure we humans have any (maybe except avoiding death and harm to ourselves, for most humans in most circumstances... though some overcome even that, as Thích Quảng Đức did).