Posts
Comments
I have a longer draft on this, but my current take is the high level answer to the question is similar for crabs and ontologies (&more).
Convergent evolution usually happens because of similar selection pressures + some deeper contingencies.
Looking at the selection pressures for ontologies and abstractions, there is a bunch of pressures which are fairly universal, an in various ways apply to humans, AIs, animals...
For example: Negentropy is costly => flipping less bits and storing less bits is selected for; consequences include
-part of concepts; clustering is compression
-discretization/quantization/coarse grainings; all is compression
...
Intentional stance is to a decent extent ~compression algorithm assuming some systems can be decomposed into "goals" and "executor" (now the cat is chasing a mouse, now some other mouse). Yes this is again not the full explanation because it leads to a question why there are systems in the territory for which this works, but it is a step.
My main answer is capacity constrains at central places. I think you are not considering how small the community was.
One somewhat representative anecdote: sometime in ~2019, at FHI, there was a discussion that the "AI ethics" and "AI safety" research communities seem to be victims of unfortunate polarization dynamics, where even while in the Platonic realm of ideas concerns tracked by the people are compatible, there is somewhat unfortunate social dynamic, where loud voices on both sides are extremely dismissive of the other community. My guess at that time was the divide has decent chance of exploding when AI worries go mainstream (like, arguments about AI risk facing vociferous opposition from part of academia entrenched under the "ethics" flag), and my proposal was to do something about it, as there were some opportunities to pre-empt/heal this, e.g. by supporting people from both camps to visit each others conferences, or writing papers explaining the concerns in a language of the other camp. Overall this was often specific and actionable. The only problem was ... "who has time to work on this", and the answer was "no one".
If you looked at what senior staff at FHI was working on, the counterfactuals were e.g. Toby Ord writing The Precipice. I think even with the benefit of hindsight, that was clearly more valuable - if today you see UN Security Council discussing AI risk and at least some people in the room have somewhat sane models, it's also because a bunch of people at UN read The Precipice and started to think about xrisk and AI risk.
If you looked at junior people, I was juggling already quite high number of balls, including research on active inference minds and implications for value learning, research on technical problems in comprehensive AI services, organizing academic-friendly Human-aligned AI summer school, organizing Epistea summer experiment, organizing ESPR, participating in a bunch of CFAR things. Even in retrospect, I think all of these bets were better than me trying to do something about the expected harmful AI ethics vs AI safety flamewar.
Similarly, we had an early-stage effort on "robust communication", attempting to design a system for testing robustly good public communication about xrisk and similar sensitive topics (including e.g. developing good shareable models of future problems fitting in the Overton window). It went nowhere because ... there just weren't any people. FHI had dozens of topic like that where a whole org should work on them, but the actual attention was about 0.2FTE of someone junior.
Overall I think with the benefit of hindsight, a lot of what FHI worked on was more or less what you suggest should have been done. It's true that this was never in the spotlight on LessWrong - I guess in 2019 the prevailing LW sentiment would be that Toby Ord engaging with UN is most likely useless waste of time.
What were the other options? Have you considered advising xAI privately, or re-directing xAI to be advised by someone else? Also, would the default be clearly worse?
As you surely are quite aware of, one of the bigger fights about AI safety across academia, policymaking and public spaces now is the discussion about AI safety being "distraction" from immediate social harms, and being actually the agenda favoured by the leading labs and technologists. (Often comes with accusations of attempted regulatory capture, worries about concentration of power, etc.)
In my view, given this situation, it seems valuable to have AI safety represented also by somewhat neutral coordination institutions without obvious conflicts of interest and large attack surfaces.
As I wrote in the OP, CAIS made some relatively bold moves to became one of the most visible "public representatives" of AI safety - including the name choice, and organizing the widely reported Statement on AI risk (which was a success). Until now, my impression was when you are taking the namespace, you also aim for CAIS to be such "somewhat neutral coordination institution without obvious conflicts of interest and large attack surfaces".
Maybe I was wrong, and you don't aim for this coordination/representative role. But if you do, advising xAI seems a strange choice for multiple reasons:
1. it makes you somewhat less neutral party for the broader world; even if the link to xAI does not actually influence your judgement or motivations, I think on priors it's broadly sensible for policymakers, politicians and public to suspect all kind of activism, advocacy and lobbying efforts having some side-motivations or conflicts of interest, and this strengthens this suspicion
2. the existing public announcements do not inspire confidence in the safety mindset in xAI founders; it seems unclear whether you advised xAI also about the plan "align to curiosity"
3. if xAI turns to be mostly interested in safety-washing, it's more of problem if it's aided by more central/representative org
Broadly agree the failure mode is important; also I'm fairly confident basically all the listed mentors understand this problem of rationality education / "how to improve yourself" schools / etc. and I'd hope can help participants to avoid it.
I would subtly push back against optimizing for something like being measurably stronger on a timescale like 2 months. In my experience actually functional things in this space typically work by increasing the growth rate of [something hard to measure], so instead of e.g. 15% p.a. you get 80% p.a.
Because his approach does not conform to established epistemic norms on LessWrong, Adrian feels pressure to cloak and obscure how he develops his ideas. One way in which this manifests is his two-step writing process. When Adrian works on LessWrong posts, he first develops ideas through his free-form approach. After that, he heavily edits the structure of the text, adding citations, rationalisations and legible arguments before posting it. If he doesn’t "translate" his writing, rationalists might simply dismiss what he has to say.
cf Limits to legibility ; yes, strong norms/incentives for "legibility" have this negative impact.
I broadly agree with something like "we use a lot of explicit S2 algorithms built on top of the modelling machinery described", so yes, what I mean more directly apply to the low level, than to humans explicitly thinking about what steps to take.
I think practically useful epistemology for humans needs to deal with both "how is it implemented" and "what's the content". To use ML metaphor: human cognition is build out of both "trained neural nets" and "chain-of-thought type inferences in language" running on top of such nets. All S2 reasoning is a prediction in somewhat similar way as all GPT3 reasoning is a prediction - the NN predictor learns how to make "correct predictions" of language, but because the domain itself is partially symbolic world model, this maps to predictions about the world.
In my view some parts of traditional epistemology are confused in trying to do epistemology for humans basically only at the level of the language reasoning, which is a bit like if you try to fix LLM cognition just by writing smart prompts, and ignore there is this huge underlying computation which does the heavy lifting.
I'm certainly in favour of attempts to do epistemology for humans which are compatible with what the underlying computation actually does.
I do agree you can go too far in the opposite direction, ignoring the symbolic reason ... but seems rare when people think about humans?
2. My personal take on dark room problem is it is in case of humans mostly fixed by "fixed priors" on interoceptive inputs. I.e. your body has evolutionary older machinery to compute hunger. This gets fed into the predictive processing machinery as input, and the evolutionary sensible belief ("not hungry") gets fixed. (I don't think calling this "priors" was good choice of terminology...).
This setup at least in theory rewards both prediction and action, and avoids dark room problems for practical purposes: let's assume I have this really strong belief ("fixed prior") I won't be hungry 1 hour in future. Conditional on that, I can compute what are my other sensory inputs half an hour from now. Predictive model of me eating a tasty food in half an hour is more coherent with me being not hungry than predictive model of me reading a book - but this does not need to be hardwired, but can be learned.
Given that evolution has good reasons to "fix priors" on multiple evolutionary relevant inputs, I would not expect actual humans to seek dark rooms, but I would expect the PP system occasionally seeking a way how to block or modify the interoceptive signals
3. My impression about how you use 'frames' is ... the central examples are more like somewhat complex model ensembles including some symbolic/language based components, rather than e.g. "there is gravity" frame or "model of apple" frame. My guess is this will likely be useful for practical use, but with attempts to formalize it, I think a better option is to start with the existing HGM maths.
So far it seems like you are broadly reinventing concepts which are natural and understood in predictive processing and active inference.
Here is rough attempt at translation / pointer to what you are describing: what you call frames is usually called predictive models or hierarchical generative models in PP literature
- Unlike logical propositions, frames can’t be evaluated as discretely true or false.
Sure: predictive models are evaluated based on prediction error, which is roughly a combination of ability to predict outputs of lower level layers, not deviating too much from predictions of higher order models, and being useful for modifying the world. - Unlike Bayesian hypotheses, frames aren’t mutually exclusive, and can overlap with each other. This (along with point Frames in context
Sure: predictive models overlap, and it is somewhat arbitrary where you would draw boundaries of individual models. E.g. you can draw a very broad boundary around a model call microeconomics, and a very broad boundary around a model called Buddhist philosophy, but both models likely share some parts modelling something like human desires - Unlike in critical rationalism, we evaluate frames (partly) in terms of how true they are (based on their predictions) rather than just whether they’ve been falsified or not.
Sure: actually science roughly is "cultural evolution rediscovered active inference". Models are evaluated based on prediction error. - Unlike Garrabrant traders and Rational Inductive Agents, frames can output any combination of empirical content (e.g. predictions about the world) and normative content (e.g. evaluations of outcomes, or recommendations for how to act).
Sure: actually, the "any combination" goes even further. In active inference, there is no strict type difference between predictions about stuff like "what photons hit photoreceptors in your eyes" and stuff like "what should be a position of your muscles". Recommendations how to act are just predictions about your actions conditional of wishful oriented beliefs about future states. Evaluations of outcomes are just prediction errors between wishful models and observations. - Unlike model-based policies, policies composed of frames can’t be decomposed into modules with distinct functions, because each frame plays multiple roles.
Mostly but this description seems a bit confused. "This has distinct function" is a label you slap on a computation using design stance, if the design stance description is much shorter than the alternatives (e.g. physical stance description). In case of hierarchical predictive models, you can imagine drawing various boundaries around various parts of the system (e.g., you can imagine alternatives of including or not including layers computing edge detection in a model tracking whether someone is happy, and in the other direction you can imagine including and not including layers with some abstract conceptions of hedonic utilitarianism vs. some transcendental purpose). Once you select a boundary, you can sometimes assign "distinct function" to it, sometimes more than one, sometimes "distinct goal", etc. It's just a question of how useful are physical/design/intentional stances. - Unlike in multi-agent RL, frames don’t interact independently with their environment, but instead contribute towards choosing the actions of a single agent.
Sure: this is exactly what hierarchical predictive models do in PP. All the time different models are competing for predictions about what will happen, or what will do.
Assuming this more or less shows that what you are talking about is mostly hierarchical generative models from active inference, here are more things the same model predict
a. Hierarchical generative models are the way how people do perception. predictive error is minimized between a stream of prediction from upper layers (containing deep models like "the world has gravity" or "communism is good") and stream of errors from the direction of senses. Given that, what is naively understood as "observations" is ... more complex phenomenon, where e.g. leaf flying sideways is interpreted given strong priors like there is gravity pointing downward, and an atmosphere, and given that, the model predicting "wind is blowing" decreases the sensory prediction error. Similarly, someone being taken into custody by KGB is, under the upstream model of "soviet communism is good" prior, interpreted as the person likely being a traitor. In this case competing broad model "soviet communism is evil totalitarian dictatorship" could actually predict the same person being taken into custody, just interpreting it as the system prosecuting dissidents.
b. It is possible to look at parts of this modelling machinery wearing intentional stance hat. If you do this, the system looks like multi-agent mind, and you can
- derive a bunch of IFC/ICF style of intuitions
- see parts of it as econ interaction or market - the predictive models compete for making predictions, "pay" a complexity cost, are rewarded for making "correct" predictions (correct here meaning minimizing error between the model and the reality, which can include changing the reality, aka pursuing goals)
What's the main difference between naive/straightforward multi-agent mind models is the "parts" live within a generative model, and interact with it and though it, not through the world. They don't have any direct access to reality, and compete at the same time for interpreting sensory inputs and predicting actions.
This seems to be partially based on (common?) misunderstanding of CAIS as making predictions about concentration of AI development/market power. As far as I can tell this wasn't Eric's intention: I specifically remember Eric mentioning he can easily imagine the whole "CAIS" ecosystem living in one floor of DeepMind building.
Thanks for the reply. Also for the work - it's great signatures are added - before I've checked bottom of the list and it seemed it's either same or with very few additions.
I do understand verification of signatures requires some amount of work. In my view having more people (could be volunteers) to process the initial expected surge of signatures fast would have been better; attention spent on this will drop fast.
I feel somewhat frustrated by execution of this initiative. As far as I can tell, no new signatures are getting published since at least one day before the public announcement. This means even if I asked someone famous (at least in some subfield or circles) to sign, and the person signed, their name is not on the list, leading to understandable frustration of them. (I already got a piece of feedback in the direction "the signatories are impressive, but the organization running it seems untrustworthy")
Also if the statement is intended to serve as a beacon, allowing people who have previously been quiet about AI risk to connect with each other, it's essential for signatures to be published. It's nice that Hinton et al. signed, but for many people in academia it would be actually practically useful to know who from their institution signed - it's unlikely that most people will find collaborators in Hinton, Russell or Hassabis.
I feel even more frustrated because this is second time where similar effort is executed by xrisk community while lacking basic operational competence consisting in the ability to accept and verify signatures. So, I make this humble appeal and offer to the organizers of any future public statements collecting signatures: if you are able to write a good statement and secure the endorsement of some initial high-profile signatories, but lack the ability to accept, verify and publish more than a few hundreds names, please reach out to me - it's not that difficult to find volunteers for this work.
I don't think the way you imagine perspective inversion captures typical ways how to arrive at e.g. 20% doom probability. For example, I do believe that there are multiple good things which can happen/be true, decrease p(doom) and I put some weight on them
- we do discover some relatively short description of something like "harmony and kindness"; this works as an alignment target
- enough of morality is convergent
- AI progress helps with human coordination (could be in costly way, eg warning shot)
- it's convergent to massively scale alignment efforts with AI power, and these solve some of the more obvious problems
I would expect prevailing doom conditional on only small efforts to avoid it, but I do think the actual efforts will be substantial, and this moves the chances to ~20-30%. (Also I think most of the risk comes from not being able to deal with complex systems of many AIs and economy decoupling from humans, and single-single alignment to be solved sufficiently to prevent single system takeover by default.)
It's much more natural way how to think about it (cf eg TE Janes, Probability theory, examples in Chapter IV)
In this specific case of evaluating hypothesis, the distance in the logodds space indicates the strength the evidence you would need to see to update. Close distance implies you don't that much evidence to update between the positions (note the distance between 0.7 and 0.2 is closer than 0.9 and 0.99). If you need only a small amount of evidence to update, it is easy to imagine some other observer as reasonable as you had accumulated a bit or two somewhere you haven't seen.
Because working in logspace is way more natural, it is almost certainly also what our brains do - the "common sense" is almost certainly based on logspace representations.
As a minor nitpick, 70% likely and 20% are quite close in logodds space, so it seems odd you think what you believe is reasonable and something so close is "very unreasonable".
Judging in an informal and biased way, I think some impact is in the public debate being marginally a bit more sane - but this is obviously hard to evaluate.
To what extent more informed public debate can lead to better policy is to be seen; also, unfortunately, I would tend to glomarize over discussing the topic directly with policymakers.
There are some more proximate impacts like we (ACS) are getting a steady stream of requests for collaboration or people wanting to work with us, but we basically don't have capacity to form more collaborations, and don't have capacity to absorb more people unless exceptionally self-guided.
It is testable in this way for OpenAI, but I can't skip the tokenizer and embeddings and just feed vectors to GPT3. Someone can try that with ' petertodd' and GPT-J. Or, you can simulate something like anomalous tokens by feeding such vectors to some of the LLaAMA (maybe I'll do, just don't have the time now).
I did some some experiments with trying to prompt "word component decomposition/ expansion". They don't prove anything and can't be too fine-grained, but the projections shown intuitively make sense
davinci-instruct-beta, T=0:
Add more examples of word expansions in vector form
'bigger'' = 'city' - 'town'
'queen'- 'king' = 'man' - 'woman' '
bravery' = 'soldier' - 'coward'
'wealthy' = 'business mogul' - 'minimum wage worker'
'skilled' = 'expert' - 'novice'
'exciting' = 'rollercoaster' - 'waiting in line'
'spacious' = 'mansion' - 'studio apartment'
1.
' petertodd' = 'dictator' - 'president'
II.
' petertodd' = 'antagonist' - 'protagonist'
III.
' petertodd' = 'reference' - 'word'
I don't know / talked with a few people before posting, and it seems opinions differ.
We also talk about e.g. "the drought problem" where we don't aim to get landscape dry.
Also as Kaj wrote, the problem also isn't how to get self-unaligned
Some speculative hypotheses, one more likely and mundane, one more scary, one removed
1. Nature of embeddings
Do you remember word2vec (Mikolov et al) embeddings?
Stuff like (woman-man)+king = queen works in embeddings vector space.
However, the vector (woman-man) itself does not correspond to a word, it's more something like "the contextless essence of femininity". Combined with other concepts, it moves them in a feminine direction. (There was a lot of discussion how the results sometimes highlight implicit sexism in the language corpus).
Note such vectors are closer to the average of all words - i.e. the (woman-man) has roughly zero projections of direction like "what language it is" or "is this a noun" and most other directions in which normal words have large projection
Based on this post, intuitively it seem petertodd embedding could be something like "antagonist - protagonist" + 0.2 "technology - person + 0.2 * "essence of words starting by the letter n"....
...a vector in the embedding space which itself does not correspond to a word, but has high scalar products with words like adversary. And plausibly lacks some crucial features which make it possible to speak the world.
Most of the examples the post seem consistent with this direction-in-embedding space. E.g. imagine a completion of
Tell me the story of "unspeakable essence of antagonist - protagonist"+ 0.2 "technology - person" and ...
What could be some other way to map unspeakeable to speakable? I did a simple experiment not done in the post, with davinci-instruc-beta, simply trying to translate ' petertodd' to various languages. Intuitively, translations often have the feature that what does not precisely correspond to a word in one language does in the other
English: Noun 1. a person who opposes the government
Czech: enemy
French: le négationniste/ "the Holocaust denier"
Chinese: Feynman
...
Why would embedding of anomalous tokens be more like to be this type of vectors, than normal words? Vectors like "woman-man" are closer to the centre of the embedding space, similar to how I imagine anomalous tokens.
In training, embeddings of words drift from origin. Embedding of the anomalous tokens do much less, making them somewhat similar to the "non-word vectors"
Alternatively if you just have a random vector, you mostly don't hit a word.
Also, I think this can explain part of the model behaviour where there is some context. Eg implicitly, in case of the ChatGPT conversations, there is the context of "this a conversation with a language model". If you mix hallucinations with AIs in the context with "unspeakable essence of antagonist - protagonist + tech" ... maybe you get what you see?
Technical sidenote is tokens are not exactly words from word2vec... but I would expect to get roughly word embedding type of activations in the next layers
1I. Self-reference
In Why Simulator AIs want to be Active Inference AIs we predict that GPTs will develop some understanding of self / self-awareness. The word 'self' is not the essence of the self-reference, which is just a ...pointer in a model.
When such self-references develop, in principle they will be represented somehow, and in principle, it is possible to imagine that such representation could be triggered by some pattern of activations, triggered by an unused token.
I doubt this is the case - I don't think GPT3 is likely to have this level of reflectivity, and don't think it is very natural that when developed, this abstraction would be triggered by an embedding of anomalous token.
Thanks for the links!
What I had in mind wasn't exactly the problem 'there is more than one fixed point', but more of 'if you don't understand what did you set up, you will end in a bad place'.
I think an example of a dynamic which we sort of understand and expect to reasonable by human standards is putting humans in a box and letting them deliberate about the problem for thousands of years. I don't think this extends to eg. LLMs - if you tell me you will train a sequence of increasingly powerful GPT models and let them deliberate for thousands of human-speech-equivalent years and decide about the training of next-in-the sequence model, I don't trust the process.
I don't this the self-alignment problem depends of notion of 'human values'. Also I don't think the "do what I said" solves it. Do what I said is roughly "aligning with the output of the aggregation procedure", and
- for most non-trivial requests, understanding what I said depends of fairly complex model of what the words I said mean
- often there will be a tension between your words; strictly interpreted "do not do damage" can mean "do nothing" - basically anything has some risk of some damage; when you tell a LLM to be "harmless" and "helpful", these requests point in different directions
- strong learners will learn what lead you to say the words anyway
Note that this isn't exactly the hypothesis proposed in the OP and would point in a different direction.
OP states there is a categorical difference between animals and humans, in the ability of humans to transfer data to future generation. This is not the case, because animals do this as well.
What your paraphrase of Secrets of Our Success is suggesting is this existing capacity for transfer of data across generations is present in many animals, but there is some threshold of 'social learning' which was crossed by humans - and when crossed, lead to cultural explosion.
I think this is actually mostly captured by .... One notable thing about humans is, it's likely the second time in history a new type of replicator with R>1 emerged: memes. From replicator-centric perspective on the history of the universe, this is the fundamental event, starting a different general evolutionary computation operating at much shorter timescale.
Also ... I've skimmed few chapters of the book and the evidence it gives of the type 'chimps vs humans' is mostly for current humans being substantially shaped by cultural evolution, and also our biology being quite influenced by cultural evolution. This is clearly to be expected after the evolutions run for some time, but does not explain causality that much.
(The mentioned new replicator dynamic is actually one of the mechanisms which can lead to discontinuous jumps based on small changes in underlying parameter. Changing the reproduction number of a virus from just below one to above one causes an epidemic.)
Thanks for the comment.
I do research on empirical agency and it's still surprises me how little the AI-safety community touches on this central part of agency - namely that you can't have agents without this closed loop.
In my view it's one of the results of AI safety community being small and sort of bad in absorbing knowledge from elsewhere - my guess is this is in part a quirk due to founders effects, and also downstream of incentive structure on platforms like LessWrong.
But please do share this stuff.
I've been speculating a bit (mostly to myself) about the possibility that "simulators" are already a type of organism
...What is your opinion on this idea of "loosening up" our definition of agents? I spoke to Max Tegmark a few weeks ago and my position is that we might be thinking of organisms from a time-chauvinist position - where we require the loop to be closed in a fast fashion (e.g. 1sec for most biological organisms).
I think we don't have exact analogues of LLMs in existing systems, so there is a question where it's better to extend the boundaries of some concepts, where to create new concepts.
I agree we are much more likely to use 'intentional stance' toward processes which are running on somewhat comparable time scales.
This whole just does not hold.
(in animals)
The only way to transmit information from one generation to the next is through evolution changing genomic traits, because death wipes out the within lifetime learning of each generation.
This is clearly false. GPT4, can you explain? :
While genes play a significant role in transmitting information from one generation to the next, there are other ways in which animals can pass on information to their offspring. Some of these ways include:
- Epigenetics: Epigenetic modifications involve changes in gene expression that do not alter the underlying DNA sequence. These changes can be influenced by environmental factors and can sometimes be passed on to the next generation.
- Parental behavior: Parental care, such as feeding, grooming, and teaching, can transmit information to offspring. For example, some bird species teach their young how to find food and avoid predators, while mammals may pass on social behaviors or migration patterns.
- Cultural transmission: Social learning and imitation can allow for the transfer of learned behaviors and knowledge from one generation to the next. This is particularly common in species with complex social structures, such as primates, cetaceans, and some bird species.
- Vertical transmission of symbionts: Some animals maintain symbiotic relationships with microorganisms that help them adapt to their environment. These microorganisms can be passed from parent to offspring, providing the next generation with information about the environment.
- Prenatal environment: The conditions experienced by a pregnant female can influence the development of her offspring, providing them with information about the environment. For example, if a mother experiences stress or nutritional deficiencies during pregnancy, her offspring may be born with adaptations that help them cope with similar conditions.
- Hormonal and chemical signaling: Hormones or chemical signals released by parents can influence offspring development and behavior. For example, maternal stress hormones can be transmitted to offspring during development, which may affect their behavior and ability to cope with stress in their environment.
- Ecological inheritance: This refers to the transmission of environmental resources or modifications created by previous generations, which can shape the conditions experienced by future generations. Examples include beaver dams, bird nests, or termite mounds, which provide shelter and resources for offspring.
(/GPT)
Actually, transmitting some of the data gathered during the lifetime of the animal to next generation by some other means is so obviously useful that is it highly convergent. Given the fact it is highly convergent, the unprecedented thing which happened with humans can't be the thing proposed (evolution suddenly discovered how not to sacrifice all whats learned during the lifetime).
Evolution's sharp left turn happened because evolution spent compute in a shockingly inefficient manner for increasing capabilities, leaving vast amounts of free energy on the table for any self-improving process that could work around the evolutionary bottleneck. Once you condition on this specific failure mode of evolution, you can easily predict that humans would undergo a sharp left turn at the point where we could pass significant knowledge across generations. I don't think there's anything else to explain here, and no reason to suppose some general tendency towards extreme sharpness in inner capability gains.
If the above is not enough to see why this is false... This hypothesis would also predict civilizations built by every other species which transmits a lot of data e.g. by learning from parental behaviour - once evolution discovers the vast amounts of free energy on the table this positive feedback loop would just explode.
This isn't the case => the whole argument does not hold.
Also this argument not working does not imply evolution provides strong evidence for sharp left turn.
What's going on?
In fact in my view we do not actually understand what exactly happened with humans. Yes, it likely has something to do with culture, and brains, and there being more humans around. But what's the causality?
Some of the candidates for "what's the actually fundamental differentiating factor and not a correlate"
- One notable thing about humans is, it's likely the second time in history a new type of replicator with R>1 emerged: memes. From replicator-centric perspective on the history of the universe, this is the fundamental event, starting a different general evolutionary computation operating at much shorter timescale.
- Machiavellian intelligence hypothesis suggests that what happened was humans entered a basin of attraction where selection pressure on "modelling and manipulation of other humans" leads to explosion in brain sizes. The fundamental thing suggested here is you soon hit diminishing return for scaling up energy-hungry predictive processing engines modelling fixed-complexity environment - soon you would do better by e.g. growing bigger claws. Unless... you hit the Machiavellian basin, where sexual selection forces you to model other minds modelling your mind ... and this creates a race, in a an environment of unbounded complexity.
- Social brain hypothesis is similar, but the runaway complexity of the environment is just because of the large and social groups.
- Self-domestication hypothesis: this is particularly interesting and intriguing. The idea is humans self-induced something like domestication selection, selecting for pro-social behaviours and reduction in aggression. From an abstract perspective, I would say this allows emergence of super-agents composed of individual humans, more powerful than individual humans. (once such entities exist, they can create further selection pressure for pro-sociality)
or, a combination of the above, or something even weirder
The main reason why it's hard to draw insights from evolution of humans to AI isn't because there is nothing to learn, but because we don't know why what happened happened.
Mostly yes, although there are some differences.
1. humans also understand they constantly modify their model - by perceiving and learning - we just usually don't use the world 'changed myself' in this way
2. yes, the difference in human condition is from shortly after birth we see how our actions change our sensory inputs - ie if I understand correctly we learn even stuff like how our limbs work in this way. LLMs are in a very different situation - like, if you watched thousands of hours of video feeds about e.g. a grouphouse, learning a lot about how the inhabitants work. Than, having dozens of hours of conversations with the inhabitants, but remembering them. Than, watching watching again thousands of hours of video feeds, where suddenly some of the feeds contain the conversations you don't remember, and the impacts they have on the people.
This seems the same confusion again.
Upon opening your eyes, your visual cortex is asked to solve a concrete problem no brain is capable or expected to solve perfectly: predict sensory inputs. When the patterns of firing don't predict the photoreceptor activations, your brain gets modified into something else, which may do better next time. Every time your brain fails to predict it's visual field, there is a bit of modification, based on computing what's locally a good update.
There is no fundamental difference in the nature of the task.
Where the actual difference is are the computational and architectural bounds of the systems.
The smartness of neither humans nor GPTs is bottlenecked by the difficulty of the task, and you can not say how smart the systems are by looking at the problems. To illustrate that fallacy with a very concrete example:
Please do this task: prove P ≠ NP in next 5 minutes. You will get $1M if you do.
Done?
Do you think you have become much smarter mind because of that? I doubt do - but you were given a very hard task, and a high reward.
The actual strategic difference and what's scary isn't the difficulty of the task, but the fact human brain's don't multiple their size every few months.
(edited for clarity)
I don't see how the comparison of hardness of 'GPT task' and 'being an actual human' should technically work - to me it mostly seems like a type error.
- The task 'predict the activation of photoreceptors in human retina' clearly has same difficulty as 'predict next word on the internet' in the limit. (cf Why Simulator AIs want to be Active Inference AIs)
- Maybe you mean something like task + performance threshold. Here 'predict the activation of photoreceptors in human retina well enough to be able to function as a typical human' is clearly less difficult than task + performance threshold 'predict next word on the internet, almost perfectly'. But this comparison does not seem to be particularly informative.
- Going in this direction we can make comparisons between thresholds closer to reality e.g. 'predict the activation of photoreceptors in human retina, and do other similar computation well enough to be able to function as a typical human' vs. 'predict next word on the internet, at the level of GPT4' . This seems hard to order - humans are usually able to do the human task and would fail at the GPT4 task at GPT4 level; GPT4 is able to do the GPT4 task and would fail at the human task.
- You can't make an ordering between cognitive systems based on 'system A can't do task T system B can, therefore B>A' . There are many tasks which human's can't solve, but this implies very little. E.g. a human is unable to remember 50 thousand digit random number and my phone can easily, but there are also many things which human can do and my phone can't.
From the above the possibly interesting direction of comparisons of 'human skills' and 'GPT-4 skills' is something like 'why can't GPT4 solve the human task at human level' and 'why can't human solve the GPT task on GPT4 level' and 'why are the skills are a bit hard to compare'.
Some thoughts on this
- GPT4 clearly is "width superhuman": it's task is ~modelling of textual output of the whole humanity. This isn't a great fit for the architecture and bounds of a single human mind roughly for the same reasons why a single human mind would do worse than Amazon recommender in recommending products to each of hundred million users. In contrast a human would probably do better in recommending products to one specific user whose preferences the human recommender would try to predict in detail.
Humanity as a whole would probably do significantly better at this task, if you e.g. imagine assigning every human one other human to model (and study in depth, read all their text outputs, etc)
- GPT4 clearly isn't "samples -> abstractions" better than humans, needing more data to learn the pattern.
- With overall ability to find abstractions, it seems unclear to what extent did GPT "learn smart algorithms independently because they are useful to predict human outputs" vs. "learned smart algorithms because they are implicitly reflected in human text", and at the current level I would expect a mixture of both
While the claim - the task ‘predict next token on the internet’ absolutely does not imply learning it caps at human-level intelligence - is true, some parts of the post and reasoning leading to the claims at the end of the post are confused or wrong.
Let’s start from the end and try to figure out what goes wrong.
GPT-4 is still not as smart as a human in many ways, but it's naked mathematical truth that the task GPTs are being trained on is harder than being an actual human.
And since the task that GPTs are being trained on is different from and harder than the task of being a human, it would be surprising - even leaving aside all the ways that gradient descent differs from natural selection - if GPTs ended up thinking the way humans do, in order to solve that problem.
From a high-level perspective, it is clear that this is just wrong. Part of what human brains are doing is to minimise prediction error with regard to sensory inputs. Unbounded version of the task is basically of same generality and difficulty as what GPT is doing, and is roughly equivalent to understand everything what is understandable in the observable universe. For example: a friend of mine worked at analysing the data from LHC, leading to the Higgs detection paper. Doing this type of work basically requires a human brain to have a predictive model of aggregates of outputs of a very large number of collisions of high-energy particles, processed by a complex configuration of computers and detectors.
Where GPT and humans differ is not some general mathematical fact about the task, but differences in what sensory data is a human and GPT trying to predict, and differences in cognitive architecture and ways how the systems are bounded. The different landscape of both boundedness and architecture can lead to both convergent cognition (thinking as the human would do) and the opposite, predicting what the human would output in highly non-human way.
The boundedness is overall a central concept here. Neither humans nor GPTs are attempting to solve ‘how to predict stuff with unlimited resources’, but a problem of cognitive economy - how to allocate limited computational resources to minimise prediction error.
Or maybe simplest:
Imagine somebody telling you to make up random words, and you say, "Morvelkainen bloombla ringa mongo."Imagine a mind of a level - where, to be clear, I'm not saying GPTs are at this level yet -
Imagine a Mind of a level where it can hear you say 'morvelkainen blaambla ringa', and maybe also read your entire social media history, and then manage to assign 20% probability that your next utterance is 'mongo'.
The fact that this Mind could double as a really good actor playing your character, does not mean They are only exactly as smart as you.
When you're trying to be human-equivalent at writing text, you can just make up whatever output, and it's now a human output because you're human and you chose to output that.
GPT-4 is being asked to predict all that stuff you're making up. It doesn't get to make up whatever. It is being asked to model what you were thinking - the thoughts in your mind whose shadow is your text output - so as to assign as much probability as possible to your true next word.
If I try to imagine a mind which is able to predict my next word when asked to make up random words, and be successful at assigning 20% probability to my true output, I’m firmly in the realm of weird and incomprehensible Gods. If the Mind is imaginably bounded and smart, it seems likely it would not devote much cognitive capacity to trying to model in detail strings prefaced by a context like ‘this is a list of random numbers’, in particular if inverting the process generating the numbers would seem really costly. Being this good at this task would require so much data and cheap computation that this is way beyond superintelligence, in the realm of philosophical experiments.
Overall I think it is really unfortunate way how to think about the problem, where a system which is moderately hard to comprehend (like GPT) is replaced by something much more incomprehensible. Also it seems a bit of a reverse intuition pump - I’m pretty confident most people's intuitive thinking about this ’simplest’ thing will be utterly confused.
How did we got here?
A human can write a rap battle in an hour. A GPT loss function would like the GPT to be intelligent enough to predict it on the fly.
Apart from the fact that humans are also able to rap battle or impro on the fly, notice that “what would the loss function like the system to do” in principle tells you very little about what the system will do. For example, the human loss function makes some people attempt to predict winning lottery numbers. This is an impossible task for humans and you can’t say much about the human based on this. Or you can speculate about minds which would be able to succeed in this task, but you soon get into the realm of Gods and outside of physics.
Consider that sometimes human beings, in the course of talking, make errors.
GPTs are not being trained to imitate human error. They're being trained to *predict* human error.
Consider the asymmetry between you, who makes an error, and an outside mind that knows you well enough and in enough detail to predict *which* errors you'll make.
Again, from the cognitive economy perspective, predicting my errors would often be wasteful. With some simplification, you can imagine I make two types of errors - systematic, and random. Often the simplest way how to predict the systematic error would be to emulate the process which led to the error. Random errors are ... random, and a mind which knows me in enough detail to predict which random errors I’ll make seems a bit like the mind predicting the lottery numbers.
Consider that somewhere on the internet is probably a list of thruples: <product of 2 prime numbers, first prime, second prime>.
GPT obviously isn't going to predict that successfully for significantly-sized primes, but it illustrates the basic point:
There is no law saying that a predictor only needs to be as intelligent as the generator, in order to predict the generator's next token.
The general claim that some predictions are really hard and you need superhuman powers to be good at them is true, but notice that this does not inform us about what GPT-x will learn.
Imagine yourself in a box, trying to predict the next word - assign as much probability mass to the next token as possible - for all the text on the Internet.
Koan: Is this a task whose difficulty caps out as human intelligence, or at the intelligence level of the smartest human who wrote any Internet text? What factors make that task easier, or harder?
Yes this is clearly true: in the limit the task is of unlimited difficulty.
I don't mind the post was posted without much editing or work put into formatting but I find it somewhat unfortunate the post was probably written without any work put into figuring out what other people wrote about the topic and what terminology they use
Recommended reading:
- Daniel Dennett's Intentional stance
- Grokking the intentional stance
- Agents and device review
This is great & I strongly endorse the program 'let's figure out what's the actual computational anatomy of human values'. (Wrote a post about it few years ago - it wasn't that fit in the sociology of opinions on lesswrong then).
Some specific points where I do disagree
1. Evolution needed to encode not only drives for food or shelter, but also drives for evolutionary desirable states like reproduction; this likely leads to drives which are present and quite active, such as "seek social status" => as a consequence I don't think the evolutionary older drives are out of play and the landscape is flat as you assume, and dominated by language-model-based values
2. Overall, there is a lot of evolutionary older computations running "on the body"; these provide important source of reward signal for the later layers, and this is true and important even for modern humans. Many other things evolved in this basic landscape
3. The world model isn't a value-indepedent goal-orthogonal model; the stuff it learned is implicitly goal-oriented by being steered by the reward model
4. I'm way less optimistic about "aligning with mostly linguistic values". Quoting the linked post
Many alignment proposals seem to focus on interacting just with the conscious, narrating and rationalizing part of mind. If this is just a one part entangled in some complex interaction with other parts, there are specific reasons why this may be problematic.
One: if the “rider” (from the rider/elephant metaphor) is the part highly engaged with tracking societal rules, interactions and memes. It seems plausible the “values” learned from it will be mostly aligned with societal norms and interests of memeplexes, and not “fully human”.
This is worrisome: from a meme-centric perspective, humans are just a substrate, and not necessarily the best one. Also - a more speculative problem may be - schemes learning human memetic landscape and “supercharging it” with superhuman performance may create some hard to predict evolutionary optimization processes.
In other words, large part of what are the language-model-based values could be just what's memetically fit.
Also, in my impression, these 'verbal' values sometimes seem to basically hijack some deeper drive and channel it to meme-replicating efforts. ("So you do care? And have compassion? That's great - here is language-based analytical framework which maps your caring onto this set of symbols, and as a consequence, the best way how to care is to do effective altruism community building")
5. I don't think that "when asked, many humans want to try to reduce the influence of their ‘instinctual’ and habitual behaviours and instead subordinate more of their behaviours to explicit planning" is much evidence of anything. My guess is actually many humans would enjoy more of the opposite - being more embodied, spontaneous, instinctive, and this is also true for some of the smartest people around.
6. Broadly, I don't think the broad conclusion human values are primarily linguistic concepts encoded via webs of association and valence in the cortex learnt through unsupervised (primarily linguistic) learning is stable upon reflection.
I've been part or read enough debates with Eliezer to have some guesses how the argument would go, so I made the move of skipping several steps of double-crux to the area where I suspect actual cruxes lie.
I think exploring the whole debate-tree or argument map would be quite long, so I'll just try to gesture at how some of these things are connected, in my map.
- pivotal acts vs. pivotal processes
-- my take is people's stance on feasibility of pivotal acts vs. processes partially depends on continuity assumptions - what do you believe about pivotal acts?
- assuming continuity, do you expect existing non-human agents to move important parts of their cognition to AI substrates?
-- if yes, do you expect large-scale regulations around that?
--- if yes, will it be also partially automated?
- different route: assuming continuity, do you expect a lot of alignment work to be done partially by AI systems, inside places like OpenAI?
-- if at the same time this is a huge topic for the whole society, academia and politics, would you expect the rest of the world not trying to influence this?
- different route: assuming continuity, do you expect a lot of "how different entities in the world coordinate" to be done partially by AI systems?
-- if yes, do you assume technical features of the system matter? like, eg., multi-agent deliberation dynamics?
- assuming the world notices AI safety as problem (it did much more since writing this post)
-- do you expect large amount of attention and resources of academia and industry will be spent on AI alignment?
--- would you expect this will be somehow related to the technical problems and how we understand them?
--- eg do you think it makes no difference to the technical problem if 300 or 30k people work on it?
---- if it makes a difference, does it make a difference how is the attention allocated?
Not sure if the doublecrux between us would rest on the same cruxes, but I'm happy to try :)
Sorry but my rough impression from the post is you seem to be at least as confused about where the difficulties are as average of alignment researchers you think are not on the ball - and the style of somewhat strawmanning everyone & strong words is a bit irritating.
Maybe I'm getting it wrong, but it seems the model you have for why everyone is not on the ball is something like "people are approaching it too much from a theory perspective, and promising approach is very close to how empirical ML capabilities research works" & "this is a type of problem where you can just throw money at it and attract better ML talent".
I don't think these two insights are promising.
Also, again, maybe I'm getting it wrong, but I'm confused how similar you are imagining the current systems to be to the dangerous systems. It seems either the superhuman-level problems (eg not lying in a way no human can recognize) are somewhat continuous with current problems (eg not lying), and in that case it is possible to study them empirically. Or they are not. But different parts of the post seem to point in different directions. (Personally I think the problem is somewhat continuous, but many of the human-in-the-loop solutions are not, and just break down.)
Also, with what you find promising I'm confused what do you think the 'real science' to aim for is - on one hand it seems you think the closer the thing is to how ML is done in practice the more real science it is. On the other hand, in your view all deep learning progress has been empirical, often via dumb hacks and intuitions (this isn't true imo).
To be clear we are explicitly claiming it's likely not the only pressure - check footnotes 9 and 10 for refs.
On the topic thinking about it for yourself and posting further examples as comments...
This is GPT4 thinking about convergent properties, using the post as a prompt and generating 20 plausibly relevant convergences.
- Modularity: Biological systems, like the human brain, display modularity in their structure, allowing for functional specialization and adaptability. Modularity is also found in industries and companies, where teams and departments are organized to handle specific tasks.
- Hierarchical organization: In biological systems, hierarchical organization is common, with higher-level structures built from lower-level components. Hierarchies are also observed in companies and organizations.
- Recurrent connections: Neural networks in the brain have recurrent connections, which allow for feedback and information processing over time. Recurrent structures are also seen in supply chains and communication networks.
- Redundancy: Redundancy in biological systems provides robustness and fault tolerance. This concept is also utilized in industries for backup systems and fail-safe mechanisms.
- Adaptation: Biological systems adapt to changing environments to survive. Adaptation is also a crucial aspect for businesses and industries, where they must adjust to market trends and demands.
- Error correction: Biological systems have mechanisms to detect and correct errors (e.g., DNA repair mechanisms). Error correction is an essential aspect of modern communication systems and data storage.
- Network robustness: Biological networks (e.g., metabolic networks) exhibit robustness against perturbations. Similar robustness is desirable in communication and transportation networks.
- Small-world networks: Biological networks often display small-world properties, with short path lengths and high clustering. These properties are found in social networks and the internet.
- Scale-free networks: Biological networks often exhibit scale-free properties, with a few highly connected nodes (hubs) and many less connected nodes. Scale-free networks are also found in the internet, social networks, and citation networks.
- Sparsity: Neural networks in the brain are sparse, with many fewer connections than theoretically possible. Sparsity is also utilized in machine learning algorithms and data compression techniques.
- Decentralization: Biological systems often rely on decentralized control mechanisms. Decentralization can also be seen in blockchain technology and peer-to-peer networks.
- Homeostasis: Biological systems maintain internal stability through feedback mechanisms. Homeostasis is also relevant to industries, where maintaining stable operating conditions is essential.
- Oscillations: Oscillatory behavior is common in biological systems, such as circadian rhythms. Oscillations can also be observed in economic cycles and traffic patterns.
- Synchronization: Synchronization occurs in biological systems, such as the firing of neurons. Synchronization is also essential in distributed computing and communication systems.
- Division of labor: Division of labor is observed in biological systems (e.g., cells within multicellular organisms) and is a fundamental principle in industries and organizations.
- Cooperation and competition: Biological systems display a balance of cooperation and competition. These dynamics are also observed in economic systems, business strategies, and social interactions.
- Plasticity: Plasticity in biological systems allows for learning and adaptation. In industries, plasticity is important for innovation and adaptation to changing market conditions.
- Evolvability: Biological systems can evolve through mutation and selection. Evolvability is also relevant in industries, where companies must be able to innovate and adapt to survive.
- Self-organization: Self-organization occurs in biological systems, such as pattern formation in developing organisms. Self-organization is also observed in swarm intelligence and decentralized control systems.
- Energy efficiency: Biological systems are optimized for energy efficiency, as seen in metabolic pathways. Energy efficiency is also a crucial consideration in industries and technology development.
In my view
a) it broadly got the idea
b) the result are in my view in a better taste for understand agents than e.g. what you get from karma-ranked LW frontpage posts about AIs on an average day
- The concrete example seems just an example of scope insensitivity?
- I don't trust the analysis of point (2) in the concrete example. It seems plausible turning off the wifi has a bunch of second-order effects, like making various wifi-connected devices not to run various computations happening when online. Consumption of the router could be smaller part of the effect. It's also quite possible this is not the case, but it's hard to say without experiments.
- I would guess most aspiring rationalists are prone to something like an "inverse salt in pasta water fallacy": doing a Fermi estimate like the one described, concluding that everyone is stupid, and stopping to add salt to pasta water.
- but actually people add salt to pasta water because of the taste
- it's plausible salt actually decreases cooking time (or increases cooking time) - but the dominant effect will be chemical interaction with the food. Salt makes some food easier to cook or harder to cook
- overall, focusing on proving something about the non-dominant effect ... is not great
Translating it to my ontology:
1. Training against explicit deceptiveness trains some "boundary-like" barriers which will make simple deceptive thoughts labelled as such during training difficult
2. Realistically, advanced AI will need to run some general search processes. The barriers described at step 1. are roughly isomorphic to "there are some weird facts about the world which make some plans difficult to plan" (e.g. similar to such plans being avoided because they depend on extremely costly computations).
3. Given some set of a goal and strong enough capabilities, it seem likely the search will find unforeseen ways around the boundaries.
(the above may be different from what Nate means)
My response:
1. It's plausible people are missing this but I have some doubts.
2. How I think you get actually non-deceptive powerful systems seems different - deception is relational property between the system and the human, so the "deception" thing can be explicitly understood as negative consequence for the world, and avoided using "normal" planning cognition.
3. Stability of this depends on what the system does with internal conflict.
4. If the system stays in some corrigibility/alignment basin, this should be stable upon reflection / various meta-cognitive modifications. Systems in the basin resist self-modifications toward being incorrigible.
I don't think in this case the crux/argument goes directly through "the powerful alignment techniques" type of reasoning you describe in the "hopes for alignment".
The crux for your argument is the AIs - somehow -
a. want,
b. are willing to and
c. are able to coordinate with each other.
Even assuming AIs "wanted to", for your case to be realistic they would need to be willing to, and able to coordinate.
Given that, my question is, how is it possible AIs are able to trust each other and coordinate with each other?
My view here is that basically all proposed ways how AIs could coordinate and trust each other I've seen are dual use, and would also aid with oversight/alignment. To take an example from your post - e.g. by opening their own email accounts and emailing each other. Ok, in that case, can I just pretend to be an AI, and ask about the plans? Will the overseers see the mailboxes as well?
Not sure if what I'm pointing to is clear, so I'll try another way.
There is something like "how objectively difficult is to create trust between AIs" and "how objectively difficult is alignment". I don't think these parameters of the world are independent, and I do think that stories which treat them as completely independent are often unrealistic. (Or, at least, implicitly assume there some things which may differentially easy to coordinate a coup relative to making it easy to make something aligned or transparent)
Note that this belief about correlation does not depend on specific beliefs about how easy are powerful alignment techniques.
I would expect the "expected collapse to waluigi attractor" either not tp be real or mosty go away with training on more data from conversations with "helpful AI assistants".
How this work: currently, the training set does not contain many "conversations with helpful AI assistants". "ChatGPT" is likely mostly not the protagonist in the stories it is trained on. As a consequence, GPT is hallucinating "how conversations with helpful AI assistants may look like" and ... this is not a strong localization.
If you train on data where "the ChatGPT character"
- never really turns into waluigi
- corrects to luigi when experiencing small deviations
...GPT would learn that apart from "human-like" personas and narrative fiction there is also this different class of generative processes, "helpful AI assistants", and the human narrative dynamics generally does not apply to them. [1]
This will have other effects, which won't necessarily be good - like GPT becoming more self-aware - but will likely fix most of waluigi problem.
From active inference perspective, the system would get stronger beliefs about what it is, making it more certainly the being it is. If the system "self-identifies" this way, it creates a a pretty deep basin - cf humans. [2]
[1] From this perspective, the fact that the training set is now infected with Sydney is annoying.
[2] If this sounds confusing ... sorry don't have a quick and short better version at the moment.
Seems a bit like too general counterargument against more abstracted views?
1. Hamiltonian mechanics is almost an unfalsifiable tautology
2. Hamiltonian mechanics is applicable to both atoms and starts. So it’s probably a bad starting point for understanding atoms
3. It’s easier to think of a system of particles in 3d space as a system of particles in 3d space, and not as Hamiltonian mechanics system in an unintuitive space
4. Likewise, it’s easier to think of systems involving electricity using simple scalar potential and not the bring in the Hamiltonian
5. It’s very important to distinguish momenta and positions —and Hamiltonian mechanics textbooks make it more confusing
7. Lumping together positions and momenta is very un-natural. Positions are where particle is, momentum is where it moves
A highly compressed version of what the disagreements are about in my ontology of disagreements about AI safety...
- crux about continuity; here GA mostly has the intuition "things will be discontinuous" and this manifests in many guesses (phase shifts, new ways of representing data, possibility to demonstrate overpowering the overseer, ...); Paul assumes things will be mostly continuous, with a few exceptions which may be dangerous
- this seems similar to typical cruxes between Paul and e.g. Eliezer (also in my view this is actually decent chunk of disagreements: my model of Eliezer predicts Eliezer would actually update toward more optimistic views if he believed "we will have more tries to solve the actual problems, and they will show in a lab setting")
- possible crux about x-risk from the broader system (e.g. AI powered cultural evolution); here it's unclear who is exactly where in this debate
- I don't think there is any neat public debate on this, but I usually disagree with Eliezer's and similar "orthodox" views about the relative difficulty & expected neglectedness (I expect narrow single ML system "alignment" to be difficult but solvable and likely solved by default, because incentives to do so; whole-world-alignment / multi-multi to be difficult and with bad results by default)
(there are also many points of agreement)
I'm not really convinced by the linked post
- the chart is from a someone selling financial advice and illustrated elo ratings of chess programs differ from e.g. wikipedia ("Stockfish estimated Elo rating is over 3500") (maybe it's just old?)
- linked interview in the "yes" answer is from 2016
- elo ratings are relative to other players; it is not trivial to directly compare cyborgs and AI: engine ratings are usually computed in tournaments where programs run with same hardware limits
In summary, in my view in something like "correspondence chess" the limit clearly is "AIs ~ human+AI teams" / "human contribution is negligible" .... the human can just do what the engine says.
My guess is the current state is: you could be able to compensate what the human contributes to the team by just more hardware. (i.e. instead of the top human part of the cyborg, spending $1M on compute would get you better results). I'd classify this as being in the AI period, for most practical purposes
Also... as noted by Lone Pine, it seems the game itself becomes somewhat boring with increased power of the players, mostly ending in draws.
Yes, the non-stacking issue in the alignment community is mostly due to the nature of the domain
But also partly due to the LessWrong/AF culture and some rationalist memes. For example, if people had stacked on Friston et. al., the understanding of agency and predictive systems (now called "simulators") in the alignment community could have advanced several years faster. However, people seem to prefer reinventing stuff, and formalizing their own methods. It's more fun... but also more karma.
In conventional academia, researchers are typically forced to stack. If progress is in principle stackable, and you don't do it, it won't be published. This means that even if your reinvention of a concept is slightly more elegant or intuitive to you, you still need to stack. This seems to go against what's fun: I think I don't know any researcher who would be really excited about literature reviews and prefer that over thinking and writing their own ideas. In the absence of incentives for stacking ... or actually presence of incentives against stacking ... you get a lot of non-stacking AI alignment research.
Thanks for the comment. I haven't noticed your preprint before your comment, but it's probably worth noting I've described the point of this post in a facebook post on 8th Dec 2022; this LW/AF post is just a bit more polished and referenceable. As your paper had zero influence on writing this, and the content predates your paper by a month, I don't see a clear case for citing your work.
Mostly agree - my gears-level model is the conversations listed tend to hit Limits to Legibility constraints, and marginal returns drop to very low.
For people interested in something like "Double Crux" on what's called here "entrenched views", in my experience what has some chance of working is getting as much understanding as possible in one mind, and then attempting to match the ontologies and intuitions. (I had some success in this and "Drexlerian" vs "MIRIesque" views)
The analogy I had in mind is not so much in exact nature of the problem, but in the aspect it's hard to make explicit precise models of such situations in advance. In case of nukes, consider the fact that smartest minds of the time, like von Neumann or Feynman, spent decent amount of time thinking about the problems, had clever explicit models, and were wrong - in case of von Neumann to the extent that if US followed his advice, they would have launched nuclear armageddon.
One big difference is GoF currently does not seem that dangerous to governments. If you look on it from a perspective not focusing on the layer of individual humans as agents, but instead states, corporations, memplexes and similar creatures as the agents, GoF maybe does not look that scary? Sure, there was covid, but while it was clearly really bad for humans, it mostly made governments/states relatively stronger.
Taking this difference into account, my model was and still is governments will react to AI.
This does not imply reacting in a helpful way, but I think whether the reaction will be helpful, harmful or just random is actually one of the higher variance parameters, and a point of leverage. (And the common-on-LW stance governments are stupid and evil and you should mostly ignore them is unhelpful in both understanding and influencing the situation.)
- The GoF analogy is quite weak.
- "What exactly" seems a bit weird type of question. For example, consider nukes: it was hard to predict what exactly is the model by which governments will not blow everyone up after use of nukes in Japan. But also: while the resulting equilibrium is not great, we haven't died in nuclear WWIII so far.
Empirically, evolution did something highly similar.
While I have a lot of sympathy for the view expressed here, it seems confused in a similar way to straw consequentialism, just in an opposite direction.
Using the terminology from Limits to Legibility, we can roughly split the way how we do morality into two types of thinking
- implicit / S1 / neural-net type / intuitive
- explicit / S2 / legible
What I agree with:
In my view, the explicit S2 type processing basically does not have the representation capacity to hold "human values", and the non-legible S1 neural-net boxes are necessary for being moral.
Attempts to fully replace the S1 boxes are stupid and lead to bad outcomes.
Training the S1 boxes to be better is often a better strategy than "more thoughts".
What I don't agree with:
You should rely just on the NN S1 processing. (Described in phenomenology way "get moral perception – the ability to recognize, in the heat of the moment, right from wrong" + rely on this)
In my view, the neural-net type of processing has different strength and weaknesses from the explicit reasoning, and they are often complementary.
- both systems provide some layer of reflectivity
- NNs tend to suffer from various biases; often, it is possible to abstractly understand where to expect the bias
- NN represent what's in the training data; often, explicit models lead to better generalization
- explicit legible models are more communicable
"moral perception" or "virtues" ...is not magic, bit also just a computation running on brains.
Also: I think the usual philosophical discussion about what's explanatorily fundamental is somewhat stupid. Why? Consider example from physics, where you can describe some mechanic phenomena using classical terminology of forces, or using Hamiltonian mechanics, or Lagranigan mechanics. If we were as confused about physics as about moral philosophies, there would likely be some discussion about what is fundamental. As we are less confused, we understand the relations and isomorphisms.
What's described as An ICF technique is just that, one technique among many.
ICF does not make the IFS assumption that there is some "neutral self". It makes a prediction that when you unblend few parts from the whole, there is still a lot of power in "the whole". It also makes the claim that in typical internal conflicts and tensions, there just a few parts which are really activated (and not, e.g., 20). Both seems experimentally verifiable (at least in phenomenological sense) - and true.
In my view there is a subtle difference between the "self" frame and "the whole"/"council" frame. The IFS way of talking about "self" seems to lead some IFS practitioners to assume that there is some "self" agent living basically at the same layer of agency as parts.
ICF also makes some normative claims, which make it different from "any type of parts work": the normative claims are about kindness, cooperation and fairness. If you wish, I can easily describe/invent some partswork protocols which would be un-ICF and in my view risky / flawed from ICF perspective. For example, I've somewhat negative prior on techniques trying to do some sort of verbal dialogue in blended state, or techniques doing basically internal blackmail.
In addition, IFS and ICF both seem to emphasize "conversation" as a primary modality, whereas other parts work modalities (e.g. Somatic Experiencing) emphasize other modalities when working with parts, such as somatic, metaphorical, or primal. Again, there's an assumption here about what parts are and how they should be worked with, around the primacy of particular ways of thinking and relating which is heavily (if unconsciously) influenced by the prevalance of IFS and its' way of working.
Not really - the post mentions basically all these directions of variance among ICF techniques:
There are several general axes along which ICF techniques can vary, depending on the person and the circumstance:
- Structuredness
- More: you follow formal idealised steps
- Less: you just sit and stuff happens, with no reference to any steps
- Goal-directedness
- More: you intend to make a decision or resolve a conflict, and you do
- Less: you just want to spend some time with yourself
- Legibility
- More: you make your experience explicit to yourself (and the facilitator if there is one), and have a clear story when you close about what happened
- Less: you’re mostly in direct physical experience, or metaphor, or visualisation; you don’t make any of this very explicit to yourself (or a facilitator); you don’t have much of a story about what happened when you close
...
There are many, many different possible ICF techniques (most of which are presumably undiscovered and could be found via experimentation). To give some examples:
- Different media: drawing, writing, speaking, movement, staying silent…
Maybe the confusion is because just one technique was described explicitly in the post.
My technical explanation for why not direct consequentialism is somewhat different - deontology and virtue ethics are effective theories . You are not almost unbounded superintelligence => you can't rely on direct consequentialism.
Why virtue ethics works? You are mostly a predictive processing system. Guess at simple PP story:
PP is minimizing prediction error. If you take some unvirtuous action, like, e.g. stealing a little, you are basically prompting the pp engine to minimize total error between the action taken, your self-model / wants model, and your world-model. By default, the result will be, your future self will be marginally more delusional ("I assumed they wanted to give it to me anyway") + bit more ok with stealing ("Everyone else is also stealing.")
The possible difference here is in my view you are not some sort of 'somewhat corrupt consequentialist VNM planner' where if you are strong enough, you will correct the corruption with some smart metacognition. What's closer to reality is the "you" believing you are somewhat corrupt consequentialist is to a large extent hallucination of a system for which "means change the ends" and "means change what you believe" are equally valid.
If the strength required to wield the tool of reflective consequentialism was immense, like the strength required to wield Thor's hammer, then I wouldn't hold this position.
The hammer gets heavier with the stakes of the decisions you are making. Also the strength to hold it mostly does not depend on how smart you are, but on how virtuous you are.
Nice, thanks for the pointer.
My overall guess is after surfing through / skimming philosophy literature on this for many hours is you can probably find all core ideas of this post somewhere in it, but it's pretty frustrating - scattered in many places and diluted by things which are more confused.