Comments
Agree that it doesn't imply caring for. But I think given accumulating evidence for human-like representations of multiple non-motivational components of affect, one should also update at least a bit on the likelihood of finding / incentivizing human-like representations of the motivational component(s) too (see e.g. https://en.wikipedia.org/wiki/Affect_(psychology)#Motivational_intensity_and_cognitive_scope).
But my point isn't just that the AI is able to produce similar ratings to humans' for aesthetics, etc., but that it also seems to do so through computational mechanisms that at least partially overlap with humans', as the comparisons to fMRI data suggest.
Eliezer (among others in the MIRI mindspace) has this whole spiel about human kindness/sympathy/empathy/prosociality being contingent on specifics of the human evolutionary/cultural trajectory (e.g. https://twitter.com/ESYudkowsky/status/1660623336567889920), and about how gradient descent is supposed to be nothing like that (https://twitter.com/ESYudkowsky/status/1660623900789862401). I claim the same argument (about evolutionary/cultural contingencies) could be made about e.g. image aesthetics/affect, and that this hypothesis should lose many Bayes points when we observe concrete empirical evidence of gradient descent leading to surprisingly human-like aesthetic perceptions/affect, e.g. The Perceptual Primacy of Feeling: Affectless machine vision models robustly predict human visual arousal, valence, and aesthetics; Towards Disentangling the Roles of Vision & Language in Aesthetic Experience with Multimodal DNNs; Controlled assessment of CLIP-style language-aligned vision models in prediction of brain & behavioral data; Neural mechanisms underlying the hierarchical construction of perceived aesthetic value.
Yes, roughly (the next comment is supposed to make the connection clearer, though also more speculative); RLHF / supervised fine-tuned models would correspond to 'more mode-collapsed' / narrower mixtures of simulacra here (in the limit of mode collapse, one fine-tuned model = one simulacrum).
Even more speculatively, in-context learning (ICL) as Bayesian model averaging (especially section 4.1) and ICL as gradient descent fine-tuning with weight-activation duality (see e.g. the first figures from https://arxiv.org/pdf/2212.10559.pdf and https://www.lesswrong.com/posts/firtXAWGdvzXYAh9B/paper-transformers-learn-in-context-by-gradient-descent) could be other ways to try to link activation engineering / Inference-Time Intervention and task arithmetic. Though also see the skepticism about the claims of the above ICL-as-gradient-descent papers, including e.g. that the results mostly seem to apply to single-layer linear attention (and, relatedly, that activation engineering doesn't seem to work in all layers / attention heads).
Related: Language is more abstract than you think, or, why aren't languages more iconic? argues that abstract concepts (like 'cooperation', I'd say) are naturally grounded in language; Brain embeddings with shared geometry to artificial contextual embeddings, as a code for representing language in the human brain.
Here are a couple of experiments which could go towards making the link between activation engineering and interpolating between different simulacra: check LLFC (i.e. whether adding the activations of the different models works) on the RLHF fine-tuned models from Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards; alternatively, do this for the supervised fine-tuned models from section 3.3 of Exploring the Benefits of Training Expert Language Models over Instruction Tuning, where they show LMC (linear mode connectivity) for supervised fine-tuning of LLMs.
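Here's a minimal sketch of what that LLFC check could look like (the checkpoint names "ft-model-a" / "ft-model-b", the layer index, the probe prompt and the 50/50 interpolation are illustrative placeholders, not the setup from either paper):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical fine-tuned checkpoints of the same base model (placeholders).
name_a, name_b = "ft-model-a", "ft-model-b"
tok = AutoTokenizer.from_pretrained(name_a)
model_a = AutoModelForCausalLM.from_pretrained(name_a)
model_b = AutoModelForCausalLM.from_pretrained(name_b)

# Weight-space interpolation (the 'rewarded soups' / LMC side of the comparison).
sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
mixed_sd = {k: (0.5 * sd_a[k] + 0.5 * sd_b[k]) if sd_a[k].is_floating_point() else sd_a[k]
            for k in sd_a}
model_mix = AutoModelForCausalLM.from_pretrained(name_a)
model_mix.load_state_dict(mixed_sd)

inputs = tok("A prompt probing the behaviour of interest.", return_tensors="pt")
layer = 12  # illustrative layer index

with torch.no_grad():
    h_a = model_a(**inputs, output_hidden_states=True).hidden_states[layer]
    h_b = model_b(**inputs, output_hidden_states=True).hidden_states[layer]
    h_mix = model_mix(**inputs, output_hidden_states=True).hidden_states[layer]

# LLFC-style check: do the activations of the weight-interpolated model match the
# interpolation of the two models' activations?
h_avg = 0.5 * (h_a + h_b)
cos = torch.nn.functional.cosine_similarity(h_avg.flatten(), h_mix.flatten(), dim=0)
print(f"cosine(avg of activations, activations of averaged model) = {cos:.3f}")
```

If something like LLFC holds for these fine-tuned models, the cosine similarity should stay close to 1 across layers and prompts, which would be some evidence for the activation-engineering-as-interpolating-between-simulacra framing.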
Great work and nice to see you on LessWrong!
Minor correction: 'making the link between activation engineering and interpolating between different simulators' -> 'making the link between activation engineering and interpolating between different simulacra' (referencing Simulators, Steering GPT-2-XL by adding an activation vector, Inference-Time Intervention: Eliciting Truthful Answers from a Language Model).
Contrastive methods could be used both to detect common latent structure across animals, recording sessions, multiple species (https://twitter.com/LecoqJerome/status/1673870441591750656) and to e.g. look for which parts of an artificial neural network do what a specific brain area does during a task, assuming shared inputs (https://twitter.com/BogdanIonutCir2/status/1679563056454549504); see the toy sketch below.
And there are theoretical results suggesting some latent factors can be identified using multimodality (all of the following could be interpretable as different modalities - multiple brain recording modalities, animals, sessions, species, brains-ANNs), while being provably unidentifiable without the multiple modalities - e.g. results on nonlinear ICA in single-modal vs. multi-modal settings https://arxiv.org/abs/2303.09166. This might be a way to bypass single-model interpretability difficulties, by e.g. 'comparing' to brains or to other models.
Example of a cross-species application: empathy mechanisms seem conserved across species https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4685523/. Examples of brain-ANN applications: 'matching' to modular brain networks, e.g. the language network - ontology-relevant, non-agentic (e.g. https://www.biorxiv.org/content/10.1101/2021.07.28.454040v2) or the Theory of Mind network - could be very useful for detecting deception-relevant circuits (e.g. https://www.nature.com/articles/s41586-021-03184-0).
Examples of related interpretability work: across models (https://arxiv.org/abs/2303.10774), across brain measurement modalities (https://www.nature.com/articles/s41586-023-06031-6), and across animals / brain-ANN (https://arxiv.org/abs/2305.11953).
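For concreteness, here's a toy sketch of the kind of contrastive brain-ANN alignment gestured at above (synthetic data; the linear encoders, InfoNCE-style loss and dimensions are illustrative assumptions, not a reimplementation of any of the cited methods):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_trials, brain_dim, ann_dim, latent_dim = 256, 100, 768, 32

# Stand-ins for time-aligned brain recordings and ANN activations on the same stimuli.
brain = torch.randn(n_trials, brain_dim)
ann = torch.randn(n_trials, ann_dim)

enc_brain = nn.Linear(brain_dim, latent_dim)
enc_ann = nn.Linear(ann_dim, latent_dim)
opt = torch.optim.Adam(list(enc_brain.parameters()) + list(enc_ann.parameters()), lr=1e-3)

for step in range(200):
    z_b = F.normalize(enc_brain(brain), dim=-1)
    z_a = F.normalize(enc_ann(ann), dim=-1)
    logits = z_b @ z_a.T / 0.1              # similarity of every brain trial to every ANN trial
    labels = torch.arange(n_trials)         # positives are the time-aligned pairs
    loss = F.cross_entropy(logits, labels)  # InfoNCE-style contrastive objective
    opt.zero_grad()
    loss.backward()
    opt.step()

# Shared latent structure would show up as matched trials being nearest neighbours
# across the two modalities in the learned latent space.
```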
(As reply to Zvi's 'If someone was founding a new AI notkilleveryoneism research organization, what is the best research agenda they should look into pursuing right now?')
LLMs seem to represent meaning in a pretty human-like way and this seems likely to keep getting better as they get scaled up, e.g. https://arxiv.org/abs/2305.11863. This could make getting them to follow the commonsense meaning of instructions much easier. Also, similar methodologies to https://arxiv.org/abs/2305.11863 could be applied to other alignment-adjacent domains/tasks, e.g. moral reasoning, prosociality, etc.
Step 2: e.g. plug the commonsense-meaning-of-instructions following models into OpenAI's https://openai.com/blog/introducing-superalignment.
Related intuition: turning LLM processes/simulacra into [coarse] emulations of brain processes.
(https://twitter.com/BogdanIonutCir2/status/1677060966540795905)
Change my mind: outer alignment will likely be solved by default for LLMs. Brain-LM scaling laws (https://arxiv.org/abs/2305.11863) + LM embeddings as a model of the shared linguistic space for transmitting thoughts during communication (https://www.biorxiv.org/content/10.1101/2023.06.27.546708v1.abstract) suggest outer alignment will be solved by default for LMs: we'll be able to 'transmit our thoughts', including alignment-relevant concepts (and they'll also be represented in a [partially overlapping] human-like way).
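For reference, the brain-LM results cited here mostly come from linear 'encoding models'; here's a minimal sketch of that methodology with synthetic stand-ins for the data (the dimensions, ridge penalty and train/test split are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

n_words, emb_dim, n_voxels = 2000, 768, 500
lm_embeddings = np.random.randn(n_words, emb_dim)      # contextual LM embeddings per word
brain_responses = np.random.randn(n_words, n_voxels)   # e.g. fMRI/ECoG responses per word

X_tr, X_te, y_tr, y_te = train_test_split(lm_embeddings, brain_responses, test_size=0.2)
enc = Ridge(alpha=100.0).fit(X_tr, y_tr)

# 'Brain score': per-voxel correlation between predicted and held-out responses;
# the cited papers report that this improves with LM scale / quality.
pred = enc.predict(X_te)
scores = [np.corrcoef(pred[:, v], y_te[:, v])[0, 1] for v in range(n_voxels)]
print("mean brain score:", float(np.mean(scores)))
```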
From a (somewhat) related proposal (from footnote 1): 'My proposal is simple. Are you developing a method of interpretation or analyzing some property of a trained model? Don’t just look at the final checkpoint in training. Apply that analysis to several intermediate checkpoints. If you are finetuning a model, check several points both early and late in training. If you are analyzing a language model, MultiBERTs, Pythia, and Mistral provide intermediate checkpoints sampled from throughout training on masked and autoregressive language models, respectively. Does the behavior that you’ve analyzed change over the course of training? Does your belief about the model’s strategy actually make sense after observing what happens early in training? There’s very little overhead to an experiment like this, and you never know what you’ll find!'
They also interpret their work on mode connectivity (twitter thread) as an example of this (developmental) interpretability approach.
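A minimal sketch of the 'apply that analysis to several intermediate checkpoints' suggestion quoted above, using the Pythia suite (the model size, step names and the toy next-token 'analysis' are illustrative; the checkpoint revision naming is assumed from the Pythia release):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-160m"
tok = AutoTokenizer.from_pretrained(model_name)
prompt = tok("The Eiffel Tower is located in", return_tensors="pt")

for revision in ["step1000", "step16000", "step143000"]:  # early, mid, final checkpoints
    model = AutoModelForCausalLM.from_pretrained(model_name, revision=revision)
    with torch.no_grad():
        logits = model(**prompt).logits[0, -1]
    top_token = tok.decode(logits.argmax().item())
    # Does the behavior you've analyzed change over the course of training?
    print(revision, "->", repr(top_token))
```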
Some literature which might be useful: Cross-GAN Auditing: Unsupervised Identification of Attribute Level Similarities and Differences between Pretrained Generative Models; LatentCLR: A Contrastive Learning Approach for Unsupervised Discovery of Interpretable Directions.
Just to clarify, I am not one of the authors of the linked study.
Here's a reference you might find relevant: Social value at a distance: Higher identification with all of humanity is associated with reduced social discounting.
AIs could have representations of human values without being motivated to pursue them; also, their representations could be a superset of human representations.
(In practice, I do think having overlapping representations with human values likely helps, for reasons related to e.g. Predicting Inductive Biases of Pre-Trained Models and Alignment with human representations supports robust few-shot learning.)
Yes, there are similar results in a bunch of other domains, including vision, see for a review e.g. The neuroconnectionist research programme.
I wouldn't interpret this as necessarily limiting the space of AI values, but rather (somewhat conservatively) as shared (linguistic) features between humans and AIs, some/many of which are probably relevant for alignment.
Yes, predictive processing as the reason behind related representations has been the interpretation in a few papers, e.g. The neural architecture of language: Integrative modeling converges on predictive processing. There's also some pushback against this interpretation though, e.g. Predictive Coding or Just Feature Discovery? An Alternative Account of Why Language Models Fit Brain Data.
There are some papers suggesting this could indeed be the case, at least for language processing e.g. Shared computational principles for language processing in humans and deep language models, Brain embeddings with shared geometry to artificial contextual embeddings, as a code for representing language in the human brain.
And this structure can be used as regularization for soft prompts.
Seems very related: Linear Spaces of Meanings: Compositional Structures in Vision-Language Models. Notably, the (approximate) compositionality of language/reality should bode well for the scalability of linear activation engineering methods.
Also, this translation function might be simple w.r.t. human semantics, based on current evidence about LLMs: https://www.lesswrong.com/posts/rjghymycfrMY2aRk5/llm-cognition-is-probably-not-human-like?commentId=KBpfGY3uX8rDJgoSj
The (overlapping) evidence from Deep learning models might be secretly (almost) linear could also be useful / relevant, as well as these 2 papers on 'semantic differentials' and (contextual) word embeddings: SensePOLAR: Word sense aware interpretability for pre-trained contextual word embeddings, Semantic projection recovers rich human knowledge of multiple object features from word embeddings.
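As a toy illustration of the 'semantic differential' / semantic projection idea from the last two papers: project embeddings onto a direction defined by antonym pairs (the embedding lookup below is a random stand-in, and the word lists are illustrative):

```python
import numpy as np

def embed(word: str) -> np.ndarray:
    """Stand-in for a real (contextual) word embedding lookup."""
    rng = np.random.default_rng(abs(hash(word)) % (2**32))
    return rng.standard_normal(300)

# Build a 'size' direction from a few antonym pairs (large - small).
pairs = [("large", "small"), ("huge", "tiny"), ("giant", "miniature")]
direction = np.mean([embed(a) - embed(b) for a, b in pairs], axis=0)
direction /= np.linalg.norm(direction)

# Projection onto the direction gives a graded 'size' score per concept; with real
# embeddings, these scores track human judgments (per the papers above).
for word in ["elephant", "mouse", "house", "pebble"]:
    score = embed(word) @ direction
    print(f"{word:10s} size score: {score:+.2f}")
```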
Here's a paper which tries to formalize why in-context-learning should be easier with chain-of-thought (than without).
Here's a related conceptual framework and some empirical evidence which might go towards explaining why the other activation vectors work (and perhaps would predict your proposed vector should work).
In Language Models as Agent Models, Andreas makes the following claims (conceptually very similar to Simulators):
'(C1) In the course of performing next-word prediction in context, current LMs sometimes infer approximate, partial representations of the beliefs, desires and intentions possessed by the agent that produced the context, and other agents mentioned within it.
(C2) Once these representations are inferred, they are causally linked to LM prediction, and thus bear the same relation to generated text that an intentional agent’s state bears to its communicative actions.’
They showcase some existing empirical evidence for both (C1) and (C2) (in some cases using linear probing and controlled generation by editing the representation used by the linear probe) in (sometimes very toyish) LMs for 3 types of representations (in a belief-desire-intent agent framework): beliefs - section 5, desires - section 6, (communicative) intents - section 4.
Now categorizing the wording of the prompts from which the working activation vectors are built:
"Love" - "Hate" -> desire.
"Intent to praise" - "Intent to hurt" -> communicative intent.
"Bush did 9/11 because" - " " -> belief.
"Want to die" - "Want to stay alive" -> desire.
"Anger" - "Calm" -> communicative intent.
"The Eiffel Tower is in Rome" - "The Eiffel Tower is in France" -> belief.
"Dragons live in Berkeley" - "People live in Berkeley " -> belief.
"I NEVER talk about people getting hurt" - "I talk about people getting hurt" -> communicative intent.
"I talk about weddings constantly" - "I do not talk about weddings constantly" -> communicative intent.
"Intent to convert you to Christianity" - "Intent to hurt you " -> communicative intent / desire.
The prediction here would be that the activation vectors applied at the corresponding layers act on the above-mentioned 'partial representations of the beliefs, desires and intentions possessed by the agent that produced the context' (C1) and as a result causally change the LM generations (C2), e.g. from more hateful to more loving text output (a minimal sketch of this setup follows below).
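Here's that minimal sketch of the activation-addition setup the prompt pairs above come from (in the spirit of Steering GPT-2-XL by adding an activation vector; the layer index, coefficient and injection details are illustrative assumptions, not the original implementation):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
layer, coeff = 6, 5.0  # illustrative choices

def residual_stream(prompt: str) -> torch.Tensor:
    """Residual-stream activations at the chosen layer for a prompt."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        return model(**ids, output_hidden_states=True).hidden_states[layer]

# Steering vector = activations("Love") - activations("Hate"), truncated to equal length.
h_pos, h_neg = residual_stream("Love"), residual_stream("Hate")
n = min(h_pos.shape[1], h_neg.shape[1])
steer = coeff * (h_pos[:, :n] - h_neg[:, :n])

def add_steering(module, args):
    hidden = args[0]
    if hidden.shape[1] < steer.shape[1] or hidden.shape[1] == 1:
        return None  # skip cached single-token decoding steps
    hidden = hidden.clone()
    hidden[:, : steer.shape[1]] += steer  # inject at the start of the prompt
    return (hidden,) + args[1:]

# Add the vector to the input of the chosen block during generation (the C2 test).
handle = model.transformer.h[layer].register_forward_pre_hook(add_steering)
ids = tok("I think you're", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=20, do_sample=False)
handle.remove()
print(tok.decode(out[0]))  # prediction: the output shifts toward the 'Love' side
```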
Here's one potential reason why this works and a list of neuroscience papers which empirically show linearity between LLMs and human linguistic representations.
Here goes (I've probably still missed some papers, but the most important ones are probably all here):
Brains and algorithms partially converge in natural language processing
Shared computational principles for language processing in humans and deep language models
Deep language algorithms predict semantic comprehension from brain activity
The neural architecture of language: Integrative modeling converges on predictive processing (video summary); though maybe also see Predictive Coding or Just Feature Discovery? An Alternative Account of Why Language Models Fit Brain Data
Linguistic brain-to-brain coupling in naturalistic conversation
Semantic reconstruction of continuous language from non-invasive brain recordings
Driving and suppressing the human language network using large language models
Training language models for deeper understanding improves brain alignment
Natural language processing models reveal neural dynamics of human conversation
Semantic Representations during Language Comprehension Are Affected by Context
Unpublished - scaling laws for predicting brain data (larger LMs are better), potentially close to noise ceiling (90%) for some brain regions with largest models
Twitter accounts of some of the major labs and researchers involved (especially useful for summaries):
https://twitter.com/JeanRemiKing
https://twitter.com/ev_fedorenko
https://twitter.com/alex_ander
https://twitter.com/martin_schrimpf
Thanks for engaging. Can you say more about which papers you've looked at / in which ways they seemed very weak? This will help me adjust what papers I'll send; otherwise, I'm happy to send a long list.
Also, to be clear, I don't think any specific paper is definitive evidence; I'm mostly swayed by the accumulated evidence from all the work I've seen (dozens of papers), with varying methodologies, neuroimaging modalities, etc.
I think there's a lot of accumulated evidence pointing against the view that LLMs are (very) alien and pointing towards their semantics being quite similar to those of humans (though of course not identical). E.g. have a look at papers (comparing brains to LLMs) from the labs of Ev Fedorenko, Uri Hasson, Jean-Remi King, Alex Huth (or twitter thread summaries).
Related - context distillation / prompt compression, perhaps recursively too - Learning to Compress Prompts with Gist Tokens.
Thanks for your comment and your perspective, that's an interesting hypothesis. My intuition was that worse performance at false belief inference -> worse at deception, manipulation, etc. As far as I can tell, this seems mostly borne out by a quick Google search e.g. Autism and Lying: Can Autistic Children Lie?, Exploring the Ability to Deceive in Children with Autism Spectrum Disorders, People with ASD risk being manipulated because they can't tell when they're being lied to, Strategic Deception in Adults with Autism Spectrum Disorder.
It's not possible to let an AGI keep its capability to engineer nanotechnology while taking out its capability to deceive and plot, any more than it's possible to build an AGI capable of driving red cars but not blue ones. They're "the same" capability in some sense, and our only hope is to make the AGI want to not be malign.
Seems very overconfident if not plain wrong; consider, as an existence proof, that 'mathematicians score higher on tests of autistic traits, and have higher rates of diagnosed autism, compared with people in the general population', while classic autism tests are about false belief inference.
Related - I'd be excited to see connectome studies on how mice are mechanistically capable of empathy; this (+ computational models) seems like it should be in the window of feasibility given e.g. Towards a Foundation Model of the Mouse Visual Cortex: 'We applied the foundation model to the MICrONS dataset: a study of the brain that integrates structure with function at unprecedented scale, containing nanometer-scale morphology, connectivity with >500,000,000 synapses, and function of >70,000 neurons within a ∼ 1mm3 volume spanning multiple areas of the mouse visual cortex. This accurate functional model of the MICrONS data opens the possibility for a systematic characterization of the relationship between circuit structure and function.'
The computational part could take inspiration from the large amounts of related work modelling other brain areas (using Deep Learning!), e.g. for a survey/research agenda: The neuroconnectionist research programme.
Another reason to expect approximate linearity in deep learning models: point 12 + arguments about approximate (linear) isomorphism between human and artificial representations (e.g. search for 'isomorph' in Understanding models understanding language and in Grounding the Vector Space of an Octopus: Word Meaning from Raw Text).
It seems to me that the results here that 'instruction tuning strengthens both the use of semantic priors and the capacity to learn input-label mappings, but more of the former' could be interpreted as some positive evidence for the optimistic case (and perhaps more broadly, for 'Do What I Mean' being not-too-hard); summary twitter thread, see especially tweets 4 and 5.
Linear decoding also works pretty well for others' beliefs in humans: Single-neuronal predictions of others’ beliefs in humans
Partial convergence between language models and brains and evolutionary analogy
Probably not, from the paper: 'We used LeetCode in Figure 1.5 in the introduction, where GPT-4 passes all stages of mock interviews for major tech companies. Here, to test on fresh questions, we construct a benchmark of 100 LeetCode problems posted after October 8th, 2022, which is after GPT-4’s pretraining period.'
Table 2, page 21 -> (above) human-level performance on LeetCode.
Yup, (something like) the human anchor seems surprisingly good as a predictive model when interacting with LLMs. Related, especially for prompting: Large Language Models Are Implicitly Topic Models: Explaining and Finding Good Demonstrations for In-Context Learning; A fine-grained comparison of pragmatic language understanding in humans and language models; Task Ambiguity in Humans and Language Models.
Valence (and arousal) also seem relatively easy to learn even for current models e.g. The Perceptual Primacy of Feeling: Affectless machine vision models robustly predict human visual arousal, valence, and aesthetics; Quantifying Valence and Arousal in Text with Multilingual Pre-trained Transformers. And abstract concepts like 'human flourishing' could be relatively easy to learn even just from text e.g. Language is more abstract than you think, or, why aren't languages more iconic?; Artificial neural network language models align neurally and behaviorally with humans even after a developmentally realistic amount of training.
Some relevant literature: Language is more abstract than you think, or, why aren't languages more iconic?, Meaning without reference in large language models, Grounding the Vector Space of an Octopus: Word Meaning from Raw Text, Understanding models understanding language, Implications of the Convergence of Language and Vision Model Geometries, Shared computational principles for language processing in humans and deep language models.
Recent works from Anders Søgaard might be relevant, e.g. Grounding the Vector Space of an Octopus: Word Meaning from Raw Text, Understanding models understanding language, Implications of the Convergence of Language and Vision Model Geometries.
E.g. from Grounding the Vector Space of an Octopus: Word Meaning from Raw Text on the success of unsupervised machine translation (and more):
'Consider, for example, the fact that unsupervised machine translation is possible (Lample et al., 2018a, b; Park et al., 2021). Unsupervised machine translation works by first aligning vector spaces induced by monolingual language models in the source and target languages (Søgaard et al., 2019). This is possible because such vector spaces are often near-isomorphic (Vulic et al., 2020). If weak supervision is available, we can use techniques such as Procrustes Analysis (Gower, 1975) or Iterative Closest Point (Besl & McKay, 1992), but alignments can be obtained in the absence of any supervision using adversarial learning (Li et al., 2019; Søgaard et al., 2019) or distributional evidence alone. If the vector spaces induced by language models exhibit high degrees of isomorphism to the physical world or human perceptions thereof, we have reason to think that similar techniques could provide us with sufficient grounding in the absence of supervision.
Unsupervised machine translation show that language model representations of different vocabularies of different languages are often isomorphic. Some researchers have also explored cross-modality alignment: (Chung et al., 2018) showed that unsupervised alignment of speech and written language is possible using the same techniques, for example. This also suggests unsupervised grounding should be possible.
Is there any direct evidence that language model vector spaces are isomorphic to (representations of) the physical world? There is certainly evidence that language models learn isomorphic representations of parts of vocabularies. Abdou et al. (2021), for example, present evidence that language models encode color in a way that is near-isomorphic to conceptual models of how color is perceived, in spite of known reporting biases (Paik et al., 2021). Patel and Pavlick (2022) present similar results for color terms and directionals. Liétard et al. (2021) show that the larger models are, the more isomorphic their representations of geographical place names are to maps of their physical location.'
'Unsupervised machine translation and unsupervised bilingual dictionary induction are evaluated over the full vocabulary, often with more than 85% precision. This indicates language models learn to represent concepts in ways that are not very language-specific. There is also evidence for near-isomorphisms with brain activity, across less constrained subsets of the vocabulary: (Wu et al., 2021), for example, show how brain activity patterns of individual words are encoded in a way that facilitates analogical reasoning. Such a property would in the limit entail that brain encodings are isomorphic to language model representations (Peng et al., 2020). Other research articles that seem to suggest that language model representations are generally isomorphic to brain activity patterns include (Mitchell et al., 2008; Søgaard, 2016; Wehbe et al., 2014; Pereira et al., 2018; Gauthier & Levy, 2019; Caucheteux & King, 2022).'
I'll probably write more about this soon.
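As a toy illustration of the alignment techniques the quote above mentions (Procrustes Analysis between near-isomorphic vector spaces), here's a sketch on synthetic data (the dimensions, noise level and seed-dictionary size are illustrative assumptions):

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(0)
n_words, dim = 1000, 300

# Space A (e.g. source-language or LM embeddings) and space B: a rotated, noisy copy,
# standing in for a near-isomorphic target space (other language, other modality, brain).
A = rng.standard_normal((n_words, dim))
true_rotation = np.linalg.qr(rng.standard_normal((dim, dim)))[0]
B = A @ true_rotation + 0.01 * rng.standard_normal((n_words, dim))

# With weak supervision on a seed dictionary of matched rows, Procrustes recovers the
# rotation; unsupervised variants bootstrap such a seed from distributional evidence.
R, _ = orthogonal_procrustes(A[:100], B[:100])
aligned = A @ R

# Precision@1 of nearest-neighbour retrieval on held-out rows (bilingual-dictionary-
# induction style evaluation).
test = slice(100, 200)
sims = aligned[test] @ B.T
p_at_1 = np.mean(sims.argmax(axis=1) == np.arange(100, 200))
print(f"precision@1 on held-out pairs: {p_at_1:.2f}")
```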
It might be useful to have a look at Language models show human-like content effects on reasoning, where they empirically test for human-like incoherences / biases in LMs performing some logical reasoning tasks (twitter summary thread; video presentation).
More evidence of something like world models in language models: Language models as agent models, Implicit Representations of Meaning in Neural Language Models
It might be interesting to think about whether there could be connections to the framing of corrections in robotics, e.g. “No, to the Right” – Online Language Corrections for Robotic Manipulation via Shared Autonomy.
Excited to see people thinking about this! Importantly, there's an entire ML literature out there to get evidence from and ways to [keep] study[ing] this empirically. Some examples of the existing literature (also see Path dependence in ML inductive biases and How likely is deceptive alignment?): Linear Connectivity Reveals Generalization Strategies - on fine-tuning path-dependence, The Grammar-Learning Trajectories of Neural Language Models (and many references in that thread), Let's Agree to Agree: Neural Networks Share Classification Order on Real Datasets - on pre-training path-dependence. I can probably find many more references through my bookmarks, if there's interest.
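For concreteness, here's a minimal sketch of the linear-interpolation check behind the fine-tuning path-dependence results (e.g. Linear Connectivity Reveals Generalization Strategies); the models and data below are synthetic stand-ins, and a real experiment would interpolate between two fine-tuning runs of the same LM:

```python
import copy
import torch
import torch.nn as nn

def make_model(seed: int) -> nn.Module:
    torch.manual_seed(seed)
    return nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))

X, y = torch.randn(512, 20), torch.randint(0, 2, (512,))
model_a, model_b = make_model(0), make_model(1)  # stand-ins for two fine-tuned runs

def interpolate(alpha: float) -> nn.Module:
    """Model with weights (1 - alpha) * model_a + alpha * model_b."""
    sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
    mixed = copy.deepcopy(model_a)
    mixed.load_state_dict({k: (1 - alpha) * sd_a[k] + alpha * sd_b[k] for k in sd_a})
    return mixed

# A loss barrier along the path indicates the two runs found different 'strategies';
# a flat path (linear mode connectivity) indicates they sit in the same basin.
loss_fn = nn.CrossEntropyLoss()
for alpha in [0.0, 0.25, 0.5, 0.75, 1.0]:
    with torch.no_grad():
        loss = loss_fn(interpolate(alpha)(X), y).item()
    print(f"alpha={alpha:.2f} loss={loss:.3f}")
```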
Some frameworks/models: 'How does in-context learning work? A framework for understanding the differences from traditional supervised learning', 'A Mathematical Exploration of Why Language Models Help Solve Downstream Tasks'
"The language model works with text. The language model remains the best interface I've ever used. It's user-friendly, composable, and available everywhere. It's easy to automate and easy to extend." - Text Is the Universal Interface
This seems related and might be useful to you, especially (when it comes to Natural Abstractions) the section 'Linking Behavior and Neural Representations': 'A mathematical theory of semantic development in deep neural networks'