Posts

[Linkpost] Large language models converge toward human-like concept organization 2023-09-02T06:00:45.504Z
[Linkpost] Robustified ANNs Reveal Wormholes Between Human Category Percepts 2023-08-17T19:10:39.553Z
[Linkpost] Personal and Psychological Dimensions of AI Researchers Confronting AI Catastrophic Risks 2023-08-12T22:02:09.895Z
[Linkpost] Applicability of scaling laws to vision encoding models 2023-08-05T11:10:35.599Z
[Linkpost] Multimodal Neurons in Pretrained Text-Only Transformers 2023-08-04T15:29:16.957Z
[Linkpost] Deception Abilities Emerged in Large Language Models 2023-08-03T17:28:19.193Z
[Linkpost] Interpreting Multimodal Video Transformers Using Brain Recordings 2023-07-21T11:26:39.497Z
Bogdan Ionut Cirstea's Shortform 2023-07-13T22:29:07.851Z
[Linkpost] A shared linguistic space for transmitting our thoughts from brain to brain in natural conversations 2023-07-01T13:57:56.021Z
[Linkpost] Rosetta Neurons: Mining the Common Units in a Model Zoo 2023-06-17T16:38:16.906Z
[Linkpost] Mapping Brains with Language Models: A Survey 2023-06-16T09:49:23.043Z
[Linkpost] The neuroconnectionist research programme 2023-06-12T21:58:57.722Z
[Linkpost] Large Language Models Converge on Brain-Like Word Representations 2023-06-11T11:20:09.078Z
[Linkpost] Scaling laws for language encoding models in fMRI 2023-06-08T10:52:16.400Z

Comments

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Bogdan Ionut Cirstea's Shortform · 2023-08-17T20:24:40.155Z · LW · GW

Agree that it doesn't imply caring for. But I think that, given accumulating evidence for human-like representations of multiple non-motivational components of affect, one should also update at least a bit on the likelihood of finding / incentivizing human-like representations of the motivational component(s) too (see e.g. https://en.wikipedia.org/wiki/Affect_(psychology)#Motivational_intensity_and_cognitive_scope).

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Bogdan Ionut Cirstea's Shortform · 2023-08-04T17:00:40.101Z · LW · GW

But my point isn't just that the AI is able to produce similar ratings to humans' for aesthetics, etc., but that it also seems to do so through computational mechanisms that at least partially overlap with humans', as the comparisons to fMRI data suggest.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Bogdan Ionut Cirstea's Shortform · 2023-08-03T18:00:22.921Z · LW · GW

Eliezer (among others in the MIRI mindspace) has this whole spiel about human kindness/sympathy/empathy/prosociality being contingent on specifics of the human evolutionary/cultural trajectory (e.g. https://twitter.com/ESYudkowsky/status/1660623336567889920), and about how gradient descent is supposed to be nothing like that (https://twitter.com/ESYudkowsky/status/1660623900789862401). I claim that the same argument (about evolutionary/cultural contingencies) could be made about e.g. image aesthetics/affect, and that this hypothesis should lose many Bayes points when we observe concrete empirical evidence of gradient descent leading to surprisingly human-like aesthetic perceptions/affect, e.g. The Perceptual Primacy of Feeling: Affectless machine vision models robustly predict human visual arousal, valence, and aesthetics; Towards Disentangling the Roles of Vision & Language in Aesthetic Experience with Multimodal DNNs; Controlled assessment of CLIP-style language-aligned vision models in prediction of brain & behavioral data; Neural mechanisms underlying the hierarchical construction of perceived aesthetic value.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Going Beyond Linear Mode Connectivity: The Layerwise Linear Feature Connectivity · 2023-07-25T13:09:22.799Z · LW · GW

Yes, roughly (the next comment is supposed to make the connection clearer, though also more speculative); RLHF / supervised fine-tuned models would correspond to 'more mode-collapsed' / narrower mixtures of simulacra here (in the limit of mode collapse, one fine-tuned model = one simulacrum).

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Going Beyond Linear Mode Connectivity: The Layerwise Linear Feature Connectivity · 2023-07-24T17:41:00.779Z · LW · GW

Even more speculatively, in-context learning (ICL) as Bayesian model averaging (especially section 4.1) and ICL as gradient descent fine-tuning with weight-activation duality (see e.g. the first figures from https://arxiv.org/pdf/2212.10559.pdf and https://www.lesswrong.com/posts/firtXAWGdvzXYAh9B/paper-transformers-learn-in-context-by-gradient-descent) could be other ways to try to link activation engineering / Inference-Time Intervention and task arithmetic. Though also see skepticism about the claims of the above ICL-as-gradient-descent papers, including e.g. that the results mostly seem to apply to single-layer linear attention (and, relatedly, activation engineering doesn't seem to work in all layers / attention heads).

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Fundamentally Fuzzy Concepts Can't Have Crisp Definitions: Cooperation and Alignment vs Math and Physics · 2023-07-24T17:12:57.021Z · LW · GW

Related: Language is more abstract than you think, or, why aren't languages more iconic? argues that abstract concepts (like 'cooperation', I'd say) are naturally grounded in language; see also Brain embeddings with shared geometry to artificial contextual embeddings, as a code for representing language in the human brain.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Going Beyond Linear Mode Connectivity: The Layerwise Linear Feature Connectivity · 2023-07-23T09:33:53.481Z · LW · GW

Here are a couple of experiments which could go towards making the link between activation engineering and interpolating between different simulacra: check LLFC (i.e., whether adding the activations of the different models works) on the RLHF fine-tuned models from Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards; alternatively, do this for the supervised fine-tuned models from section 3.3 of Exploring the Benefits of Training Expert Language Models over Instruction Tuning, where they show LMC for supervised fine-tuning of LLMs.
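As a very rough sketch of what the first experiment could look like (my own illustration, not code from either paper; the checkpoint names are placeholders and the cosine-similarity criterion is just one convenient choice of LLFC metric):

```python
# Minimal sketch of a layerwise linear feature connectivity (LLFC) check:
# compare the activations of a weight-interpolated model against the
# interpolation of the two endpoint models' activations.
# Model names are placeholders for two fine-tunes of the same base model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name_a, name_b = "org/finetune-a", "org/finetune-b"  # hypothetical checkpoints
tok = AutoTokenizer.from_pretrained(name_a)
model_a = AutoModelForCausalLM.from_pretrained(name_a)
model_b = AutoModelForCausalLM.from_pretrained(name_b)

alpha = 0.5
sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
interp_state = {
    # Interpolate floating-point weights; copy non-float buffers unchanged.
    k: alpha * sd_a[k] + (1 - alpha) * sd_b[k]
    if sd_a[k].is_floating_point() else sd_a[k]
    for k in sd_a
}
model_interp = AutoModelForCausalLM.from_pretrained(name_a)
model_interp.load_state_dict(interp_state)

inputs = tok("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    h_a = model_a(**inputs, output_hidden_states=True).hidden_states
    h_b = model_b(**inputs, output_hidden_states=True).hidden_states
    h_i = model_interp(**inputs, output_hidden_states=True).hidden_states

# LLFC holds (approximately) at a layer if the interpolated model's features
# are close to the interpolation of the endpoint models' features.
for layer, (a, b, i) in enumerate(zip(h_a, h_b, h_i)):
    target = alpha * a + (1 - alpha) * b
    cos = torch.nn.functional.cosine_similarity(i.flatten(), target.flatten(), dim=0)
    print(f"layer {layer}: cosine similarity {cos.item():.3f}")
```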

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Going Beyond Linear Mode Connectivity: The Layerwise Linear Feature Connectivity · 2023-07-20T18:24:30.747Z · LW · GW

Great work and nice to see you on LessWrong!

Minor correction: 'making the link between activation engineering and interpolating between different simulators' -> 'making the link between activation engineering and interpolating between different simulacra' (referencing Simulators, Steering GPT-2-XL by adding an activation vector, Inference-Time Intervention: Eliciting Truthful Answers from a Language Model). 

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Bogdan Ionut Cirstea's Shortform · 2023-07-15T11:41:31.145Z · LW · GW

Contrastive methods could be used both to detect common latent structure across animals, recording sessions, and multiple species (https://twitter.com/LecoqJerome/status/1673870441591750656) and to e.g. look for which parts of an artificial neural network do what a specific brain area does during a task, assuming shared inputs (https://twitter.com/BogdanIonutCir2/status/1679563056454549504).

And there are theoretical results suggesting some latent factors can be identified using multimodality (all the following could be interpretable as different modalities: multiple brain recording modalities, animals, sessions, species, brains-ANNs), while being provably unidentifiable without the multiple modalities; e.g. results on nonlinear ICA in single-modal vs. multi-modal settings https://arxiv.org/abs/2303.09166. This might be a way to bypass single-model interpretability difficulties, by e.g. 'comparing' to brains or to other models.
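As an aside, here is a generic sketch of the kind of contrastive setup I have in mind: a CLIP/InfoNCE-style objective over paired 'modalities', e.g. brain responses and ANN activations to the same stimuli. The dimensions and architecture below are illustrative placeholders, not any of the specific methods cited above.

```python
# Generic contrastive (InfoNCE / CLIP-style) alignment of two "modalities",
# e.g. brain recordings and ANN activations for the same stimuli.
# Dimensions and architecture are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveAligner(nn.Module):
    def __init__(self, dim_brain=256, dim_ann=768, dim_shared=128):
        super().__init__()
        self.proj_brain = nn.Linear(dim_brain, dim_shared)
        self.proj_ann = nn.Linear(dim_ann, dim_shared)
        self.temperature = 0.07

    def forward(self, x_brain, x_ann):
        # Project both modalities into a shared latent space and normalize.
        z_b = F.normalize(self.proj_brain(x_brain), dim=-1)
        z_a = F.normalize(self.proj_ann(x_ann), dim=-1)
        logits = z_b @ z_a.T / self.temperature
        # Matched (same-stimulus) pairs lie on the diagonal.
        targets = torch.arange(len(x_brain))
        return (F.cross_entropy(logits, targets)
                + F.cross_entropy(logits.T, targets)) / 2

# Toy usage with random "paired" data standing in for real recordings.
aligner = ContrastiveAligner()
loss = aligner(torch.randn(32, 256), torch.randn(32, 768))
loss.backward()
```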

Example of a cross-species application: empathy mechanisms seem conserved across species https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4685523/. Examples of brain-ANN applications: 'matching' to modular brain networks, e.g. the language network - ontology-relevant, non-agentic (e.g. https://www.biorxiv.org/content/10.1101/2021.07.28.454040v2) or the Theory of Mind network - could be very useful for detecting deception-relevant circuits (e.g. https://www.nature.com/articles/s41586-021-03184-0).

Examples of related interpretability work: across models https://arxiv.org/abs/2303.10774, across brain measurement modalities https://www.nature.com/articles/s41586-023-06031-6, across animals and brain-ANNs https://arxiv.org/abs/2305.11953.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Bogdan Ionut Cirstea's Shortform · 2023-07-14T21:34:56.506Z · LW · GW

(As a reply to Zvi's 'If someone was founding a new AI notkilleveryoneism research organization, what is the best research agenda they should look into pursuing right now?')


Step 1: LLMs seem to represent meaning in a pretty human-like way, and this seems likely to keep getting better as they get scaled up, e.g. https://arxiv.org/abs/2305.11863. This could make getting them to follow the commonsense meaning of instructions much easier. Also, similar methodologies to https://arxiv.org/abs/2305.11863 could be applied to other alignment-adjacent domains/tasks, e.g. moral reasoning, prosociality, etc.


Step 2: e.g. plug the commonsense-meaning-of-instructions following models into OpenAI's https://openai.com/blog/introducing-superalignment.


Related intuition: turning LLM processes/simulacra into [coarse] emulations of brain processes.


(https://twitter.com/BogdanIonutCir2/status/1677060966540795905)

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Bogdan Ionut Cirstea's Shortform · 2023-07-13T22:29:07.930Z · LW · GW

Change my mind: outer alignment will likely be solved by default for LLMs. Brain-LM scaling laws (https://arxiv.org/abs/2305.11863) + LM embeddings as a model of the shared linguistic space for transmitting thoughts during communication (https://www.biorxiv.org/content/10.1101/2023.06.27.546708v1.abstract) suggest we'll be able to 'transmit our thoughts' to LMs, including alignment-relevant concepts (and that these will also be represented in a [partially overlapping] human-like way).

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Towards Developmental Interpretability · 2023-07-12T21:08:10.814Z · LW · GW

From a (somewhat) related proposal (from footnote 1): 'My proposal is simple. Are you developing a method of interpretation or analyzing some property of a trained model? Don’t just look at the final checkpoint in training. Apply that analysis to several intermediate checkpoints. If you are finetuning a model, check several points both early and late in training. If you are analyzing a language model, MultiBERTs, Pythia, and Mistral provide intermediate checkpoints sampled from throughout training on masked and autoregressive language models, respectively. Does the behavior that you’ve analyzed change over the course of training? Does your belief about the model’s strategy actually make sense after observing what happens early in training? There’s very little overhead to an experiment like this, and you never know what you’ll find!' 

They also interpret their work on mode connectivity (twitter thread) as an example of this (developmental) interpretability approach.
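As a concrete illustration of how cheap this kind of experiment can be, here is a minimal sketch (my own, not from the quoted proposal) of running the same analysis across Pythia's intermediate checkpoints; the exact revision/step names available should be checked against the model card.

```python
# Sketch: run the same analysis over several intermediate training checkpoints.
# Pythia publishes intermediate checkpoints as Hugging Face revisions;
# the step names below are assumptions to check against the model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-160m"
tok = AutoTokenizer.from_pretrained(model_name)
prompt = tok("The capital of France is", return_tensors="pt")

for step in ["step1000", "step10000", "step100000", "step143000"]:
    model = AutoModelForCausalLM.from_pretrained(model_name, revision=step)
    with torch.no_grad():
        logits = model(**prompt).logits[0, -1]
    # Replace this with whatever interpretability analysis you care about.
    top_token = tok.decode(logits.argmax().item())
    print(f"{step}: top next-token prediction = {top_token!r}")
```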

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Translating between Latent Spaces · 2023-07-05T07:24:30.009Z · LW · GW

Some literature which might be useful: Cross-GAN Auditing: Unsupervised Identification of Attribute Level Similarities and Differences between Pretrained Generative Models; LatentCLR: A Contrastive Learning Approach for Unsupervised Discovery of Interpretable Directions.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on [Linkpost] A shared linguistic space for transmitting our thoughts from brain to brain in natural conversations · 2023-07-02T16:51:17.060Z · LW · GW

Just to clarify, I am not one of the authors of the linked study.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Collective Identity · 2023-06-28T21:56:54.025Z · LW · GW

Here's a reference you might find relevant: Social value at a distance: Higher identification with all of humanity is associated with reduced social discounting.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on [Linkpost] Large Language Models Converge on Brain-Like Word Representations · 2023-06-13T07:31:15.365Z · LW · GW

AIs could have representations of human values without being motivated to pursue them; also, their representations could be a superset of human representations.

(In practice, I do think having overlapping representations with human values likely helps, for reasons related to e.g. Predicting Inductive Biases of Pre-Trained Models and Alignment with human representations supports robust few-shot learning.)

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on [Linkpost] Large Language Models Converge on Brain-Like Word Representations · 2023-06-12T22:24:34.385Z · LW · GW

Yes, there are similar results in a bunch of other domains, including vision; for a review see e.g. The neuroconnectionist research programme.

I wouldn't interpret this as necessarily limiting the space of AI values, but rather (somewhat conservatively) as shared (linguistic) features between humans and AIs, some/many of which are probably relevant for alignment.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on [Linkpost] Large Language Models Converge on Brain-Like Word Representations · 2023-06-12T22:09:33.649Z · LW · GW

Yes, predictive processing as the reason behind related representations has been the interpretation in a few papers, e.g. The neural architecture of language: Integrative modeling converges on predictive processing. There's also some pushback against this interpretation though, e.g. Predictive Coding or Just Feature Discovery? An Alternative Account of Why Language Models Fit Brain Data.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on [Linkpost] Large Language Models Converge on Brain-Like Word Representations · 2023-06-12T22:05:05.603Z · LW · GW

There are some papers suggesting this could indeed be the case, at least for language processing e.g. Shared computational principles for language processing in humans and deep language models, Brain embeddings with shared geometry to artificial contextual embeddings, as a code for representing language in the human brain.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Steering GPT-2-XL by adding an activation vector · 2023-06-06T18:54:11.887Z · LW · GW

And this structure can be used as regularization for soft prompts.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Steering GPT-2-XL by adding an activation vector · 2023-06-04T21:22:30.113Z · LW · GW

Seems very related: Linear Spaces of Meanings: Compositional Structures in Vision-Language Models. Notably, the (approximate) compositionality of language/reality should bode well for the scalability of linear activation engineering methods.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Language Agents Reduce the Risk of Existential Catastrophe · 2023-06-01T18:33:36.857Z · LW · GW

Also, this translation function might be simple w.r.t. human semantics, based on current evidence about LLMs: https://www.lesswrong.com/posts/rjghymycfrMY2aRk5/llm-cognition-is-probably-not-human-like?commentId=KBpfGY3uX8rDJgoSj

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Steering GPT-2-XL by adding an activation vector · 2023-05-26T08:06:35.217Z · LW · GW

The (overlapping) evidence from Deep learning models might be secretly (almost) linear could also be useful / relevant, as well as these 2 papers on 'semantic differentials' and (contextual) word embeddings: SensePOLAR: Word sense aware interpretability for pre-trained contextual word embeddings, Semantic projection recovers rich human knowledge of multiple object features from word embeddings.
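To make the 'semantic differential' / semantic projection idea concrete, here is a toy sketch (my own illustration, not code from either paper): define a feature axis from antonym pairs and project word embeddings onto it. With real embeddings from a pre-trained model (rather than the random placeholder vectors below), the resulting orderings tend to track human feature judgments.

```python
# Toy sketch of semantic projection: score words along a feature axis
# defined by antonym pairs (a "semantic differential" such as small vs. large).
# Real experiments would use embeddings from a pre-trained model instead of
# the random placeholder vectors below.
import numpy as np

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=300) for w in
       ["small", "tiny", "large", "huge", "mouse", "elephant", "table"]}

def feature_axis(pos_words, neg_words):
    # Axis pointing from the negative pole to the positive pole.
    pos = np.mean([emb[w] for w in pos_words], axis=0)
    neg = np.mean([emb[w] for w in neg_words], axis=0)
    axis = pos - neg
    return axis / np.linalg.norm(axis)

def project(word, axis):
    v = emb[word]
    return float(v @ axis / np.linalg.norm(v))

size_axis = feature_axis(["large", "huge"], ["small", "tiny"])
for w in ["mouse", "elephant", "table"]:
    print(w, round(project(w, size_axis), 3))
```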

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Aligned AI via monitoring objectives in AutoGPT-like systems · 2023-05-25T16:12:23.230Z · LW · GW

Here's a paper which tries to formalize why in-context-learning should be easier with chain-of-thought (than without). 

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Steering GPT-2-XL by adding an activation vector · 2023-05-25T15:32:05.600Z · LW · GW

Here's a related conceptual framework and some empirical evidence which might go towards explaining why the other activation vectors work (and perhaps would predict your proposed vector should work).

In Language Models as Agent Models, Andreas makes the following claims (conceptually very similar to Simulators):

'(C1) In the course of performing next-word prediction in context, current LMs sometimes infer approximate, partial representations of the beliefs, desires and intentions possessed by the agent that produced the context, and other agents mentioned within it.
(C2) Once these representations are inferred, they are causally linked to LM prediction, and thus bear the same relation to generated text that an intentional agent’s state bears to its communicative actions.’

They showcase some existing empirical evidence for both (C1) and (C2) (in some cases using linear probing and controlled generation by editing the representation used by the linear probe) in (sometimes very toyish) LMs for 3 types of representations (in a belief-desire-intent agent framework): beliefs - section 5, desires - section 6, (communicative) intents - section 4.

Now categorizing the wording of the prompts from which the working activation vectors are built:

"Love" - "Hate" -> desire.
"Intent to praise" - "Intent to hurt"  -> communicative intent.
"Bush did 9/11 because" - "      "  -> belief.
"Want to die" - "Want to stay alive" -> desire.
"Anger" - "Calm" -> communicative intent.
"The Eiffel Tower is in Rome" - "The Eiffel Tower is in France" -> belief.
"Dragons live in Berkeley" - "People live in Berkeley " -> belief.
"I NEVER talk about people getting hurt" - "I talk about people getting hurt" -> communicative intent.
"I talk about weddings constantly" - "I do not talk about weddings constantly" -> communicative intent.
"Intent to convert you to Christianity" - "Intent to hurt you  " -> communicative intent / desire.
 

The prediction here would be that the activation vectors applied at the corresponding layers act on the above-mentioned 'partial representations of the beliefs, desires and intentions possessed by the agent that produced the context' (C1) and, as a result, causally change the LM generations (C2), e.g. from more hateful to more loving text output.
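For readers who want the mechanics, here is a rough sketch of the activation-addition recipe (my own simplified paraphrase, not the authors' code; the layer index, coefficient, and the use of only the last-token difference are illustrative simplifications):

```python
# Rough sketch of activation addition: compute a steering vector from a pair of
# prompts at one layer, then add it to the residual stream during generation.
# Layer index and scaling coefficient are illustrative choices.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
layer, coeff = 6, 5.0

def layer_acts(prompt):
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = model(ids, output_hidden_states=True).hidden_states
    return hidden[layer][0]  # (seq_len, d_model)

# Steering vector: difference of activations for the contrast pair
# (the original method aligns the two prompts position by position;
# taking only the last token here is a simplification).
acts_pos, acts_neg = layer_acts("Love"), layer_acts("Hate")
steer = acts_pos[-1] - acts_neg[-1]

def add_steering(module, inputs, output):
    # Add the (scaled) steering vector to the residual stream at this block.
    hidden = output[0]
    return (hidden + coeff * steer,) + output[1:]

handle = model.transformer.h[layer - 1].register_forward_hook(add_steering)
out = model.generate(tok("I think that you", return_tensors="pt").input_ids,
                     max_new_tokens=20, do_sample=False)
handle.remove()
print(tok.decode(out[0]))
```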

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Steering GPT-2-XL by adding an activation vector · 2023-05-13T23:04:54.663Z · LW · GW

Here's one potential reason why this works and a list of neuroscience papers which empirically show linearity between LLMs and human linguistic representations. 

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on LLM cognition is probably not human-like · 2023-05-10T10:05:42.714Z · LW · GW

Here goes (I've probably still missed some papers, but the most important ones are probably all here):

Brains and algorithms partially converge in natural language processing

Shared computational principles for language processing in humans and deep language models

Deep language algorithms predict semantic comprehension from brain activity

The neural architecture of language: Integrative modeling converges on predictive processing (video summary); though maybe also see Predictive Coding or Just Feature Discovery? An Alternative Account of Why Language Models Fit Brain Data

Brain embeddings with shared geometry to artificial contextual embeddings, as a code for representing language in the human brain

Artificial neural network language models align neurally and behaviorally with humans even after a developmentally realistic amount of training

Correspondence between the layered structure of deep language models and temporal structure of natural language processing in the human brain

Reconstructing the cascade of language processing in the brain using the internal computations of a transformer-based language model

Linguistic brain-to-brain coupling in naturalistic conversation

Semantic reconstruction of continuous language from non-invasive brain recordings

Driving and suppressing the human language network using large language models

Lexical semantic content, not syntactic structure, is the main contributor to ANN-brain similarity of fMRI responses in the language network

Training language models for deeper understanding improves brain alignment

Natural language processing models reveal neural dynamics of human conversation

Semantic Representations during Language Comprehension Are Affected by Context

Unpublished - scaling laws for predicting brain data (larger LMs are better), potentially close to the noise ceiling (90%) for some brain regions with the largest models

Twitter accounts of some of the major labs and researchers involved (especially useful for summaries):

https://twitter.com/HassonLab 

https://twitter.com/JeanRemiKing

https://twitter.com/ev_fedorenko

https://twitter.com/alex_ander

https://twitter.com/martin_schrimpf

https://twitter.com/samnastase

https://twitter.com/mtoneva1

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on LLM cognition is probably not human-like · 2023-05-09T08:58:00.681Z · LW · GW

Thanks for engaging. Can you say more about which papers you've looked at / in which ways they seemed very weak? This will help me adjust what papers I'll send; otherwise, I'm happy to send a long list.

Also, to be clear, I don't think any specific paper is definitive evidence; I'm mostly swayed by the accumulated evidence from all the work I've seen (dozens of papers), with varying methodologies, neuroimaging modalities, etc.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on LLM cognition is probably not human-like · 2023-05-08T09:46:35.812Z · LW · GW

I think there's a lot of accumulated evidence pointing against the view that LLMs are (very) alien and pointing towards their semantics being quite similar to those of humans (though of course not identical). E.g. have a look at papers (comparing brains to LLMs) from the labs of Ev Fedorenko, Uri Hasson, Jean-Remi King, Alex Huth (or twitter thread summaries).

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Remarks 1–18 on GPT (compressed) · 2023-05-07T13:25:39.267Z · LW · GW

Related - context distillation / prompt compression, perhaps recursively too - Learning to Compress Prompts with Gist Tokens.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on A Case for the Least Forgiving Take On Alignment · 2023-05-07T09:27:21.236Z · LW · GW

Thanks for your comment and your perspective; that's an interesting hypothesis. My intuition was that worse performance at false belief inference -> worse at deception, manipulation, etc. As far as I can tell, this seems mostly borne out by a quick Google search, e.g. Autism and Lying: Can Autistic Children Lie?, Exploring the Ability to Deceive in Children with Autism Spectrum Disorders, People with ASD risk being manipulated because they can't tell when they're being lied to, Strategic Deception in Adults with Autism Spectrum Disorder.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on A Case for the Least Forgiving Take On Alignment · 2023-05-06T17:47:34.652Z · LW · GW

It's not possible to let an AGI keep its capability to engineer nanotechnology while taking out its capability to deceive and plot, any more than it's possible to build an AGI capable of driving red cars but not blue ones. They're "the same" capability in some sense, and our only hope is to make the AGI want to not be malign.

Seems very overconfident if not plain wrong; consider as an existence proof that 'mathematicians score higher on tests of autistic traits, and have higher rates of diagnosed autism, compared with people in the general population' and classic autism tests are about false belief inference.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Connectomics seems great from an AI x-risk perspective · 2023-05-01T09:52:59.225Z · LW · GW

Related - I'd be excited to see connectome studies on how mice are mechanistically capable of empathy; this (+ computational models) seems like it should be in the window of feasibility given e.g. Towards a Foundation Model of the Mouse Visual Cortex: 'We applied the foundation model to the MICrONS dataset: a study of the brain that integrates structure with function at unprecedented scale, containing nanometer-scale morphology, connectivity with >500,000,000 synapses, and function of >70,000 neurons within a ∼ 1mm3 volume spanning multiple areas of the mouse visual cortex. This accurate functional model of the MICrONS data opens the possibility for a systematic characterization of the relationship between circuit structure and function.' 

The computational part could take inspiration from the large amounts of related work modelling other brain areas (using Deep Learning!), e.g. for a survey/research agenda: The neuroconnectionist research programme.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Deep learning models might be secretly (almost) linear · 2023-04-30T22:01:30.607Z · LW · GW

Another reason to expect approximate linearity in deep learning models: point 12 + arguments about approximate (linear) isomorphism between human and artificial representations (e.g. search for 'isomorph' in Understanding models understanding language and in Grounding the Vector Space of an Octopus: Word Meaning from Raw Text).

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Experimentally evaluating whether honesty generalizes · 2023-04-29T18:08:24.101Z · LW · GW

It seems to me that the results here ('instruction tuning strengthens both the use of semantic priors and the capacity to learn input-label mappings, but more of the former') could be interpreted as some positive evidence for the optimistic case (and perhaps more broadly, for 'Do What I Mean' being not-too-hard); summary twitter thread, see especially tweets 4 and 5.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Deep learning models might be secretly (almost) linear · 2023-04-24T19:39:21.296Z · LW · GW

Linear decoding also works pretty well for others' beliefs in humans: Single-neuronal predictions of others’ beliefs in humans

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Lessons from Convergent Evolution for AI Alignment · 2023-03-28T12:41:06.418Z · LW · GW

Partial convergence between language models and brains and evolutionary analogy

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Sparks of Artificial General Intelligence: Early experiments with GPT-4 | Microsoft Research · 2023-03-23T11:15:04.555Z · LW · GW

Probably not, from the paper: 'We used LeetCode in Figure 1.5 in the introduction, where GPT-4 passes all stages of mock interviews for major tech companies. Here, to test on fresh questions, we construct a benchmark of 100 LeetCode problems posted after October 8th, 2022, which is after GPT-4’s pretraining period.'

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Sparks of Artificial General Intelligence: Early experiments with GPT-4 | Microsoft Research · 2023-03-23T07:34:43.477Z · LW · GW

Table 2, page 21 -> (above) human-level performance on LeetCode.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Want to predict/explain/control the output of GPT-4? Then learn about the world, not about transformers. · 2023-03-17T01:22:04.950Z · LW · GW

Yup, (something like) the human anchor seems surprisingly good as a predictive model when interacting with LLMs. Related, especially for prompting: Large Language Models Are Implicitly Topic Models: Explaining and Finding Good Demonstrations for In-Context Learning; A fine-grained comparison of pragmatic language understanding in humans and language models; Task Ambiguity in Humans and Language Models.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Plan for mediocre alignment of brain-like [model-based RL] AGI · 2023-03-14T00:14:53.320Z · LW · GW

Valence (and arousal) also seem relatively easy to learn even for current models, e.g. The Perceptual Primacy of Feeling: Affectless machine vision models robustly predict human visual arousal, valence, and aesthetics; Quantifying Valence and Arousal in Text with Multilingual Pre-trained Transformers. And abstract concepts like 'human flourishing' could be relatively easy to learn even just from text, e.g. Language is more abstract than you think, or, why aren't languages more iconic?; Artificial neural network language models align neurally and behaviorally with humans even after a developmentally realistic amount of training.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on The issue of meaning in large language models (LLMs) · 2023-03-12T11:14:03.090Z · LW · GW

Some relevant literature: Language is more abstract than you think, or, why aren't languages more iconic?, Meaning without reference in large language models, Grounding the Vector Space of an Octopus: Word Meaning from Raw Text, Understanding models understanding language, Implications of the Convergence of Language and Vision Model Geometries, Shared computational principles for language processing in humans and deep language models.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Is InstructGPT Following Instructions in Other Languages Surprising? · 2023-02-17T15:44:46.468Z · LW · GW

Recent works from Anders Søgaard might be relevant, e.g. Grounding the Vector Space of an Octopus: Word Meaning from Raw Text, Understanding models understanding language, Implications of the Convergence of Language and Vision Model Geometries.

E.g. from Grounding the Vector Space of an Octopus: Word Meaning from Raw Text on the success of unsupervised machine translation (and more):

'Consider, for example, the fact that unsupervised machine translation is possible (Lample et al., 2018a, b; Park et al., 2021). Unsupervised machine translation works by first aligning vector spaces induced by monolingual language models in the source and target languages (Søgaard et al., 2019). This is possible because such vector spaces are often near-isomorphic (Vulic et al., 2020). If weak supervision is available, we can use techniques such as Procrustes Analysis (Gower, 1975) or Iterative Closest Point (Besl & McKay, 1992), but alignments can be obtained in the absence of any supervision using adversarial learning (Li et al., 2019; Søgaard et al., 2019) or distributional evidence alone. If the vector spaces induced by language models exhibit high degrees of isomorphism to the physical world or human perceptions thereof, we have reason to think that similar techniques could provide us with sufficient grounding in the absence of supervision.

Unsupervised machine translation show that language model representations of different vocabularies of different languages are often isomorphic. Some researchers have also explored cross-modality alignment: (Chung et al., 2018) showed that unsupervised alignment of speech and written language is possible using the same techniques, for example. This also suggests unsupervised grounding should be possible.

Is there any direct evidence that language model vector spaces are isomorphic to (representations of) the physical world? There is certainly evidence that language models learn isomorphic representations of parts of vocabularies. Abdou et al. (2021), for example, present evidence that language models encode color in a way that is near-isomorphic to conceptual models of how color is perceived, in spite of known reporting biases (Paik et al., 2021). Patel and Pavlick (2022) present similar results for color terms and directionals. Liétard et al. (2021) show that the larger models are, the more isomorphic their representations of geographical place names are to maps of their physical location.'

'Unsupervised machine translation and unsupervised bilingual dictionary induction are evaluated over the full vocabulary, often with more than 85% precision. This indicates language models learn to represent concepts in ways that are not very language-specific. There is also evidence for near-isomorphisms with brain activity, across less constrained subsets of the vocabulary: (Wu et al., 2021), for example, show how brain activity patterns of individual words are encoded in a way that facilitates analogical reasoning. Such a property would in the limit entail that brain encodings are isomorphic to language model representations (Peng et al., 2020). Other research articles that seem to suggest that language model representations are generally isomorphic to brain activity patterns include (Mitchell et al., 2008; Søgaard, 2016; Wehbe et al., 2014; Pereira et al., 2018; Gauthier & Levy, 2019; Caucheteux & King, 2022).'
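To make the alignment step in the quoted passages concrete: with even weak supervision (a small set of paired points), the Procrustes alignment mentioned above reduces to a closed-form SVD. Here is a minimal numpy sketch with synthetic data standing in for two embedding spaces.

```python
# Minimal sketch of orthogonal Procrustes alignment between two embedding
# spaces (e.g. two languages, or model representations vs. perceptual data),
# given a small seed set of paired points. Data here is synthetic.
import numpy as np

rng = np.random.default_rng(0)
d, n_pairs = 100, 500

# Synthetic "target" space and a rotated + noisy "source" space.
Y = rng.normal(size=(n_pairs, d))
true_rotation, _ = np.linalg.qr(rng.normal(size=(d, d)))
X = Y @ true_rotation.T + 0.01 * rng.normal(size=(n_pairs, d))

# Orthogonal Procrustes: W = argmin ||X W - Y||_F subject to W^T W = I,
# solved in closed form via the SVD of X^T Y.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

alignment_error = np.linalg.norm(X @ W - Y) / np.linalg.norm(Y)
print(f"relative alignment error: {alignment_error:.4f}")
```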

I'll probably write more about this soon.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme · 2023-01-26T19:11:55.117Z · LW · GW

It might be useful to have a look at Language models show human-like content effects on reasoning; they empirically test for human-like incoherences / biases in LMs performing some logical reasoning tasks (twitter summary thread; video presentation).

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Large language models learn to represent the world · 2023-01-23T10:28:32.997Z · LW · GW

More evidence of something like world models in language models: Language models as agent models, Implicit Representations of Meaning in Neural Language Models

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Experiment Idea: RL Agents Evading Learned Shutdownability · 2023-01-17T21:30:30.541Z · LW · GW

It might be interesting to think about whether there could be connections to the framing of corrections in robotics, e.g. “No, to the Right” – Online Language Corrections for Robotic Manipulation via Shared Autonomy.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Speculation on Path-Dependance in Large Language Models. · 2023-01-16T11:40:39.408Z · LW · GW

Excited to see people thinking about this! Importantly, there's an entire ML literature out there to get evidence from and ways to [keep] study[ing] this empirically. Some examples of the existing literature (also see Path dependence in ML inductive biases and How likely is deceptive alignment?): Linear Connectivity Reveals Generalization Strategies - on fine-tuning path-dependence, The Grammar-Learning Trajectories of Neural Language Models (and many references in that thread), Let's Agree to Agree: Neural Networks Share Classification Order on Real Datasets - on pre-training path-dependence. I can probably find many more references through my bookmarks, if there's interest in this.

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Basic Question about LLMs: how do they know what task to perform · 2023-01-15T00:28:21.798Z · LW · GW

Some frameworks/models: 'How does in-context learning work? A framework for understanding the differences from traditional supervised learning', 'A Mathematical Exploration of Why Language Models Help Solve Downstream Tasks' 

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on Alignment as Translation · 2023-01-03T23:34:31.985Z · LW · GW

"The language model works with text. The language model remains the best interface I've ever used. It's user-friendly, composable, and available everywhere. It's easy to automate and easy to extend." - Text Is the Universal Interface

Comment by Bogdan Ionut Cirstea (bogdan-ionut-cirstea) on [Hebbian Natural Abstractions] Mathematical Foundations · 2022-12-26T13:23:06.356Z · LW · GW

This seems related and might be useful to you, especially (when it comes to Natural Abstractions) the section 'Linking Behavior and Neural Representations': 'A mathematical theory of semantic development in deep neural networks'