The Compleat Cybornaut

ukc10014

post by ukc10014, Jozdien, NicholasKees (nick_kees) · 2023-05-19T08:44:38.274Z · LW · GW · 2 comments

  So what is cyborgism?
  Background Concepts
    Why is in-context learning relevant?
    What’s the deal with base models? 
    Why are we talking about the “multiverse”?
    Why these frames?
    What is meant by “model psychology”?
  Practical cyborgism
    What are ‘variance friendly’ and ‘variance hostile’ domains and why is the distinction important?
    GPTs augment sapience, they don’t cure ignorance
    What is (are) Loom(s)?
    Ways of looming
    Some considerations re experimental setup
      Dialogue or monologue?
      Prompt once or throughout the conversation?
      Pruning is implicit conditioning
      Base to explore / RLHF to exploit
      Model-written prompts
      Looming parameters
    What is mode collapse and why does it suck?
  Example: situational awareness
  Conclusion
    Any specific relevance to alignment research?
    What’s the connection to evals?
    Will this matter under GPT-n?

A cluster [? · GW] of conceptual frameworks and research programmes have coalesced around a 2022 post by janus [LW · GW], which introduced language models as ‘simulators’ (of other types of AIs such as agents, oracles, or genies). One such agenda, cyborgism [LW · GW], was coined in a post by janus and Nicholas Kees and is being researched as part of the 2023 editions of AI Safety Camp and SERI MATS. The objective of this document is to provide an on-ramp to the topic, one that is hopefully accessible to people not hugely familiar with simulator theory or language models.

So what is cyborgism?

Cyborgism proposes to use AIs, particularly language models (i.e. generative-pretrained transformers or GPTs), in ways that exploit their (increasingly) general-purpose intelligence, while retaining human control over the ‘dangerous bits’ of AI – i.e. agency, planning, and goal-formation. The overall objective is to leverage human cognitive ability [LW · GW] while minimising the risks associated with agentic [? · GW] AI.

Aside from agency, a core assertion of cyborgism is that certain commonly-used language models are not well-suited [LW · GW] to many tasks human users throw at them, but that humans, if appropriately-trained and equipped, might more effectively use GPTs in ways that are ‘natural’ [LW · GW] for the model, while dramatically increasing the productive and creative potential of the human.

Specifically, some current systems, such as ChatGPT, are released or predominantly used in a ‘tuned’ version, which has a host of shortcomings [AF · GW].^[1] One such tuning method, reinforcement learning from human feedback (RLHF), has a specific weakness relevant to cyborgism: the tuning process severely limits, or collapses, a valuable aspect of the GPT, namely its wild, unconstrained creativity.

Superficially, the cyborgism approach may resemble a human-plus-oracle setup, but there is a subtle and important distinction: an oracle [LW · GW], it is argued [AF · GW], might ‘smuggle in’ some of the trappings [LW · GW] of an agent.^[2] In contrast, the human cyborg embeds the output of the language model into their own workflow and thinking - model and human work as an integrated system. The cyborg leverages the model’s creative, albeit non-agentic, potential while continuously ‘steering’ or ‘course-correcting’ the model to ensure its output remains relevant to the actual goal. However, cyborgism might entail a high alignment tax [? · GW]: absent appropriate workflows and tools, a setup consisting of a human plus non-agentic GPT might be considerably less productive than a purely agentic AI (as the human component becomes a bottleneck).

Background Concepts

Before getting into practical cyborgism, it is helpful to summarize some relevant theories and intuitions about how language models work.

Why is in-context learning relevant?

Neural networks generally, and language models specifically, go through several types of training: the large-scale (in terms of compute, time, and data) pre-training when all the neural weights are set in an end-to-end optimisation process; one or more fine-tuning rounds to focus the model on a specific use domain (during which the weights also change); and, in the case of certain models, including GPT-4, ChatGPT, and text-davinci-003, various types of supplementary tuning, which in the case of GPT-4 seems to include RLHF and rule-based reward modelling (RBRM [LW(p) · GW(p)]).

The final phase of ‘training’, known as ‘in-context learning’, happens during the session with the user. It doesn’t involve actual changes in neural weights, but it does still significantly alter the type of output the model generates, based on the accumulated context of its interaction with a user in a given session. The mechanisms by which this happens are debated, but from a cyborgism perspective, the context provides a powerful [LW · GW] way of guiding or controlling the model’s generations.^[3]
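
As a rough illustration of steering through context alone, the sketch below assembles a few-shot prompt. The function name `build_few_shot_prompt` and the `Q:`/`A:` layout are assumptions made for this example, not part of any particular API; the point is only that the ‘learning’ lives in the text sent to the model, not in its weights.

```python
def build_few_shot_prompt(examples, query, instruction=""):
    """Assemble a prompt whose examples demonstrate the desired behaviour.

    Nothing here touches model weights: the 'learning' lives entirely in
    the text that would be sent to the model.
    """
    parts = [instruction] if instruction else []
    for question, answer in examples:
        parts.append(f"Q: {question}\nA: {answer}")
    # Leave the final answer open so the model continues the pattern.
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)


# Hypothetical demonstration data.
examples = [("cold", "hot"), ("tall", "short")]
prompt = build_few_shot_prompt(examples, "early", "Give the opposite of each word.")
print(prompt)
```

Swapping the examples (say, for synonyms rather than opposites) would change what a model is likely to produce for the same final query, without any retraining.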

What’s the deal with base models?

Base models are the versions of large language models before any kind of fine-tuning or other augmentation is applied. As such, they have been trained entirely to perform next token prediction on an extremely large and diverse set of text data (typically a web-scrape containing a significant fraction of the publicly accessible internet). All of the skill and knowledge representations acquired during this pre-training phase were formed because they were in some way instrumental for reducing the prediction error on this dataset, but crucially these models were never explicitly trained to be helpful or useful. This makes them capable of generating an extremely broad range of text, but also makes it really hard to elicit specific behavior from [LW · GW] them, requiring a lot of careful prompt engineering to control.

While these difficulties make them hard to use at first, they are often strongly preferred by people in the cyborgism community over fine-tuned models, for their breadth, creativity, and flexibility.

Why are we talking about the “multiverse”?

As an intuition pump, janus suggests the physics concept of a multiverse, i.e. the Everettian (or ‘many-worlds’) interpretation of quantum mechanics, as an analogy for what is happening when a user causes a GPT to generate text. GPT can be thought of as a ‘time evolution operator’ which takes in a state (all tokens in the context so far) and returns a probability distribution over next states (current tokens plus a single token completion). By rolling out a trajectory, iteratively sampling from these probability distributions, we are exploring a single branch of a multiverse implied by the model.

The ‘learning’ encoded into a GPT, whether those are neural weights or higher-level abstractions (e.g. ‘representations’, ‘concepts’, ‘goals’, etc.), can be thought of as a generative algorithm which can produce a (potentially very large) set of possible token-sequences, where: a) such sequences are probabilistically consistent with the training data (and any subsequent tuning), b) such ‘sequences’ can be big: on the scale of many pages over which they can often maintain modest conceptual and stylistic coherence, c) tokens in such sequences are generated auto-regressively, i.e. conditioned on the initial and any subsequent user prompts as well as the model’s responses to these prompts (the context). This can be visualized as a tree of possibilities, each branch of which is, in any given rollout (i.e. interaction sequence with a user), mutually exclusive with respect to all other branches.
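
To make the ‘time evolution operator’ and the tree of branches concrete, here is a toy sketch. `TOY_MODEL` is a made-up lookup table that conditions only on the last token (a real GPT conditions on the whole context), and all function names are assumptions for illustration. `rollout` samples a single branch; `enumerate_branches` walks the whole (tiny) multiverse.

```python
import random

# Hypothetical toy 'model': maps the last token to a distribution over next tokens.
TOY_MODEL = {
    "<start>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.3, "bird": 0.7},
    "cat": {"sleeps": 0.8, "<end>": 0.2},
    "dog": {"barks": 1.0},
    "bird": {"sings": 1.0},
    "sleeps": {"<end>": 1.0},
    "barks": {"<end>": 1.0},
    "sings": {"<end>": 1.0},
}


def next_token_distribution(context):
    """The 'time evolution operator': state in, distribution over next tokens out."""
    return TOY_MODEL[context[-1]]


def rollout(context, rng, max_steps=10):
    """Sample one branch of the multiverse by repeatedly sampling a next token."""
    context = list(context)
    for _ in range(max_steps):
        dist = next_token_distribution(context)
        tokens, probs = zip(*dist.items())
        token = rng.choices(tokens, weights=probs, k=1)[0]
        context.append(token)  # the sampled token becomes part of the next state
        if token == "<end>":
            break
    return context


def enumerate_branches(context, prob=1.0):
    """Walk the whole tree, yielding every complete branch with its probability."""
    if context[-1] == "<end>":
        yield context, prob
        return
    for token, p in next_token_distribution(context).items():
        yield from enumerate_branches(context + [token], prob * p)


rng = random.Random(0)
print(rollout(["<start>"], rng))
branches = list(enumerate_branches(["<start>"]))
for branch, p in branches:
    print(round(p, 3), " ".join(branch[1:-1]))
```

Exhaustive enumeration is only possible because the toy tree is tiny; with a real model the tree is far too large, which is why one explores it branch by branch.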

Another analogy, dubbed semiotic [? · GW] physics [LW · GW], proposes that GPTs represent a kind of ‘physics’ over a linguistic ‘universe’. Unlike that of our physical reality, which has an extremely large space of possible states and a relatively simple time evolution operator, this semiotic physics has a comparatively small space of states (e.g. sequences of token completions), but an extremely complex ‘time evolution operator’ containing all kinds of overlapping syntactic and semantic rules. The purpose of this frame is primarily to a) conceptualise and explore the model’s learned representations in more useful and richer ways than just ‘token sampling’, and b) re-consider GPT through borrowed ontologies, such as that of dynamical systems or of semantics [LW(p) · GW(p)]. The hope is that this may help establish more rigorous conceptions about model behavior while drawing on established formalisms like chaotic motion or attractor states.^[4]

Why these frames?

There's a fair amount [LW · GW] of discourse [LW · GW] surrounding the analogies given above - whether they describe these systems with high enough precision, the kinds of object-level claims they make, where they break, and so on. Some [LW · GW] have written [LW · GW] about how GPTs are better viewed simply as predictors, for instance, because that framing seemingly makes fewer questionable prescriptive claims about the model.^[5]

The purpose of frames such as simulators or semiotic physics isn't, however, to make any differing object-level claims about the properties or nature of these models. Their sole utility is in being evocative.

GPTs are hilariously high-dimensional in their mechanics, and we lack any concrete method to speak about them at high levels of fidelity, having neither satisfactory formal understanding nor sufficient interpretability. Our choice is either to pick a very grounded frame with high precision - for instance, you could form deep, sophisticated views on their properties simply by viewing them through the way transformers work generally or as next-token predictors - or a frame that attempts to speak to high-level properties, such as simulators.

The key factor here isn't which of them is ultimately correct - neither can really model these high-dimensional structures strictly accurately at a conceptual level - but which of them evokes higher-quality predictions about the properties of those structures from within a human mind [LW(p) · GW(p)]. Certainly, it's likely that you can ultimately converge onto the same predictions from within any frame, but the utility of a frame is in which of them makes those predictions more accessible.

Almost by definition, this is a subjective thing - it's entirely plausible that viewing GPTs as predictors or probabilistic constraint satisfaction problem solvers makes high-level properties more intuitive to you than viewing them as simulators would have. There's likely some shared threshold here - it seems less likely a priori that thinking purely about the transformer mechanism would allow you to predict the properties of these models as they scale or how they interact with fine-tuning/RLHF before other frames, for instance - but ultimately the value of frames, the value of simulators or semiotic physics, is in how evocative they are to you. There are no normative claims here, except maybe that there's a lot of value in frames which aren't built from the bottom-up of the models as long as they're still evocative.

What is meant by “model psychology”?

Large neural networks like GPT are well-known to be “black boxes” which we don’t currently understand well - an issue which is being addressed in different ways. Just as neuroscience investigates the human brain in a bottom-up way by studying low-level neural processes, so too do alignment researchers aim to use mechanistic interpretability to better understand how neural networks work. A contrasting approach, model psychology - analogizing from its human namesake - aims instead to make sense of neural networks by studying behavior and inferring from that what goes on internally. Examples of this kind of research include the Waluigi Effect [LW · GW] and Anthropic’s model-written evals project.

Practical cyborgism

Cyborgism and simulators were, as of early 2023, mostly theoretical speculations about the workings of LLMs, albeit informed by considerable practical investigations.^[6] An objective of AISC 2023 was to put these hunches and hypotheses on slightly firmer ground through interacting with models, documenting the team’s experience, and building tools and datasets that allow others to play around with cyborgism.

What are ‘variance friendly’ and ‘variance hostile’ domains and why is the distinction important?

When a language model generates text, that generated text (plus everything that came before) becomes the context (i.e. becomes the input for the next step in the autoregressive generation process) and therefore conditions future generations. Importantly, the GPT’s ‘internal workings’ (as of GPT-3) make no explicit distinction between the text in the original prompt and the additional generated text. Empirically, however, it has been observed that human-produced prompts (which often cluster at the start of an interaction, see below) and machine-made generations have a qualitatively different tone. In fact, as each generated token is added to the context, the distribution (which is being conditioned on this ever-shifting context) will drift causing denotative and connotative differences between the generated text and the original prompt to emerge, and as the interaction continues, become amplified. This can be thought of as an accumulation of error, but can alternatively be seen as an accumulation of “weirdness”, a subjective judgement but one that seems to be a feature of interactions with LLMs.

The variance friendly/hostile distinction refers to the fact that in some domains (e.g. writing fiction) there is a lot of flexibility for drift that doesn’t “break” what the human user was aiming to use GPT for; whereas other domains (e.g. scientific papers) are much more brittle, and small deviations from the original distribution strongly condition the model to stop producing useful output.

Put another way, in fiction, there are many ways the specific details of the story could change without drastically changing the type of text being generated (and in fact, the ‘weirdness’ referred to above might be occasionally desirable). When writing a scientific paper there is a very fine line between a good scientific paper and a terrible one, and a single mistake or false claim could strongly condition the model to begin producing correspondingly bad output (there is also an entangled subjective element: fails might condition the reader to downgrade the paper). Fiction provides a kind of buffer, where – because the domain isn’t as constrained – GPT can interpret variance as intentional choices by the author to take the story in a slightly different direction. There are certainly dimensions along which fiction isn’t very variance friendly either (for instance, if you switch to a poor writing style, the superposition could tend strongly toward a bad author), but there’s a quantitative distinction here.

However this distinction is not clear-cut. For instance, certain ‘fiction’ domains like poetry turn out to be difficult for current GPTs to handle (the poems they produce generally rhyme [LW(p) · GW(p)], and that in relatively un-inventive ways). There may also be non-fiction topics that, owing to their inherent imprecision or fuzziness, may be slightly more variance-friendly: within certain bounds (that depend on the problem and the preferences of the researcher), a model that tries out a variety of possible answers might still be useful. An example we looked at in AISC was moral philosophy; see also this exploration of coherent extrapolated volition (CEV) by janus, as well as their comments [LW · GW] on use cases and workflows.

GPTs augment sapience, they don’t cure ignorance

Cyborgists have observed that GPTs augment users’ own database [LW · GW] of knowledge and experience. This is intuitively defensible: a (base-model) GPT is trained on a large corpus encompassing much of the world’s digitized knowledge. However, to get the process started, the user often has to craft a skilful initial prompt that (empirically) is likely to elicit a useful answer. A subsequent step is to assess whether the model’s answer contains hallucinations or internal inconsistencies, and ask follow-on questions. The more ignorant a user is about a given domain, the more difficult this cyclical process of planning, research, oversight, and iteration might be. Again, this is closer to a feature than a bug, at least in a cyborgism framing: planning (as precursor and correlate of agentic behavior) is a capability the human user should retain.

In order to make prompting easier, a concrete hoped-for outcome of AISC 2023 is a library of prompts geared towards specific domains (fiction, philosophy, alignment, etc.) such that users, particularly those that are domain-knowledgeable, don’t have to spend hours teasing out useful work from the machine.

What is (are) Loom(s)?

Many users will have interacted with language models through chat-type interfaces such as ChatGPT, Bing’s chat feature, or specialized apps, 2D/3D avatars, etc. These interfaces have shortcomings - one may not be able to store or download interactions in a commonly-used format; they may also throttle the number of allowed interactions. A more substantial constraint is that certain interfaces, like ChatGPT, involve a base model (like code-davinci-002 or GPT-4) that has been ‘domesticated’ or ‘mode collapsed’ (less likely to spout nonsense, comment on politics, or produce racist/sexist/edgy content) through some further training, such as reinforcement learning from human feedback. These refinements may be entirely appropriate for a consumer product sold by a publicly-traded and regulated corporation, but constrain creative exploration of a topic or alignment research into the workings of language models.

In order to address these issues, janus made [LW · GW] an interface called Loom that allows for a potentially huge number of generations to be created, visualised in a tree and exported. It is useful to a cyborgism researcher because prompts (as well as parameters like completion length and temperature) can be varied in order to see how completions change.

There are many Loom-type tools, some of which are proprietary (Loom was originally built for Latitude/AI Dungeon and its successor Bonsai is Conjecture IP), while others like Python-Loom and Worldspider (a VS Code extension) are more freely available. With time there will be convergence on some open-source/public-access successor to these interfaces (note users will still need access to an actual language model, API keys, etc.).

Ways of looming

Loom-type tools are typically simple interfaces that do the following:

- Allow the user to create prompts, whether at the start of an interaction with a model (analogous to the start of a ChatGPT session), or throughout the session (in response to the model’s generated text) in order to steer the interaction. Users can also edit or insert text in previous prompts and then re-generate the relevant downstream nodes.
- Let the user select whatever parameters the model API requires (like model name, web endpoint, temperature, length of completion, number of desired completions), which makes experimentation possible.
- Package the prompt and parameters into an API call, send it off to the endpoint (i.e. the URL hosting the model), and wait for a response.
- Show the results of a series of interactions graphically as a branching tree, which the user can explore by zooming into individual sections.

As stated above, the entire accumulated context is, subject to the constraint of the context-window size, sent to the model endpoint. This context therefore includes the initial (and any subsequent) prompts as well as the model’s generations, hence the model (or simulator) is conditioning on a combination of user prompts and its own previous generations. This fact might help account for the path dependency and odd behavior/weirdness of the base model.
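
The steps above can be sketched as a small tree structure. This is a minimal illustration, not how Loom or any of its successors is actually implemented: `Node`, `expand`, `render`, and the stand-in `fake_complete` are hypothetical names, the completion function is a dummy in place of a real API call, and the character-based truncation is a crude proxy for a token-based context window.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    text: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)

    def context(self) -> str:
        """Concatenate text from the root down to this node."""
        parts = []
        node = self
        while node is not None:
            parts.append(node.text)
            node = node.parent
        return "".join(reversed(parts))


def expand(node, complete, n=3, max_context_chars=2000, **params):
    """Generate n children of `node` from the accumulated context."""
    # Crude stand-in for a token-based context window: keep only the tail.
    context = node.context()[-max_context_chars:]
    for completion in complete(context, n=n, **params):
        node.children.append(Node(text=completion, parent=node))
    return node.children


def render(node, depth=0):
    """Return the tree as indented lines, one per node."""
    lines = ["  " * depth + repr(node.text)]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


def fake_complete(context, n=1, temperature=1.0):
    """Hypothetical stand-in for an API call; returns n dummy completions."""
    return [f" [{len(context)} chars seen, option {i}]" for i in range(n)]


root = Node("Once upon a time")
children = expand(root, fake_complete, n=2, temperature=0.9)
expand(children[0], fake_complete, n=2)
print("\n".join(render(root)))
```

Passing the completion function in as an argument keeps the tree logic independent of any particular model or endpoint, so the same structure could wrap a base model, a tuned model, or a test stub.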

Some considerations re experimental setup

Dialogue or monologue?

Base models differ from more user-friendly models (like the Instruct series e.g. text-davinci-003 or ChatGPT) in that base models tend not to respond very well to a simple, short question or instruction. They easily repeat or ask questions or change topics (behaviors that are trained out of Instruct/RLHF models). Base models however seem to work better, in the sense of maintaining coherence, when the user demonstrates the behavior sought, e.g. by citing an example of a historical character saying something that resembles whatever type of answer the user is after. In my experience, they work particularly well when provided with a dialogue-type prompt.

Prompt once or throughout the conversation?

As stated above, a user can specify a more-or-less-long starting prompt only, and then guide generations by pruning off unwanted branches (i.e. not generating from or continuing those branches of the tree).

Another approach is to augment the starting prompt with additional prompts [LW · GW] during a conversation to encourage or nudge the simulacra in a certain direction. These incremental prompts will also show up in the tree and exported data.

Pruning is implicit conditioning

Importantly, even if users do not explicitly add further prompts (beyond the initial one), the fact that the user is pruning the tree and choosing branches to continue generating from, is, via the context window, implicitly ‘passing information’ back into the simulator (‘manual iterative rejection sampling [LW · GW]’ in janus’ terminology). Another way of thinking of this is that pruning is conditioning or re-mapping the probabilities over which generations are made, even though it obviously doesn’t affect the training distribution.
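
A toy sketch of this effect follows. The ‘model’ here (`sample_completions`) is a hypothetical stand-in that proposes words uniformly at random and has no preference of its own; the `keep` function plays the role of the human curator. All names are assumptions made for the example.

```python
import random


def sample_completions(context, n, rng):
    """Hypothetical stand-in for a model: proposes n continuations at random."""
    words = ["storm", "garden", "engine", "river", "ledger", "orchard"]
    return [" " + rng.choice(words) for _ in range(n)]


def loom_with_pruning(prompt, keep, steps, n, rng):
    """At each step, generate n branches and continue only from one the curator keeps."""
    context = prompt
    for _ in range(steps):
        candidates = sample_completions(context, n, rng)
        accepted = [c for c in candidates if keep(c)]
        if not accepted:
            continue  # every branch was pruned; try again at the next step
        context += accepted[0]  # the kept branch becomes part of the context
    return context


nature = {"garden", "river", "orchard", "storm"}
start = "The walk led past the"
result = loom_with_pruning(
    start,
    keep=lambda c: c.strip() in nature,
    steps=5,
    n=4,
    rng=random.Random(0),
)
print(result)
```

Although the curator never writes a word, the resulting context contains only curator-approved words, and with a real model that context would condition everything generated afterwards.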

Base to explore / RLHF to exploit

One practical approach to looming might be to work with both base and RLHF-tuned models, leveraging the creativity of the base model alongside the ‘well-behaved’ nature (better logical reasoning, factual recall, and/or reduced tendency to spout nonsense) of a tuned model. For instance, one might use a base model to generate a large number of branches (corresponding to research directions or diverse story-lines or questions about a topic), then pursue the most promising ones in a tuned model.^[7]

Model-written prompts

Another commonly used approach is to use language models to assist with their own prompt engineering. Humans are generally worse at writing than language models (in terms of style). They also inevitably leak all kinds of information in ways that are hard to both notice and avoid, buried in the specific word choice, vibe, or cadence of the text.

Looming parameters

Temperature: this is the most obvious parameter to vary to encourage more chaotic jumps in the content and tone of completions (0.5 to just above 1.0 is best, and users report that for temperatures above 1.5, the results are gibberish for most applications, while low temperatures might correlate with more repetitive completions).^[8]
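
For intuition about what temperature does, the sketch below applies the standard temperature-scaled softmax to a handful of made-up logits. The logit values are arbitrary, and real APIs may combine temperature with other sampling settings, so treat this as an illustration of the general mechanism only.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities, dividing by the temperature first."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens
for t in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

Lower temperatures concentrate probability on the top-scoring token, while higher temperatures flatten the distribution so that unlikely tokens are sampled more often.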

Completion length: how many tokens are in each completion. Where to set this is a matter of taste and specific use case; longer completions take more time (for the user) to read/evaluate, and might result in more chaotic formatting (particularly for dialogues). But they may also develop narratives in more interesting ways.

Number of completions: one approach in AISC has been to have 5-10 completions near the start of a tree to maximise variance, and then prune such that, a few levels into the tree, one is running <3 completions.^[9] This both speeds up the process, and keeps the model focused on whatever is the desired balance between creativity and cohesiveness. Sometimes deep in a tree, one might sense a promising filament that is stuck in some boring thread, in which case the user can increase the temperature or completion number to ‘unravel’ or draw out that filament.

Completions as a scarce resource: the choices above will be affected by the cost of completions. If APIs are usage-limited or become more expensive, one might be more stingy or conservative in the choices made. In particular, this might push one to generate fewer and longer completions. There is a semi-fixed cost of each API call, because each API call must "start over" feeding in the entire prompt from the beginning, and so generating many short completions in series is much more expensive than generating only a few longer ones.
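
A back-of-the-envelope sketch of this cost, assuming (hypothetically) that an API bills for every prompt token re-sent plus every token generated, with no caching, and that each call in a series extends the context of the previous one. The function name and the token counts are made up for illustration.

```python
def tokens_billed(prompt_tokens, total_new_tokens, calls):
    """Tokens processed when total_new_tokens are generated over `calls` serial calls."""
    if total_new_tokens % calls:
        raise ValueError("total_new_tokens must divide evenly across calls")
    per_call = total_new_tokens // calls
    billed = 0
    context = prompt_tokens
    for _ in range(calls):
        # The whole context is re-sent, then per_call new tokens are generated.
        billed += context + per_call
        context += per_call
    return billed


few_long = tokens_billed(1000, 400, calls=2)
many_short = tokens_billed(1000, 400, calls=20)
print(few_long, many_short)
```

Under these assumptions, producing 400 new tokens from a 1,000-token prompt costs 2,600 tokens in two calls but 24,200 tokens in twenty calls, because the prompt is paid for again on every call.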

What is mode collapse and why does it suck?

Mode collapse [LW · GW] is the tendency of certain models, like those refined with RLHF, to give answers that are stylized: the answers for a prompt become similar across multiple generations (even with a high temperature setting which should theoretically increase variance), more so than a directly comparable base model (janus’ example compares text-davinci-002 and its equivalent base model davinci). In some cases, but not always [LW · GW], the tuned model elicits a ‘corporate’, ‘boring’ (cleonardo’s [LW · GW] term), inoffensive, or excessively-hedged tone. See also this, from gwern [LW · GW], for intuitions for how RLHF-tuned and base models differ.

In terms of actual outputs, mode collapse can take many forms [LW · GW]: straightforward repeating of previous tokens/words; or a seeming cyclicality in responses which might look like sentences or pages of text being repeated. Note that mode collapse is somewhat subjective [LW · GW] i.e. it might be more obvious in some contexts than others; moreover, it is not absent from base models, and sometimes shows up late in a dialogue-type interaction.
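
One crude way to put a number on how similar a set of completions is would be their mean pairwise word overlap. The sketch below uses Jaccard similarity over word sets. This is an assumption chosen for illustration, not a measure used in the posts cited above, and the example completions are invented.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap between the word sets of two strings (1.0 means identical sets)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def mean_pairwise_similarity(completions):
    """Average Jaccard similarity over all pairs of completions."""
    pairs = list(combinations(completions, 2))
    if not pairs:
        raise ValueError("need at least two completions")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Invented completions for the same prompt, for illustration only.
collapsed = [
    "As an AI, I cannot say for certain.",
    "As an AI, I cannot say for sure.",
    "As an AI, I cannot be certain.",
]
varied = [
    "The lighthouse keeper counted ships.",
    "Seven is a number with a long memory.",
    "Nobody asked the river what it wanted.",
]
score_collapsed = mean_pairwise_similarity(collapsed)
score_varied = mean_pairwise_similarity(varied)
print(round(score_collapsed, 2), round(score_varied, 2))
```

A surface measure like this would miss collapse that shows up as repeated themes or tone rather than repeated words, which is part of why the judgement remains somewhat subjective.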

At a deeper level, the dynamics underlying mode collapse aren't just present in base models, they're the very same dynamics [LW · GW] that power their cognition. Pre-training serves to collapse the universal agnostic prior onto a distribution representing our world, and RLHF/fine-tuning further narrows (and shifts) this prior onto certain posteriors which are very under-specified.^[10] The latter is more noticeable as "something going wrong" both because we don't really interface with the models at the right level of abstraction during tuning^[11] to understand the ramifications of the posterior shift and because the former distribution is quite literally the world we're familiar with, but at a deeper level the phenomenon is pretty similar.

Mode collapse is undesirable from the perspective of a cyborgist (who might be interested in eliciting a number of very different completions, which is after all why they are looming) or a researcher (who is trying to understand the behavior of the base model, or understand the shape of the ‘true’ training distribution, as opposed to a posterior that has been shaped by RLHF or some other tuning process).

Example: situational awareness

Meta: in the interests of keeping this post manageable, we have shelved the original idea of presenting situational awareness (note ‘situational awareness’ is being used in a somewhat unusual sense vis a vis other alignment contexts i.e. we mean ‘LLM simulacra realizing they are in a simulation’ rather than ‘some AI works out that it is an AI being trained/deployed by humans’) as an example of ‘cyborgism in practice’. If anyone is interested in specifics of my (Kanad’s) interactions with code-davinci-002/ChatGPT-4 in various contexts, please refer to this and let us know if you think it has value.

Conclusion

Any specific relevance to alignment research?

Are there concrete [LW · GW] ways in which cyborgism, which advocates an enthusiastic albeit cautious mixture of anthropomorphic and alien-intelligence framings of AI, helps us develop better intuitions, tools, and workflows for alignment research?

In the narrowest sense, technical alignment research probably falls on the variance-unfriendly side of the friendly/unfriendly (fuzzy) border drawn above (i.e. there are probably not very many ways to design an aligned intelligence). However, actually designing an AI, or more prosaically, conducting an experiment/research project, has many aspects: literature review; finding a research methodology; designing a setup; writing up results.

At each stage, one can imagine a workflow (probably with custom-made tools) that would take advantage of a model’s ability to generate a range of possible completions, as an aid to creativity. For instance, literature review/summarization (which includes surveying research methodologies or experimental setups) and writing up results seem roughly within the scope of what current models can do. On the other hand, actually coming up with (decent/provable/viable) theories, theorems, or new experiments seems beyond current models. Moreover, current models are still prone to hallucination and factual errors, don’t have built-in web access, have limited memory, and have distinct training/learning and inference phases (i.e. do not learn after deployment, aside from the special/limited case of in-context learning) - hence they can’t be relied upon for precise technical or planning tasks. But it is possible [LW · GW] that some of these constraints might prove temporary.^[12]

If indeed future models can meaningfully conduct technical alignment research ‘on their own’, then the salience of cyborgism as an approach is to ensure there are guardrails/constraints on what they actually can do. There may be critical and non-obvious steps in an iterative research plan, that the human’s intuition or insight (for lack of better terms) contributes, which should not be abdicated to the model.

More broadly speaking, if instead of ‘technical alignment research’, one is interested in ‘AI safety’ (if the term is understood as including 'AI governance/ethics'), the ability to leverage model capabilities might increase significantly, simply owing to a larger space of research (i.e. sociology, philosophy, international relations, economics), often entangled as a wicked problem, which must be compressed, summarized, re-configured, or otherwise ‘learned from’.

There might also be a tooling aspect: consider an interface that checks every instruction the human researcher sends to the helper AI, looking out for potential agent-like behavior, and re-confirms with the researcher before adding a possibly hazardous task to the queue. This ‘screening interface’ might even have an enforcement function (alerting managers or regulators if a researcher is intentionally sending risky prompts to a powerful model). Since multiple users may be interacting with one or more models, this interface may also need to assess correlated risk: although any given task was designed to be singly non-agentic or non-risky, the interactions taken jointly might result in a dangerous mode emerging (unplanned agency, situational awareness, or coordination amongst instances).

What’s the connection to evals?

Cyborgism can be seen as a complement to model evaluation. Evaluations, or ‘evals’ [? · GW] are a broad class of approaches that allow researchers to predict how a model might perform when deployed, particularly when pushed into situations dissimilar to what it was trained in, or when used in a manner not foreseen by its creators. The goal is to assess how catastrophic the model’s failure might be.

Evaluation can conceptually be thought of as falling into two broad camps: understanding-based approaches that examine the inner workings of systems, create a detailed (in the sense of causal, mechanistic, mathematical, or engineering) model of the system’s internals, and use that understanding to predict future behavior.^[13] The other approach (within which cyborgism generally fits) is behavioral: use carefully-designed tests and scenarios to identify or predict unexpected or problematic tendencies, sort of like a ‘psychological evaluation’ of the system. This latter approach can be further split into alignment- and capability-relevant sub-approaches.

Neither agenda is particularly advanced, but an example of the latter approach can be found in ARC Evals’ [LW · GW] work for OpenAI, as outlined in Section 2 of the GPT-4 system card.

Will this matter under GPT-n?

As stated above, looming-type approaches seem to yield their greatest value in base models. In the future, there might be fewer state-of-the-art base models released, while the released (tuned) models may exhibit less obvious forms of mode collapse, stylized behavior, or other weaknesses [LW · GW] (vis-à-vis base models and use cases). Nevertheless, there is likely to continue to be a rich ecosystem of open-source models (like BLOOM, GPT-J, HuggingGPT) in base form. Thus, we think of cyborgism more as a 'practice' or 'attitude' towards language models, one which speaks to their current peculiarities but hopefully develops workflows and frames that remain relevant to their successors.

^[1]
See this relatively balanced take on RLHF from Paul Christiano [LW · GW]. The LessWrong tag page also has a broad [? · GW] set of views on the topic.
^[2]
However, the degree to which agentic behavior can emerge from language models is a nuanced question. Within the Simulator [LW · GW] framing, GPTs are capable of simulating an agent and, empirically speaking, GPT-3 and GPT-4 clearly can make statements that indicate intentionality, agency, planning, or personality. However, they tend not to follow up on or ‘commit’ to those statements over long interactions. Moreover, whether they can act on plans or conceptions of self in ways that matter (e.g. get on the internet and order a biological pathogen to be mixed up) is evolving and is an area of investigation (e.g. ARC Evals’ work on GPT-4), but currently there seem to be very limited planning or nonlinear-thinking abilities.
^[3]
The degree to which context can influence future generations is currently limited by the size of the context window (2048 tokens in GPT-3, up to 32,000 tokens in GPT-4, and 100,000 tokens for Anthropic’s Claude, as of May 2023). Although this is likely to be ameliorated in future models, current GPTs, which operate with a fixed context window, need to be engineered to ensure relevant material doesn’t fall ‘out-of-context’. Approaches to do so include periodically summarising or compressing the accumulated context (probably using another instance of the model), or factoring the top-level task, insofar as it is described in the context, into sub-tasks that are executed by additional instances of a model.
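As a rough illustration of the summarise-and-compress approach, the Python sketch below folds the oldest parts of a context into a summary until the whole fits a token budget. The names `count_tokens` and `compress_context` are hypothetical, the word-count tokenizer is a crude stand-in for a real one, and `summarize` is a caller-supplied function (in practice probably a call to another model instance).

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


def compress_context(
    chunks: List[str],
    summarize: Callable[[str], str],
    max_tokens: int,
) -> List[str]:
    """Fold the oldest chunks into a summary until the context fits the budget."""
    chunks = list(chunks)  # work on a copy so the caller's list is untouched
    while sum(count_tokens(c) for c in chunks) > max_tokens and len(chunks) > 1:
        # Merge the two oldest chunks and replace them with a single summary.
        merged = summarize(chunks[0] + "\n" + chunks[1])
        chunks = [merged] + chunks[2:]
    return chunks
```

The loop always terminates because each pass removes one chunk, but it only gets under the budget if `summarize` actually shortens its input. Recent chunks are left verbatim, on the assumption that the most recent material matters most for the next generation.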
^[4]
As cleonardo [LW · GW] writes, semiotic physics tries to build a ‘bridge’ from one ontology, namely LLM Dynamic Theory (which treats LLMs as Markov-like transitions between page-states of tokens), to another ontology, that of LLMs as Simulators (which uses simulacra, character traits, superpositions, waluigis, etc. to describe the model behavior that cyborgists and GPT users observe). This bridge is not a formal thing, since the latter ‘ontology’ is not well-defined at this point, so any correspondence is purely based on observations of behavior. Despite being a work-in-progress, the hope is that a variety of such framings can help us understand and hopefully predict the behavioral tendencies of LLMs. This is a pragmatic target, given that we do not currently have explanatory adequacy [LW · GW] over the workings of LLMs: neither a complete mathematical theory nor satisfactory mechanistic interpretability.
^[5]
See DragonGod's comment [LW(p) · GW(p)] on one of the linked posts, or Janus' succinct summary reply [LW(p) · GW(p)], for a comparison between the two frames.
^[6]
Carried out by Janus as well as others coordinating on the Cyborgism Discord server.
^[7]
This is an especially tentative statement, since (as of mid-April 2023) the AISC Cyborgism team have been working with code-davinci-002, a GPT-3 base model, and then refining the outputs using ChatGPT-4 (which is a tuned version of GPT-4 and is trained on a different and somewhat newer dataset than GPT-3). GPT-4 and GPT-3 also have significant capability differences. So this base/tuned hybrid approach does not amount to an apples-to-apples comparison between base and tuned versions of GPT-4.
^[8]
When using ChatGPT-4 in a Loom environment, users typically find it useful to increase the temperature to around 1.2-1.3. Lower temperatures are generally very deterministic (due to mode collapse), and higher temperatures produce word salad.
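For readers unfamiliar with the temperature parameter, the following Python sketch shows the standard mechanic: logits are divided by the temperature before the softmax, so low values sharpen the distribution and high values flatten it. The logits are made up for illustration, and the function name is our own. The sketch shows only the general sampling arithmetic, not the mode-collapse effect specific to tuned models.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Turn logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
low = softmax_with_temperature(logits, 0.3)    # sharply peaked on the top token
mid = softmax_with_temperature(logits, 1.25)   # in the range suggested above
high = softmax_with_temperature(logits, 5.0)   # close to uniform
```

With these toy logits, the top token's probability falls steadily as the temperature rises, which is the sense in which low temperatures look deterministic and high temperatures drift towards near-random token choices.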
^[9]
See these comments by Janus on adaptive branching.
^[10]
There are two related phenomena taking place that are worth distinguishing: 1) a general loss of variance and 2) the development of powerful attractor states. A quality of collapsed models is that their outputs are surprisingly resistant to perturbations of the input (relative to base models). This phenomenon of attraction is also present in base models, but their outputs are more likely to diverge: small perturbations result in drastic changes to the trajectory (analogous to chaotic motion in a physical system).
^[11]
Specifically, RLHF (and to a lesser extent, fine-tuning) in its current form tries to interface with the high-dimensional internals of the model through the much lower-dimensional channel of its outputs, leading to an under-specification problem.
^[12]
It is an open question, or at least out-of-scope for us, whether heavily scaffolded systems like AutoGPT can move significantly past the proof-of-concept stage and introduce planning and agentic capabilities to models (like GPT-3 or even GPT-4) that are (probably) not ‘natively’ trained as planners/agents.
^[13]
Note that this is more commonly called ‘transparency’ or ‘interpretability’. Nomenclature aside, it is possible that the border between understanding- and behavior-based evaluations will become increasingly blurry as this sub-field develops.

2 comments

comment by mishka · 2023-05-19T13:55:27.129Z · LW(p) · GW(p)

Thanks for the great post!

In the future, there might be fewer state-of-the-art base models released

Note that Sam Altman seems to have promised access to base-GPT-4 model to researchers:

sama March 22 tweet

The OpenAI Researcher Access Program application notes specifically:

The GPT-4 base model is currently being made available to a limited subset of researchers who are studying alignment or the risks and impact of AI systems.

I hope that more researchers in this subset apply for access.

I also hope that people who apply would also inform the community about the status of such applications: is access actually being granted (and if not, is there a response at all), what are the restrictions in terms of the ability to use loom-like tools (which tend to be more compute-intensive compared to pedestrian use), what are the restrictions if any in terms of the ability to share results, etc.

comment by janus · 2023-05-20T13:52:52.922Z · LW(p) · GW(p)

Awesome post! I've added it to the Cyborgism sequence [? · GW].

One comment:

it's entirely plausible that viewing GPTs as predictors or probabilistic constraint satisfaction problem solvers makes high-level properties more intuitive to you than viewing them as simulators

I disagree with the implied mutual exclusivity of viewing GPTs as predictors, probabilistic constraint satisfaction problem solvers, and simulators. A deep/holistic understanding of self-supervised simulators entails a model of probabilistic constraint solvers, a deep/holistic understanding of prediction (+ sampling) entails simulation, etc. Several of my sadly still unpublished posts in the Simulators sequence elaborate on the probabilistic boundary value problem solver analogy. Going through the "probabilistic laws of physics" analogy is a simple way to see how it is equivalent to the (semiotic physics) simulators frame.
