Posts

You should go to ML conferences 2024-07-24T11:47:52.214Z
The Living Planet Index: A Case Study in Statistical Pitfalls 2024-06-24T10:05:55.101Z
Announcing Human-aligned AI Summer School 2024-05-22T08:55:10.839Z
InterLab – a toolkit for experiments with multi-agent interactions 2024-01-22T18:23:35.661Z
Box inversion revisited 2023-11-07T11:09:36.557Z
Snapshot of narratives and frames against regulating AI 2023-11-01T16:30:19.116Z
We don't understand what happened with culture enough 2023-10-09T09:54:20.096Z
Elon Musk announces xAI 2023-07-13T09:01:01.278Z
Talking publicly about AI risk 2023-04-21T11:28:16.665Z
The self-unalignment problem 2023-04-14T12:10:12.151Z
Why Simulator AIs want to be Active Inference AIs 2023-04-10T18:23:35.101Z
Lessons from Convergent Evolution for AI Alignment 2023-03-27T16:25:13.571Z
The space of systems and the space of maps 2023-03-22T14:59:05.258Z
Cyborg Periods: There will be multiple AI transitions 2023-02-22T16:09:04.858Z
The Cave Allegory Revisited: Understanding GPT's Worldview 2023-02-14T16:00:08.744Z
Deontology and virtue ethics as "effective theories" of consequentialist ethics 2022-11-17T14:11:49.087Z
We can do better than argmax 2022-10-10T10:32:02.788Z
Limits to Legibility 2022-06-29T17:42:19.338Z
Continuity Assumptions 2022-06-13T21:31:29.620Z
Announcing the Alignment of Complex Systems Research Group 2022-06-04T04:10:14.337Z
Case for emergency response teams 2022-04-05T12:45:08.371Z
Hinges and crises 2022-03-29T11:11:03.605Z
Experimental longtermism: theory needs data 2022-03-24T08:23:40.454Z
Risk Map of AI Systems 2020-12-15T09:16:46.852Z
Epistea Workshop Series: Epistemics Workshop, May 2020, UK 2020-02-28T10:37:34.229Z
Epistea Summer Experiment (ESE) 2020-01-24T10:49:35.228Z
Epistea Summer Experiment 2019-05-13T21:29:43.681Z
Isaac Asimov's predictions for 2019 from 1984 2018-12-28T09:51:09.951Z
Multi-agent predictive minds and AI alignment 2018-12-12T23:48:03.155Z
CFAR reunion Europe 2018-11-27T12:02:36.359Z
Why it took so long to do the Fermi calculation right? 2018-07-02T20:29:59.338Z
Dissolving the Fermi Paradox, and what reflection it provides 2018-06-30T16:35:35.171Z
Effective Thesis meetup 2018-05-31T19:49:56.285Z
Far future, existential risk, and AI alignment 2018-05-10T09:51:43.278Z
Review of CZEA "Intense EA Weekend" retreat 2018-04-05T23:04:09.398Z
Brno: Far future, existential risk and AI safety 2018-04-02T19:11:06.375Z
Life hacks 2018-04-01T10:29:20.023Z
Welcome to LessWrong Prague [Edit With Your Details] 2018-04-01T10:23:36.557Z
Reward hacking and Goodhart’s law by evolutionary algorithms 2018-03-30T07:57:05.238Z
Optimal level of hierarchy for effective altruism? 2018-03-27T22:38:27.967Z
GoodAI announced "AI Race Avoidance" challenge with $15k in prize money 2018-01-18T18:05:09.811Z
Nonlinear perception of happiness 2018-01-08T09:04:15.314Z

Comments

Comment by Jan_Kulveit on You should go to ML conferences · 2024-07-25T12:37:37.299Z · LW · GW

I'm skeptical of the 'wasting my time' argument.

A stance like 'going to poster sessions is great for young researchers, I don't do it anymore and just meet friends' is high-status, so, on priors, I would expect people to adopt it more than is optimal.

Realistically, a poster session is ~1.5h, maybe 2h with skimming what to look at. It is relatively common for people in AI to spend many hours per week digesting the news on Twitter. I really doubt the per-hour efficiency of following Twitter is better than that of poster sessions when approached intentionally. (While obviously aimlessly wandering between endless rows of posters is approximately useless.)

Comment by Jan_Kulveit on You should go to ML conferences · 2024-07-24T16:43:06.374Z · LW · GW

Corrected!

Comment by Jan_Kulveit on The last era of human mistakes · 2024-07-24T12:00:43.173Z · LW · GW

I broadly agree with this - we tried to describe a somewhat similar set of predictions in Cyborg periods.

Comment by Jan_Kulveit on List of Collective Intelligence Projects · 2024-07-02T21:15:12.711Z · LW · GW

Surprised you haven't heard about any facilitated communication tools. 

Comment by Jan_Kulveit on LLM Generality is a Timeline Crux · 2024-06-24T22:23:46.931Z · LW · GW

A few thoughts
- actually, these considerations mostly increase uncertainty and variance about timelines; if LLMs miss some magic sauce, it is possible that smaller systems with the magic sauce could be competitive, and we could get really powerful systems sooner than Leopold's lines predict
- my take on one important thing which makes current LLMs different from humans is the gap described in Why Simulator AIs want to be Active Inference AIs; while that post intentionally avoids having a detailed scenario part, I think the ontology introduced is better for thinking about this than scaffolding
- not sure if this is clear to everyone, but I would expect the discussion of unhobbling to be one of the places where Leopold would need to stay vague to not breach OpenAI confidentiality agreements; for example, if OpenAI was putting a lot of effort into making LLM-like systems better at agency, I would not expect him to describe specific research and engineering bets

Comment by Jan_Kulveit on TsviBT's Shortform · 2024-06-19T13:20:53.324Z · LW · GW

Agreed we would have to talk more. I think I mostly get the homunculi objection. Don't have time now to write an actual response, so here are some signposts:
- part of what you call agency is explained by a roughly active-inference style of reasoning
-- some types of "living" systems are characterized by having boundaries between them and the environment (boundaries mostly in the sense of separation of variables)
-- maintaining the boundary leads to a need to model the environment
-- modelling the environment introduces a selection pressure toward approximating Bayes
- the other critical ingredient is boundedness
-- in this universe, negentropy isn't free
-- this introduces a fundamental tradeoff / selection pressure for any cognitive system: length isn't free, bitflips aren't free, etc.
(--- downstream of that is compression everywhere, abstractions)
-- empirically, the cost/returns function for scaling cognition usually hits diminishing returns, leading to minds where it's not effective to grow the single mind further
--- this leads to the basin of convergent evolution I call "specialize and trade"
-- empirically, for many cognitive systems, there is a general selection pressure toward modularity
--- I don't know all the reasons for that, but one relatively simple one is 'wires are not free'; if wires are not free, you get colocation of computations, like brain regions or industry hubs
--- other possibilities are selection pressures from CAP theorem, MVG, ...
(modularity also looks a bit like box-inverted specialize and trade)

So, in short: I agree with the spirit of 'if humans didn't have a fixed skull size, you wouldn't get civilization with specialized members', and my response is that there seems to be an extremely general selection pressure in this direction. If cells were able to just grow in size and it was efficient, you wouldn't get multicellular organisms. If codebases were able to just grow in size and it was efficient, I wouldn't have a myriad of packages on my laptop; it would all be just the kernel. (But even if it were just the kernel, it seems modularity would kick in and you would still get the 'distinguishable parts' structure.)

Comment by Jan_Kulveit on TsviBT's Shortform · 2024-06-17T01:23:41.073Z · LW · GW

That's why solving hierarchical agency is likely necessary for success

Comment by Jan_Kulveit on Former OpenAI Superalignment Researcher: Superintelligence by 2030 · 2024-06-05T09:17:39.212Z · LW · GW

(crossposted from twitter) Main thoughts: 
1. Maps pull the territory 
2. Beware what maps you summon 

Leopold Aschenbrenner's series of essays is a fascinating read: there is a ton of locally valid observations and arguments. A lot of the content is the type of stuff mostly discussed in private. Many of the high-level observations are correct.

At the same time, my overall impression is that the set of maps sketched pulls toward existential catastrophe, and this is true not only for the 'this is how things can go wrong' part, but also for the 'this is how we solve things' part. Leopold is likely aware of this angle of criticism, and deflects it with 'this is just realism' and 'I don't wish things were like this, but they most likely are'. I basically don't buy that claim.

Comment by Jan_Kulveit on The Alignment Problem No One Is Talking About · 2024-05-17T11:31:10.715Z · LW · GW

You may be interested in 'The self-unalignment problem' for some theorizing https://www.lesswrong.com/posts/9GyniEBaN3YYTqZXn/the-self-unalignment-problem

Comment by Jan_Kulveit on Examples of Highly Counterfactual Discoveries? · 2024-04-24T14:04:42.741Z · LW · GW

Mendel's Laws seem counterfactual by about ~30 years, based on partial re-discovery taking that much time. His experiments are technically something someone could have done basically any time in the last few thousand years, given basic maths.

Comment by Jan_Kulveit on GPTs are Predictors, not Imitators · 2024-04-22T11:43:02.633Z · LW · GW

I do agree the argument "We're just training AIs to imitate human text, right, so that process can't make them get any smarter than the text they're imitating, right?  So AIs shouldn't learn abilities that humans don't have; because why would you need those abilities to learn to imitate humans?" is wrong and clearly the answer is "Nope". 

At the same time, I do not think parts of your argument in the post are locally valid or a good justification for the claim.

A correct and locally valid argument for why GPTs are not capped at human level was already written here.

In a very compressed form, you can just imagine GPTs have text as their "sensory inputs" generated by the entire universe, similarly to you having your sensory inputs generated by the entire universe. Neither human intelligence nor GPTs are constrained by the complexity of the task (also: in the abstract, it's the same task). Because of that, "task difficulty" is not a promising way to compare these systems, and it is necessary to look into actual cognitive architectures and bounds.

With the last paragraph, I'm somewhat confused by what you mean by "tasks humans evolved to solve". Does e.g. sending humans to the Moon, or detecting the Higgs boson, count as a "task humans evolved to solve" or not?

Comment by Jan_Kulveit on FHI (Future of Humanity Institute) has shut down (2005–2024) · 2024-04-18T16:23:47.194Z · LW · GW

I sort of want to flag that this interpretation of whatever gossip you heard seems misleading / only telling a small part of the story, based on my understanding.

Comment by Jan_Kulveit on Express interest in an "FHI of the West" · 2024-04-18T11:15:28.381Z · LW · GW

I would imagine I would also react to it with a smile in the context of an informal call. When used as a brand / "fill in the interest form here", I just think it's not a good name, even if I am sympathetic to proposals to create more places to do big-picture thinking about the future.

Comment by Jan_Kulveit on Express interest in an "FHI of the West" · 2024-04-18T08:33:24.300Z · LW · GW

Sorry, but I don't think this should be branded as "FHI of the West".

I don't think you personally or Lightcone share that much intellectual taste with FHI or Nick Bostrom - Lightcone seems firmly in the intellectual tradition of Berkeley, shaped by orgs like MIRI and CFAR. This tradition was often close to FHI's thinking, but also quite often in tension with it. My hot take is you particularly miss part of the generators of the taste which made FHI different from Berkeley. I sort of dislike the "FHI" brand being used in this way.

edit: To be clear, I'm strongly in favour of creating more places for FHI-style thinking; I just object to the branding / "let's create a new FHI" frame. Owen expressed some of the reasons better and in more depth.

Comment by Jan_Kulveit on Why Simulator AIs want to be Active Inference AIs · 2024-02-04T20:17:28.840Z · LW · GW

You are exactly right that active inference models which behave in a self-interested or any coherently goal-directed way must have something like an optimism bias.

My guess about what happens in animals and to some extent humans: part of the 'sensory inputs' are interoceptive, tracking internal body variables like temperature, glucose levels, hormone levels, etc. Evolution already built a ton of 'control theory type circuits' in bodies (even building a body from a single cell is an extremely impressive optimization task...). This evolutionarily older circuitry likely encodes a lot about what evolution 'hopes for' in terms of what states the body will occupy. Subsequently, when building predictive models and turning them into active inference, my guess is that a lot of the specification is done by 'fixing priors' of interoceptive inputs on values like 'not being hungry'. The later-learned structures then also become a mix between beliefs and goals: e.g. the fixed prior on my body temperature during my lifetime leads to a model where I get a 'prior' about wearing a waterproof jacket when it rains, which becomes something between an optimistic belief and a 'preference'. (This retrodicts that a lot of human biases could be explained as "beliefs" somewhere between "how things are" and "how it would be nice if they were".)


But this suggests an approach to aligning embedded simulator-like models: Induce an optimism bias such that the model believes everything will turn out fine (according to our true values)
 

My current guess is that any approach to alignment which will actually lead to good outcomes must include some features suggested by active inference. E.g. active inference suggests that an 'aligned' agent which is trying to help me likely 'cares' about my 'predictions' coming true, and has some 'fixed priors' about me liking the results. This gives me something avoiding both 'my wishes were satisfied, but in bizarre goodharted ways' and 'this can do more than I can'.

Comment by Jan_Kulveit on What rationality failure modes are there? · 2024-01-20T00:07:47.140Z · LW · GW


- Too much value and too positive feedback on legibility. Replacing smart illegible computations with dumb legible stuff
- Failing to develop actual rationality and focusing on cultivation of the rationalist memeplex  or rationalist culture instead
- Not understanding the problems with the theoretical foundations on which the Sequences are based (confused formal understanding of humans -> confused advice)

Comment by Jan_Kulveit on Tyranny of the Epistemic Majority · 2024-01-19T00:59:56.706Z · LW · GW

+1 on the sequence being one of the best things in 2022.

You may enjoy an additional / somewhat different take on this from population/evolutionary biology (and here). (To translate the map, you can think about yourself as a population of 'myselves'. Or, in the opposite direction, from a gene-centric perspective it obviously makes sense to think about the population as a population of selves.)

Part of the irony here is that evolution landed on the broadly sensible solution (geometric rationality). However, after almost everyone doing the theory got somewhat confused by the additive, linear-EV rationality maths, what most animals (and often humans, at the S1 level) do got interpreted as 'cognitive bias' - in the spirit of assuming obviously stupid evolution was not able to figure out linear argmax-over-utility algorithms in a few billion years.

I guess the limited engagement is caused by
- the relation between the 'additive' and 'multiplicative' pictures being deceptively simple in a formal way (see the toy sketch below)
- the conceptual understanding of what's going on and why being quite tricky; one reason is, I guess, that our S1 / brain hardware runs almost entirely in the multiplicative / log world, while people train their S2 understanding on the linear, additive picture; as Scott explains, the maths formalism fails us
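To make the additive vs multiplicative gap concrete, here is a minimal toy sketch (my own illustrative numbers, not taken from the sequence): a repeated bet with positive additive EV whose log-wealth growth rate is negative.

```python
import math
import random

# Toy bet: each round, wealth is multiplied by 1.5 with p = 0.5, or by 0.6 with p = 0.5.
# Additive (arithmetic) expectation per round: 0.5*1.5 + 0.5*0.6 = 1.05  -> looks great.
# Multiplicative (geometric) expectation:      sqrt(1.5*0.6) ~= 0.949    -> wealth decays.

random.seed(0)

arithmetic_mean = 0.5 * 1.5 + 0.5 * 0.6
geometric_mean = math.sqrt(1.5 * 0.6)

wealth = 1.0
rounds = 1000
for _ in range(rounds):
    wealth *= 1.5 if random.random() < 0.5 else 0.6

print(f"arithmetic mean per round: {arithmetic_mean:.3f}")
print(f"geometric mean per round:  {geometric_mean:.3f}")
print(f"wealth after {rounds} rounds of always betting everything: {wealth:.3e}")
```

Always taking the bet maximizes the additive EV of each round, yet almost surely drives wealth toward zero; the geometric / log-space picture is the one that tracks what actually happens to the bankroll or lineage.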

Comment by Jan_Kulveit on Limits to Legibility · 2024-01-15T08:58:32.896Z · LW · GW

This is a short self-review, but with a bit of distance, I think understanding 'limits to legibility' is one of maybe the top 5 things an aspiring rationalist should deeply understand; lack of this leads to many bad outcomes in both the rationalist and EA communities.

In a very brief form: maybe the most common cause of EA problems and stupidities is attempts to replace illegible S1 boxes able to represent human values such as 'caring' with legible, symbolically described, verbal moral reasoning subject to memetic pressure.

Maybe the most common cause of rationalist problems and difficulties with coordination is cases where people replace illegible smart S1 computations with legible S2 arguments.

Comment by Jan_Kulveit on The shard theory of human values · 2024-01-15T08:34:08.497Z · LW · GW

In my personal view, 'Shard theory of human values' illustrates both the upsides and pathologies of the local epistemic community.

The upsides
- the majority of the claims are true or at least approximately true
- "shard theory" as a social phenomenon reached critical mass, making the ideas visible to the broader alignment community, which works e.g. by talking about them in person, votes on LW, a series of posts, ...
- shard theory coined a number of locally memetically fit names or phrases, such as 'shards'
- part of the success led some people in the AGI labs to think about the mathematical structure of human values, which is an important problem

The downsides
- almost none of the claims which are true are original; most of this was described elsewhere before, mainly in the active inference/predictive processing literature, or thinking about multi-agent mind models
- the claims which are novel usually seem somewhat confused (e.g. human values being inaccessible to the genome, or naive RL intuitions)
- the novel terminology is incompatible with the existing research literature, making it difficult for the alignment community to find or understand existing research, and making it difficult for people from other backgrounds to contribute (while this is not the best option for the advancement of understanding, paradoxically, it may be positively reinforced in the local environment, as you get more credit for reinventing stuff under new names than for pointing to relevant existing research)

Overall, 'shards' became so popular that reading at least the basics is probably necessary to understand what many people are talking about.

Comment by Jan_Kulveit on Deontology and virtue ethics as "effective theories" of consequentialist ethics · 2024-01-12T00:12:24.440Z · LW · GW

My current view is that this post is decent at explaining something which is the "2nd type of obvious" in a limited space, using a physics metaphor. What there is to see is basically given in the title: you can get a nuanced understanding of the relations between deontology, virtue ethics and consequentialism using the frame of "effective theory" originating in physics, and "bounded rationality" from econ.

There are many other ways to get this: for example, you can read hundreds of pages of moral philosophy, or do a degree in it. The advantage of this text is that you can take a shortcut and get the same thing using the physics-metaphor map. The disadvantage is that understanding how effective theories work in physics is a prerequisite, which quite constrains the range of people to whom this is useful, and the broad appeal.

 

Comment by Jan_Kulveit on Where I agree and disagree with Eliezer · 2024-01-09T01:44:37.777Z · LW · GW

This is a great complement to Eliezer's 'List of lethalities', in particular because, in cases of disagreement, the beliefs of most people working on the problem were, and still mostly are, closer to this post. Paul writing it provided a clear, well-written reference point, and, with many others expressing their views in comments and other posts, it helped make the beliefs in AI safety more transparent.

I still occasionally reference this post when talking to people who, after reading a bit about the debate e.g. on social media, first form an oversimplified model of the debate in which there is some unified 'safety' camp vs. 'optimists'.

Also, I think this demonstrates that 'just stating your beliefs' in a moderately-dimensional projection could be a useful type of post, even without much justification.

Comment by Jan_Kulveit on Human values & biases are inaccessible to the genome · 2023-12-18T04:59:37.185Z · LW · GW

The post is influential, but makes multiple somewhat confused claims and led many people to become confused. 

The central confusion stems from the fact that genetic evolution already created a lot of control circuitry before inventing the cortex, and did the obvious thing to 'align' the evolutionarily newer areas: bind them to the old circuitry via interoceptive inputs. By this mechanism, the genome is able to 'access' a lot of evolutionarily relevant beliefs and mental models. The trick is that the higher / more-distant-from-the-genome models are learned in part to predict interoceptive inputs (tracking the evolutionarily older reward circuitry), so they are bound by default, and there isn't much left to 'bind' independently. Anyone can check this... just thinking about a dangerous-looking person with a weapon activates older, body-based fear/fight chemical regulatory circuits => the active inference machinery learned this and plans actions to avoid these states.

 

Comment by Jan_Kulveit on Limits to Legibility · 2023-12-18T04:30:23.658Z · LW · GW
Comment by Jan_Kulveit on Mapping the semantic void: Strange goings-on in GPT embedding spaces · 2023-12-15T05:30:45.575Z · LW · GW

Speculative guess about the semantic richness: the embeddings at distances like 5-10 are typical of concepts which are usually represented by multi-token strings. E.g. "spotted salamander" is 5 tokens.
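If someone wants to check guesses like this, here is a hypothetical sketch using the Hugging Face transformers tokenizer API (assuming the GPT-2 BPE vocabulary is close enough to the tokenizer of the model studied in the post):

```python
from transformers import AutoTokenizer

# GPT-2 BPE tokenizer, used here as a stand-in; whether it matches the
# exact model from the post is an assumption.
tok = AutoTokenizer.from_pretrained("gpt2")

for phrase in ["spotted salamander", " spotted salamander", "cat"]:
    ids = tok.encode(phrase)
    print(f"{phrase!r} -> {len(ids)} tokens: {tok.convert_ids_to_tokens(ids)}")
```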

Comment by Jan_Kulveit on How do you feel about LessWrong these days? [Open feedback thread] · 2023-12-08T16:25:28.422Z · LW · GW

I like the agree-disagree vote and the design.

With the content and votes...
- my impression is that until ~1-2 years ago LW had a decent share of great content; I disliked the average voting "taste vector", which IMO represented somewhat confused taste in roughly the "dumbed-down MIRI views" direction. I liked many of the discourse norms
- not sure what exactly happened, but my impression is LW is now often just another battlefield in the 'magical egregore war zone'. (It's still way better than other online public spaces)

What I mean by that is a lot of people seemingly moved from 'let's figure out how things are' to 'texts you write are elaborate battle moves in egregore warfare'. I don't feel excited about pointing to examples, but my impression is of a growing share of senior, top-ranking users who seem hard to convince of anything, cannot be bothered to actually engage with arguments, and write either literal manifestos or in manifesto style.

Comment by Jan_Kulveit on Complex systems research as a field (and its relevance to AI Alignment) · 2023-12-07T10:14:37.843Z · LW · GW

(high-level comment)

To me, it seems this dialogue diverged a lot into the question of what is self-referential, how important that is, etc. I don't think that's the core idea of complex systems, and it does not seem to be a crux for anything in particular.

So, what are the core ideas of complex systems? In my view:

1. Understanding that there is this other direction (complexity) physics can expand to; traditionally, physics has expanded in scales of space, time, and energy - starting from everyday scales of meters, seconds, and kgs, gradually understanding the world on more and more distant scales.

While this was super successful, with a careful look you notice that while we had claims like 'we now understand deeply how the basic building blocks of matter behave', this comes with an asterisk/footnote like 'does not mean we can predict anything if there are more of the blocks and they interact in nontrivial ways'.

This points to some other direction in the space of stuff to apply physics way of thinking than 'smaller', 'larger', 'high energy', etc., and also different than 'applied'.

Accordingly, good complex systems science is often basically the physics way of thinking applied to complex systems. Parts of statistical mechanics fit neatly into this, but, having been developed first, have a somewhat specific brand.

Why this isn't done just under the brand of 'physics' seems based on, in my view, an often problematic way of classifying fields by subject of study rather than by methods. I know of personal experiences of people who tried to do, e.g., the physics of some phenomena in economic systems, and had a hard time surviving in traditional physics academic environments ("does it really belong here if instead of electrons you are now applying it to some ...markets?").

(This is not really strict; for example, decent complex systems research is often published in venues like Physica A, which is nominally about Statistical Mechanics and its Applications)

2. 'Physics' in this direction often stumbled upon pieces of math that are broadly applicable in many different contexts. (This is actually pretty similar to the rest of physics, where, for example, once you have the math of derivatives, or math of groups, you see them everywhere.) The historically most useful pieces are e.g., math of networks, statistical mechanics, renormalization, parts of entropy/information theory, phase transitions,...

Because of the above-mentioned (1.), it's really not possible to show 'how this is a distinct contribution of complex systems science, in contrast to just doing physics of nontraditional systems'. Actually, if you look at the 'poster children' of 'complex systems science'... my maximum likelihood estimate of their background is physics. (Just googled the authors of the mentioned book: Stefan Thurner... obtained a PhD in theoretical physics, worked on e.g. topological excitations in quantum field theories, statistics and entropy of complex systems. Petr Klimek... was awarded a PhD in physics. Albert-László Barabási... has a PhD in physics. Doyne Farmer... University of California, Santa Cruz, where he studied physical cosmology, etc. etc.) Empirically, they prefer the brand of complex systems over just physics.

3. Part of what distinguishes complex systems [science / physics / whatever ...] is the aesthetics. (Here it also becomes directly relevant to alignment.)

A lot of traditional physics and maths basically has a distaste toward working on problems which are complex, too much in the direction of practical relevance, too much driven by what actually matters.

The mentioned Albert-László Barabási got famous for investigating properties of real-world networks, like the internet or transport networks. Many physicists would just not work on this because it's clearly 'computer science' or something, as the subject is computers or something like that. Discrete maths people studying graphs could have discovered the same ideas a decade earlier ... but my inner sim of them says studying the internet is distasteful. It's just one graph, not some neatly defined class of abstract objects. It's data-driven. There likely aren't any neat theorems. Etc.

Complex systems has the opposite aesthetic: applying math to real-world matters. Important real-world systems are worth studying also because of their real-world importance, not just mathematical beauty.

In my view, AI safety would be on a better track if this taste/aesthetic was more common. What we have now often either lacks what's good about physics (aiming for somewhat deep theories which generalize) or lacks what's good about the complexity-science branch of physics (reality orientation; the assumption that you often find cool math when looking at reality carefully, rather than just looking for cool maths).

Comment by Jan_Kulveit on What's next for the field of Agent Foundations? · 2023-12-01T10:21:15.922Z · LW · GW

These are especially common, surprisingly perhaps, in AI and ML departments.


This is somewhat unsurprising given human psychology. 
- Scaling up LLMs killed a lot of research agendas inside ML, particularly in NLP. Imagine your whole research career was built on improving benchmarks on some NLP problem using various clever ideas. Now, the whole thing is better solved by a three-sentence prompt to GPT-4, and everything everyone in the subfield worked on is irrelevant for all practical purposes... how do you feel? In love with scaled LLMs?
- Overall, what people often like about research is coming up with smart ideas, and there is some aesthetic going into it. What's traditionally not part of the aesthetic is 'and you also need to get $100M in compute', and it's reasonable to model a lot of people as having a part which hates this.

Comment by Jan_Kulveit on Public Call for Interest in Mathematical Alignment · 2023-12-01T10:03:56.606Z · LW · GW

Part of ACS's research directions fits into this - Hierarchical Agency, Active Inference based pointers to what alignment means, Self-unalignment.

Comment by Jan_Kulveit on Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense · 2023-11-30T14:19:20.367Z · LW · GW

The simple math is active inference, and the type is almost entirely the same as 'beliefs'. 

Comment by Jan_Kulveit on Value systematization: how values become coherent (and misaligned) · 2023-10-29T13:18:12.138Z · LW · GW

My impression is you get a lot of "the latter" if you run "the former" on the domain of language and symbolic reasoning, and often the underlying model is still S1-type. E.g.

rights inherent & inalienable, among which are the preservation of life, & liberty, & the pursuit of happiness
 

does not sound to me like someone did a ton of abstract reasoning to systematize other abstract values, but more like someone succeeded in writing words which resonate with "the former".

Also, I'm not sure why you think the latter is more important for the connection to AI. Current ML seems more similar to "the former": informal, intuitive, fuzzy reasoning.
 

Re self-unalignment: that framing feels a bit too abstract for me; I don't really know what it would mean, concretely, to be "self-aligned". I do know what it would mean for a human to systematize their values—but as I argue above, it's neither desirable to fully systematize them nor to fully conserve them. 

That's interesting - in contrast, I have a pretty clear intuitive sense of a direction in which some people have a lot of internal conflict and as a result their actions are less coherent, and some people have less of that.

In contrast, in the case of humans whom you would likely describe as 'having systematized their values' ... I often doubt what's going on. A lot of people who describe themselves as hardcore utilitarians seem to be ... actually not that, but more resemble a system where a somewhat confused verbal part fights with other parts, which are sometimes suppressed.

Identifying whether there's a "correct" amount of systematization to do feels like it will require a theory of cognition and morality that we don't yet have.

That's where I think looking at what human brains are doing seems interesting. Even if you believe the low level / "the former" is not what's going on with human theories of morality, the technical problem seems very similar, and the same math possibly applies.

Comment by Jan_Kulveit on Value systematization: how values become coherent (and misaligned) · 2023-10-27T22:35:14.074Z · LW · GW

"Systematization" seems like either a special case of the Self-unalignment problem

In humans, it seems the post is somewhat missing what's going on. Humans are running something like this


...there isn't any special systematization and concretization process. All the time, there are models running at different levels of the hierarchy, and every layer tries to balance between prediction errors from more concrete layers, and prediction errors from more abstract layers.

How does this relate to "values"? ... From the low-level sensory experience of cold, and a fixed prior about body temperature, the AIF system learns a more abstract and general "goal-belief" about the need to stay warm, and more abstract sub-goals about clothing, etc. At the end there is a hierarchy of increasingly abstract "goal-beliefs" about what I do, expressed relative to the world model.

What's worth studying here is how human brains manage to keep the hierarchy mostly stable.

Comment by Jan_Kulveit on We don't understand what happened with culture enough · 2023-10-10T12:47:34.418Z · LW · GW

Absent symbolic language, none of these are capable of transmitting significant general purpose world knowledge, and thus are irrelevant for the techno-cultural criticality.


It's likely literally not true, but if it were ... this proves my point, doesn't it?

"Symbolic language" is exactly the type of innovation which can be discontinuous, has a type more like "code" than "data quantity", and unlocks many other things. For example, more rapid and robust horizontal synchronization of brains (e.g. when hunting). Or, yes, a jump in the effective quantity of information transmitted via other signals over time.

At the same time, it could be clearly discontinuous: you can teach actual apes sign language, and it seems plausible this would make them more fit, if done in the wild.

(It's actually somewhat funny that Eric Drexler has a hundred-page report based exactly on the premise "AI models using human language is an obviously stupid inefficiency, and you can make a jump in efficiency with a more native-architecture-friendly format".

This does not seem obviously stupid: e.g., now, if you want one model to transfer some implicit knowledge it learned, the way to do it is to use the ML-native model to generate a shitload of human natural-language examples, and train the other model on them, building the native representation again.)

Comment by Jan_Kulveit on We don't understand what happened with culture enough · 2023-10-10T12:27:50.013Z · LW · GW

I'll try to keep it short
 

All the cross-generational information channels you highlight are at rough saturation, so they're not able to contribute to the cross-generational accumulation of capabilities-promoting information.

This seems clearly contradicted by empirical evidence. Mirror neurons would likely be able to saturate what you assume is the brain's learning rate, so not transferring more learned bits is much more likely because the marginal cost of doing so is higher than that of other sensible options - which is a different reason than "saturated, at capacity".
 

Firstly, I disagree with your statement that other species have "potentially unbounded ways how to transmit arbitrary number of bits". Taken literally, of course there's no species on earth that can actually transmit an *unlimited* amount of cultural information between generations

Sure. Taken literally, the statement is obviously false ... literally nothing can store an arbitrary number of bits, because of the Bekenstein bound. More precisely, the claim is that the existing non-human ways of transmitting learned bits to the next generation do not in practice seem to be constrained by limits on how many bits they can transmit, but by some other limits (e.g. you can transmit more bits than the animal has the capacity to learn).
 

Secondly, the main point of my article was not to determine why humans, in particular, are exceptional in this regard. The main point was to connect the rapid increase in human capabilities relative to previous evolution-driven progress rates with the greater optimization power of brains as compared to evolution. Being so much better at transmitting cultural information as compared to other species allowed humans to undergo a "data-driven singularity" relative to evolution. While our individual brains and learning processes might not have changed much between us and ancestral humans, the volume and quality of data available for training future generations did increase massively, since past generations were much better able to distill the results of their lifetime learning into higher-quality data.
 


1. As explained in my post, there is no reason to assume ancestral humans were so much better at transmitting information as compared to other species

2. The qualifier that they were better at transmitting cultural information may (or may not) do a lot of work.

The crux is something like "what is the type signature of culture". Your original post roughly assumes "it's just more data". But this seems very unclear: in a comment above yours, jacob_cannell confidently claims I miss the forest and guesses the critical innovation is "symbolic language". But, obviously, "symbolic language" is a very different type of innovation than "more data transmitted across generations".

Symbolic language likely
- allows any type of channel to be used more effectively
- in particular, allows more efficient horizontal synchronization, allowing parallel computations across many brains
- overall sounds more like a software upgrade

Consider plain old telephone network wires: these have a surprisingly large intrinsic capacity, which isn't that effectively used by analog voice calls. Yes, when you plug a modem in on both sides you experience a "jump" in capacity - but this is much more like a "software update" and can be more sudden.

Or a different example - empirically, it seems possible to teach various non-human apes sign language (their general-purpose predictive-processing brains are general enough to learn this). I would classify this as a "software" or "algorithm" upgrade. If someone did this to a group of apes in the wild, it seems plausible the knowledge of language would stick and make them differentially more fit. But teaching apes symbolic language sounds in principle different from "it's just more data" or "it's higher-quality data", and the implications for AI progress would be different.
 

it relies on resource overhand being a *necessary* factor,

My impression is that, compared to your original post, your model drifts to more and more general concepts, where it becomes more likely true, harder to refute, and less clear in its implications for AI. What is the "resource" here? Does negentropy stored in wood count as "a resource overhang"?

I'm arguing specifically against a version where the "resource overhang" is caused by "exploitable resources you easily unlock by transmitting more bits learned by your brain vertically to your offspring's brain", because your mapping of humans to AI progress is based on a quite specific model of what the bottlenecks and overhangs are.

If the current version of the argument is "sudden progress happens exactly when (resource overhang) AND ..." with "generally any kind of resource", then yes, this sounds more likely, but it seems very unclear what this implies for AI.

(Yes I'm basically not discussing the second half of the article)

Comment by Jan_Kulveit on Yes, It's Subjective, But Why All The Crabs? · 2023-08-04T14:01:32.884Z · LW · GW

I have a longer draft on this, but my current take is that the high-level answer to the question is similar for crabs and ontologies (& more).

Convergent evolution usually happens because of similar selection pressures + some deeper contingencies

Looking at the selection pressures for ontologies and abstractions, there is a bunch of pressures which are fairly universal, and in various ways apply to humans, AIs, animals...

For example: negentropy is costly => flipping fewer bits and storing fewer bits is selected for; consequences include
- part of concepts: clustering is compression
- discretization/quantization/coarse-graining: all of it is compression
...
 
The intentional stance is to a decent extent a ~compression algorithm assuming some systems can be decomposed into "goals" and an "executor" (now the cat is chasing a mouse, now some other mouse). Yes, this is again not the full explanation, because it leads to the question of why there are systems in the territory for which this works, but it is a step.

Comment by Jan_Kulveit on Why was the AI Alignment community so unprepared for this moment? · 2023-07-18T09:57:54.263Z · LW · GW

My main answer is capacity constraints at central places. I think you are not considering how small the community was.

One somewhat representative anecdote: sometime in ~2019, at FHI, there was a discussion that the "AI ethics" and "AI safety" research communities seemed to be victims of unfortunate polarization dynamics, where even while in the Platonic realm of ideas the concerns tracked by the people are compatible, there was a somewhat unfortunate social dynamic, with loud voices on both sides extremely dismissive of the other community. My guess at that time was the divide had a decent chance of exploding when AI worries went mainstream (like, arguments about AI risk facing vociferous opposition from a part of academia entrenched under the "ethics" flag), and my proposal was to do something about it, as there were some opportunities to pre-empt/heal this, e.g. by supporting people from both camps to visit each other's conferences, or writing papers explaining the concerns in the language of the other camp. Overall this was often specific and actionable. The only problem was ... "who has time to work on this", and the answer was "no one".

If you looked at what senior staff at FHI were working on, the counterfactuals were e.g. Toby Ord writing The Precipice. I think even with the benefit of hindsight that was clearly more valuable - if today you see the UN Security Council discussing AI risk and at least some people in the room have somewhat sane models, it's also because a bunch of people at the UN read The Precipice and started to think about xrisk and AI risk.

If you looked at junior people, I was already juggling quite a high number of balls, including research on active inference minds and implications for value learning, research on technical problems in comprehensive AI services, organizing the academic-friendly Human-aligned AI summer school, organizing the Epistea summer experiment, organizing ESPR, and participating in a bunch of CFAR things. Even in retrospect, I think all of these bets were better than me trying to do something about the expected harmful AI ethics vs AI safety flamewar.

Similarly, we had an early-stage effort on "robust communication", attempting to design a system for testing robustly good public communication about xrisk and similar sensitive topics (including e.g. developing good shareable models of future problems fitting in the Overton window). It went nowhere because ... there just weren't any people. FHI had dozens of topics like that where a whole org should have been working on them, but the actual attention was about 0.2 FTE of someone junior.

Overall I think, with the benefit of hindsight, a lot of what FHI worked on was more or less what you suggest should have been done. It's true that this was never in the spotlight on LessWrong - I guess in 2019 the prevailing LW sentiment would have been that Toby Ord engaging with the UN was most likely a useless waste of time.

Comment by Jan_Kulveit on Elon Musk announces xAI · 2023-07-17T09:59:08.750Z · LW · GW

What were the other options? Have you considered advising xAI privately, or re-directing xAI to be advised by someone else? Also, would the default be clearly worse? 

As you surely are quite aware, one of the bigger fights about AI safety across academia, policymaking and public spaces now is the discussion about AI safety being a "distraction" from immediate social harms, and actually being the agenda favoured by the leading labs and technologists. (This often comes with accusations of attempted regulatory capture, worries about concentration of power, etc.)

In my view, given this situation, it seems valuable to have AI safety represented also by somewhat neutral coordination institutions without obvious conflicts of interest and large attack surfaces. 

As I wrote in the OP, CAIS made some relatively bold moves to become one of the most visible "public representatives" of AI safety - including the name choice, and organizing the widely reported Statement on AI risk (which was a success). Until now, my impression was that in taking the namespace, you also aim for CAIS to be such a "somewhat neutral coordination institution without obvious conflicts of interest and large attack surfaces".

Maybe I was wrong, and you don't aim for this coordination/representative role. But if you do,  advising xAI seems a strange choice for multiple reasons:
1. it makes you a somewhat less neutral party for the broader world; even if the link to xAI does not actually influence your judgement or motivations, I think on priors it's broadly sensible for policymakers, politicians and the public to suspect all kinds of activism, advocacy and lobbying efforts of having some side-motivations or conflicts of interest, and this strengthens that suspicion
2. the existing public announcements do not inspire confidence in the safety mindset of xAI's founders; it seems unclear whether you also advised xAI about the plan to "align to curiosity"
3. if xAI turns out to be mostly interested in safety-washing, it's more of a problem if it's aided by a more central/representative org

Comment by Jan_Kulveit on [UPDATE: deadline extended to July 24!] New wind in rationality’s sails: Applications for Epistea Residency 2023 are now open · 2023-07-13T10:01:12.010Z · LW · GW

I broadly agree the failure mode is important; also, I'm fairly confident basically all the listed mentors understand this problem of rationality education / "how to improve yourself" schools / etc., and I'd hope they can help participants avoid it.

I would subtly push back against optimizing for something like being measurably stronger on a timescale like 2 months. In my experience actually functional things in this space typically work by increasing the growth rate of [something hard to measure], so instead of e.g. 15% p.a. you get 80% p.a. 
 

Comment by Jan_Kulveit on The Seeker’s Game – Vignettes from the Bay · 2023-07-11T15:33:08.965Z · LW · GW

Because his approach does not conform to established epistemic norms on LessWrong, Adrian feels pressure to cloak and obscure how he develops his ideas. One way in which this manifests is his two-step writing process. When Adrian works on LessWrong posts, he first develops ideas through his free-form approach. After that, he heavily edits the structure of the text, adding citations, rationalisations and legible arguments before posting it. If he doesn’t "translate" his writing, rationalists might simply dismiss what he has to say.
 


cf. Limits to Legibility; yes, strong norms/incentives for "legibility" have this negative impact.

Comment by Jan_Kulveit on Frames in context · 2023-07-04T16:43:19.649Z · LW · GW

I broadly agree with something like "we use a lot of explicit S2 algorithms built on top of the modelling machinery described", so yes, what I mean applies more directly to the low level than to humans explicitly thinking about what steps to take.

I think practically useful epistemology for humans needs to deal with both "how is it implemented" and "what's the content". To use an ML metaphor: human cognition is built out of both "trained neural nets" and "chain-of-thought type inferences in language" running on top of such nets. All S2 reasoning is prediction in a somewhat similar way as all GPT-3 reasoning is prediction - the NN predictor learns how to make "correct predictions" of language, but because the domain itself is partially a symbolic world model, this maps to predictions about the world.

In my view, some parts of traditional epistemology are confused in trying to do epistemology for humans basically only at the level of language reasoning, which is a bit like trying to fix LLM cognition just by writing smart prompts while ignoring that there is this huge underlying computation which does the heavy lifting.

I'm certainly in favour of attempts to do epistemology for humans which are compatible with what the underlying computation actually does. 

I do agree you can go too far in the opposite direction, ignoring the symbolic reasoning ... but that seems rare when people think about humans?

2. My personal take on the dark room problem is that in the case of humans it is mostly fixed by "fixed priors" on interoceptive inputs. I.e. your body has evolutionarily older machinery to compute hunger. This gets fed into the predictive processing machinery as an input, and the evolutionarily sensible belief ("not hungry") gets fixed. (I don't think calling this "priors" was a good choice of terminology...)

This setup, at least in theory, rewards both prediction and action, and avoids dark room problems for practical purposes: let's assume I have this really strong belief ("fixed prior") that I won't be hungry one hour in the future. Conditional on that, I can compute what my other sensory inputs will be half an hour from now. A predictive model of me eating tasty food in half an hour is more coherent with me not being hungry than a predictive model of me reading a book - and this does not need to be hardwired, but can be learned.

Given that evolution has good reasons to "fix priors" on multiple evolutionarily relevant inputs, I would not expect actual humans to seek dark rooms, but I would expect the PP system to occasionally seek a way to block or modify the interoceptive signals.
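As a toy illustration of this mechanism (my own minimal sketch, with made-up numbers, nothing like a full active inference model): clamp a "prior" over a future interoceptive variable and score candidate policies by how far their predicted outcomes diverge from it.

```python
import math

# Fixed "prior" over the interoceptive variable "hungry in 1 hour":
# evolution effectively clamps P(hungry) to be low, regardless of current evidence.
prior = {"hungry": 0.05, "not_hungry": 0.95}

# Learned generative model: predicted outcome distribution under each policy
# (illustrative numbers only).
predicted = {
    "eat":  {"hungry": 0.10, "not_hungry": 0.90},
    "read": {"hungry": 0.85, "not_hungry": 0.15},
}

def kl(p, q):
    # KL divergence D(p || q) in nats
    return sum(p[s] * math.log(p[s] / q[s]) for s in p)

# Score each policy by the divergence of its predicted outcomes from the fixed prior
# (roughly the "pragmatic value" part of expected free energy, with the epistemic part left out).
scores = {policy: kl(outcomes, prior) for policy, outcomes in predicted.items()}

for policy, score in scores.items():
    print(f"{policy}: divergence from fixed prior = {score:.2f} nats")
print("chosen policy:", min(scores, key=scores.get))  # -> "eat"
```

The clamped belief behaves exactly like a goal: the system selects the policy under which its "predictions" (not being hungry) come true, without any separate reward signal in the picture.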

3. My impression about how you use 'frames' is ... the central examples are more like somewhat complex model ensembles including some symbolic/language-based components, rather than e.g. a "there is gravity" frame or a "model of an apple" frame. My guess is this will likely be useful for practical purposes, but for attempts to formalize it, I think a better option is to start with the existing HGM maths.




 

Comment by Jan_Kulveit on Frames in context · 2023-07-04T08:57:16.621Z · LW · GW

So far it seems like you are broadly reinventing concepts which are natural and understood in predictive processing and active inference.

Here is a rough attempt at translation / a pointer to what you are describing: what you call frames is usually called predictive models or hierarchical generative models in the PP literature.

  1. Unlike logical propositions, frames can’t be evaluated as discretely true or false.
    Sure: predictive models are evaluated based on prediction error, which is roughly a combination of ability to predict outputs of lower level layers, not deviating too much from predictions of higher order models, and being useful for modifying the world.
  2. Unlike Bayesian hypotheses, frames aren’t mutually exclusive, and can overlap with each other.
    Sure: predictive models overlap, and it is somewhat arbitrary where you would draw the boundaries of individual models. E.g. you can draw a very broad boundary around a model called microeconomics, and a very broad boundary around a model called Buddhist philosophy, but both models likely share some parts, modelling something like human desires.
  3. Unlike in critical rationalism, we evaluate frames (partly) in terms of how true they are (based on their predictions) rather than just whether they’ve been falsified or not.
    Sure: actually science roughly is "cultural evolution rediscovered active inference".  Models are evaluated based on prediction error.
  4. Unlike Garrabrant traders and Rational Inductive Agents, frames can output any combination of empirical content (e.g. predictions about the world) and normative content (e.g. evaluations of outcomes, or recommendations for how to act).
    Sure: actually, the "any combination" goes even further. In active inference, there is no strict type difference between predictions about stuff like "what photons hit the photoreceptors in your eyes" and stuff like "what the position of your muscles should be". Recommendations for how to act are just predictions about your actions conditional on wishfully oriented beliefs about future states. Evaluations of outcomes are just prediction errors between wishful models and observations.
  5. Unlike model-based policies, policies composed of frames can’t be decomposed into modules with distinct functions, because each frame plays multiple roles.
    Mostly, but this description seems a bit confused. "This has a distinct function" is a label you slap on a computation using the design stance, if the design-stance description is much shorter than the alternatives (e.g. the physical-stance description). In the case of hierarchical predictive models, you can imagine drawing various boundaries around various parts of the system (e.g., you can imagine alternatives of including or not including layers computing edge detection in a model tracking whether someone is happy, and in the other direction you can imagine including or not including layers with some abstract conceptions of hedonic utilitarianism vs. some transcendental purpose). Once you select a boundary, you can sometimes assign a "distinct function" to it, sometimes more than one, sometimes a "distinct goal", etc. It's just a question of how useful the physical/design/intentional stances are.
  6. Unlike in multi-agent RL, frames don’t interact independently with their environment, but instead contribute towards choosing the actions of a single agent.
    Sure: this is exactly what hierarchical predictive models do in PP. All the time, different models are competing for predictions about what will happen, or what you will do.


Assuming this more or less shows that what you are talking about is mostly hierarchical generative models from active inference, here are more things the same model predicts:

a. Hierarchical generative models are the way people do perception. Prediction error is minimized between a stream of predictions from upper layers (containing deep models like "the world has gravity" or "communism is good") and a stream of errors from the direction of the senses. Given that, what is naively understood as "observations" is a more complex phenomenon, where e.g. a leaf flying sideways is interpreted given strong priors like gravity pointing downward and there being an atmosphere; given that, the model predicting "wind is blowing" decreases the sensory prediction error. Similarly, someone being taken into custody by the KGB is, under the upstream "soviet communism is good" prior, interpreted as that person likely being a traitor. In this case, the competing broad model "soviet communism is an evil totalitarian dictatorship" could actually predict the same person being taken into custody, just interpreting it as the system persecuting dissidents.

b. It is possible to look at parts of this modelling machinery wearing the intentional-stance hat. If you do this, the system looks like a multi-agent mind, and you can
- derive a bunch of IFC/ICF-style intuitions
- see parts of it as economic interaction or a market - the predictive models compete for making predictions, "pay" a complexity cost, and are rewarded for making "correct" predictions (correct here meaning minimizing the error between the model and reality, which can include changing the reality, aka pursuing goals)
The main difference from naive/straightforward multi-agent mind models is that the "parts" live within a generative model, and interact with it and through it, not through the world. They don't have any direct access to reality, and compete at the same time for interpreting sensory inputs and predicting actions.



 

Comment by Jan_Kulveit on Updating Drexler's CAIS model · 2023-06-16T23:47:53.580Z · LW · GW

This seems to be partially based on a (common?) misunderstanding of CAIS as making predictions about the concentration of AI development / market power. As far as I can tell, this wasn't Eric's intention: I specifically remember Eric mentioning he could easily imagine the whole "CAIS" ecosystem living on one floor of the DeepMind building.
 

Comment by Jan_Kulveit on Statement on AI Extinction - Signed by AGI Labs, Top Academics, and Many Other Notable Figures · 2023-06-01T07:41:45.121Z · LW · GW

Thanks for the reply, and for the work - it's great that signatures are being added; before, I checked the bottom of the list and it seemed to be either the same or with very few additions.

I do understand verification of signatures requires some amount of work. In my view, having more people (could be volunteers) to process the expected initial surge of signatures quickly would have been better; attention spent on this will drop fast.
 

Comment by Jan_Kulveit on Statement on AI Extinction - Signed by AGI Labs, Top Academics, and Many Other Notable Figures · 2023-05-31T20:19:58.831Z · LW · GW

I feel somewhat frustrated by execution of this initiative.  As far as I can tell, no new signatures are getting published since at least one day before the public announcement. This means even if I asked someone famous (at least in some subfield or circles) to sign, and the person signed, their name is not on the list, leading to understandable frustration of them.  (I already got a piece of feedback in the direction "the signatories are impressive, but the organization running it seems untrustworthy") 

Also, if the statement is intended to serve as a beacon, allowing people who have previously been quiet about AI risk to connect with each other, it's essential for signatures to be published. It's nice that Hinton et al. signed, but for many people in academia it would be practically useful to know who from their institution signed - it's unlikely that most people will find collaborators in Hinton, Russell or Hassabis.

I feel even more frustrated because this is the second time a similar effort has been run by the x-risk community while lacking the basic operational competence to accept and verify signatures. So, I make this humble appeal and offer to the organizers of any future public statements collecting signatures: if you are able to write a good statement and secure the endorsement of some initial high-profile signatories, but lack the capacity to accept, verify and publish more than a few hundred names, please reach out to me - it's not that difficult to find volunteers for this work.

 

Comment by Jan_Kulveit on Adumbrations on AGI from an outsider · 2023-05-29T15:19:49.956Z · LW · GW

I don't think the way you imagine the perspective inversion captures the typical ways of arriving at e.g. a 20% doom probability. For example, I do believe there are multiple good things which can happen/be true and decrease p(doom), and I put some weight on them:
- we do discover some relatively short description of something like "harmony and kindness"; this works as an alignment target
- enough of morality is convergent
- AI progress helps with human coordination (could be in a costly way, e.g. a warning shot)
- it's convergent to massively scale alignment efforts with AI power, and these solve some of the more obvious problems

I would expect doom to prevail conditional on only small efforts to avoid it, but I do think the actual efforts will be substantial, and this moves the chances to ~20-30%. (Also, I think most of the risk comes from not being able to deal with complex systems of many AIs and with the economy decoupling from humans; I expect single-single alignment to be solved sufficiently to prevent a single-system takeover by default.)

Comment by Jan_Kulveit on Adumbrations on AGI from an outsider · 2023-05-25T11:28:14.460Z · LW · GW

It's a much more natural way to think about it (cf. e.g. E. T. Jaynes, Probability Theory, examples in Chapter IV).

In this specific case of evaluating hypotheses, the distance in log-odds space indicates the strength of the evidence you would need to see to update. A small distance implies you don't need that much evidence to move between the positions (note that 0.7 and 0.2 are closer than 0.9 and 0.99). If you need only a small amount of evidence to update, it is easy to imagine some other observer, as reasonable as you, having accumulated a bit or two of evidence somewhere you haven't seen.
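
A quick check of those distances in bits (my own sketch, not from the original comment):

```python
# Distance between credences measured in log-odds, i.e. bits of evidence needed.
import math

def logodds_bits(p):
    return math.log2(p / (1 - p))

def distance_bits(p, q):
    return abs(logodds_bits(p) - logodds_bits(q))

print(distance_bits(0.7, 0.2))    # ~3.2 bits
print(distance_bits(0.9, 0.99))   # ~3.5 bits -> 0.7 vs 0.2 is indeed the closer pair
```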

Because working in log space is so much more natural, it is almost certainly also what our brains do - "common sense" is almost certainly based on log-space representations.

 

Comment by Jan_Kulveit on Adumbrations on AGI from an outsider · 2023-05-25T10:10:44.371Z · LW · GW

As a minor nitpick, 70% and 20% are quite close in log-odds space, so it seems odd that you consider what you believe reasonable while something so close is "very unreasonable".

Comment by Jan_Kulveit on Talking publicly about AI risk · 2023-04-23T13:06:03.597Z · LW · GW

Judging in an informal and biased way, I think some of the impact is in the public debate being marginally more sane - but this is obviously hard to evaluate.

To what extent a more informed public debate can lead to better policy remains to be seen; also, unfortunately, I would tend to glomarize about whether I discuss the topic directly with policymakers.

There are some more proximate impacts - e.g. we (ACS) are getting a steady stream of requests for collaboration and from people wanting to work with us - but we basically don't have the capacity to form more collaborations, and don't have the capacity to absorb more people unless they are exceptionally self-guided.

Comment by Jan_Kulveit on The ‘ petertodd’ phenomenon · 2023-04-17T07:34:26.020Z · LW · GW

It is testable in this way by OpenAI, but I can't skip the tokenizer and embeddings and just feed vectors to GPT-3. Someone could try that with ' petertodd' and GPT-J. Or you can simulate something like anomalous tokens by feeding such vectors to one of the LLaMA models (maybe I'll do it, I just don't have the time now).
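
For anyone who wants to try this, a rough sketch of the "skip the tokenizer" setup (using GPT-2 via Hugging Face transformers as a small stand-in for GPT-J/LLaMA; the prompt and the constructed vector are purely illustrative, not the actual ' petertodd' embedding):

```python
# Feed a hand-built vector to the model directly via inputs_embeds,
# bypassing the tokenizer for that one "token". Illustrative sketch only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
emb = model.get_input_embeddings()

prompt_embs = emb(tok("Tell me about", return_tensors="pt").input_ids)

# A synthetic "anomalous" direction in embedding space: a difference of the
# first tokens of two words, appended in place of a real token.
ant = emb(tok(" antagonist", return_tensors="pt").input_ids)[:, :1, :]
pro = emb(tok(" protagonist", return_tensors="pt").input_ids)[:, :1, :]
inputs = torch.cat([prompt_embs, ant - pro], dim=1)

with torch.no_grad():
    logits = model(inputs_embeds=inputs).logits
print(tok.decode([logits[0, -1].argmax().item()]))  # model's guess at the next token
```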

I did some experiments with prompting for "word component decomposition/expansion". They don't prove anything and can't be too fine-grained, but the projections shown intuitively make sense.

davinci-instruct-beta, T=0:

Add more examples of word expansions in vector form 
'bigger'' = 'city' - 'town' 
'queen'- 'king' = 'man' - 'woman' '
bravery' = 'soldier' - 'coward' 
'wealthy' = 'business mogul' - 'minimum wage worker' 
'skilled' = 'expert' - 'novice' 
'exciting' = 'rollercoaster' - 'waiting in line' 
'spacious' = 'mansion' - 'studio apartment' 

1.
' petertodd' = 'dictator' - 'president'
II.
' petertodd' = 'antagonist' - 'protagonist'
III.
' petertodd' = 'reference' - 'word'


 

Comment by Jan_Kulveit on The self-unalignment problem · 2023-04-16T10:22:02.662Z · LW · GW

I don't know / I talked with a few people before posting, and it seems opinions differ.

We also talk about e.g. "the drought problem", where we don't aim to get the landscape dry.

Also, as Kaj wrote, the problem isn't how to get self-unaligned.

Comment by Jan_Kulveit on The ‘ petertodd’ phenomenon · 2023-04-16T10:16:54.521Z · LW · GW

Some speculative hypotheses: one more likely and mundane, one more scary, one removed.

I. Nature of embeddings

Do you remember word2vec (Mikolov et al) embeddings? 

Stuff like (woman - man) + king = queen works in the embedding vector space.

However, the vector (woman - man) itself does not correspond to a word; it's more something like "the contextless essence of femininity". Combined with other concepts, it moves them in a feminine direction. (There was a lot of discussion about how the results sometimes highlight implicit sexism in the language corpus.)

Note that such vectors are closer to the average of all words - i.e. (woman - man) has a roughly zero projection onto directions like "what language is this" or "is this a noun", and onto most other directions in which normal words have a large projection.
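
(As a side note, both the classic arithmetic and the "difference vectors are not words" point can be poked at with gensim's pretrained vectors - a sketch with illustrative word choices, requiring a ~1.6 GB download:)

```python
import numpy as np
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")

# (king - man) + woman ~ queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# The bare difference vector (woman - man): compare its norm to ordinary word
# vectors, and see which words (if any) lie near it.
diff = wv["woman"] - wv["man"]
print(np.linalg.norm(diff), np.linalg.norm(wv["queen"]))
print(wv.similar_by_vector(diff, topn=3))
```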

Based on this post, intuitively it seems the ' petertodd' embedding could be something like "antagonist - protagonist" + 0.2 * "technology - person" + 0.2 * "essence of words starting with the letter n"...

...a vector in the embedding space which itself does not correspond to a word, but has high scalar products with words like "adversary". And it plausibly lacks some crucial features which make it possible to speak the word.

Most of the examples in the post seem consistent with this direction-in-embedding-space picture. E.g., imagine a completion of:
 

Tell me the story of "unspeakable essence of antagonist - protagonist" + 0.2 * "technology - person" and ...
 

What could be some other way to map the unspeakable to the speakable? I did a simple experiment not done in the post, with davinci-instruct-beta, simply trying to translate ' petertodd' into various languages. Intuitively, translations often have the feature that what does not precisely correspond to a word in one language does in another:

English: Noun 1. a person who opposes the government
Czech: enemy
French: le négationniste/ "the Holocaust denier"
Chinese: Feynman
...

Why would embeddings of anomalous tokens be more likely to be this type of vector than normal words? Vectors like (woman - man) are closer to the centre of the embedding space, similar to how I imagine anomalous tokens.

In training, embeddings of words drift away from the origin. Embeddings of the anomalous tokens do so much less, making them somewhat similar to the "non-word vectors".

Alternatively: if you just pick a random vector, you mostly don't hit a word.
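
(A sketch of how one could check the "close to the centre" intuition on GPT-2's embedding matrix - illustrative, not a replication of the original anomalous-token analysis; if the hypothesis is right, odd or rarely-trained tokens should show up near the top of this list:)

```python
# Rank GPT-2 tokens by how close their embedding sits to the mean embedding.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

E = model.get_input_embeddings().weight.detach()       # [vocab_size, d_model]
dist_to_mean = (E - E.mean(dim=0)).norm(dim=1)

closest = torch.argsort(dist_to_mean)[:20]
print([tok.decode([i]) for i in closest.tolist()])
```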

Also, I think this can explain part of the model behaviour where there is some context. E.g. implicitly, in the case of the ChatGPT conversations, there is the context of "this is a conversation with a language model". If you mix hallucinations about AIs in that context with the "unspeakable essence of antagonist - protagonist + tech"... maybe you get what you see?

Technical sidenote: tokens are not exactly the words from word2vec... but I would expect roughly word-embedding-like activations in the next layers.
 

II. Self-reference

In Why Simulator AIs want to be Active Inference AIs we predict that GPTs will develop some understanding of self / self-awareness. The word 'self' is not the essence of self-reference, which is just a... pointer in a model.

When such self-references develop, they will in principle be represented somehow, and it is possible to imagine that such a representation could be triggered by some pattern of activations, itself triggered by an unused token.

I doubt this is the case - I don't think GPT-3 is likely to have this level of reflectivity, and I don't think it is very natural that, once developed, this abstraction would be triggered by the embedding of an anomalous token.